In an AI-driven software industry that places supreme importance on consistent labeling, oddly enough vendors and analysts can’t seem to settle on a consistent name for the “stuff” the software manages.
This is especially true in the intelligent document processing (IDP) market. I met with three leading IDP companies in London last week, and each one used a variety of labels to describe this stuff: documents, content, unstructured data, semi-structured data, files, and even that old standby, records.
The vendor marketing veers all over the place. Sometimes in the same slide deck, web page or white paper, a document morphs into content, and on the very next page it has transformed miraculously into unstructured data, all without a glossary to help the uninitiated keep score.
This is an industry jargon problem that we created over more than 30 years of managing our stuff on computers. The vendors are not the only ones: the analysts are just as much to blame. (Please don’t be too harsh on us; we must have taxonomies to order our world.) Unfortunately, the shifting jargon only confuses the very people who matter most in the end: paying customers.
I suspect this marketing word salad is a consequence of the massive FOMO (fear of missing out) infecting the C-suite. If you could be a fly on the wall of a software company’s exec team meeting, the fear might sound something like this:
- Too Big to Fail Analyst is now calling it “content”; change all the messaging!
- If we only say “documents”, will we miss being shortlisted for RFPs asking for a “content” solution?
- We’re getting beat by a competitor whose website calls it “unstructured data”!
- Are WhatsApp messages documents, content, or files? Let’s use all three so we don’t miss a sale!
I’m pleased to announce it’s OK, you can stop trying to square the circle and stop stressing over which label is best. Because none of this will matter any more.
My 95% confidence score is informed by two assumptions.
First, GPT and other foundational LLMs do not care what generic label we use for the “stuff” we’ve given them to understand and analyze. An AI model does not distinguish between a “structured”, “semi-structured”, or “unstructured” document/content/data/file; that’s a human way to categorize our stuff. Whatever you send it, AI breaks it all down into machine-digestible components of text, layout, image, page count, etc.
Second, I think users are firmly on the side of AI, not vendors and analysts. Users don’t care what WE call it.
I love customer case studies because they transcend this fake labeling. On Deep Analysis briefings, I am infamous for my impatience: enough with the content-semi-document-structured pitch, just get to the damn document! Only then do we learn what the AI software is actually working with: a contract; a purchase order; a mortgage loan file; an MRI; an invoice; a bank statement; an MSA; on and on.
This is when it all becomes real. The end user knows exactly what stuff they are struggling to automate, and they do not call that “unstructured data”.
I know my little rant won’t change the industry’s buzzword problem any time soon, or even change my own tendency to flip-flop between those labels. And I’m not here to advocate for a unity buzzword. I just feel better now that I know our buzzword labels do not matter to the AI or the end user.
So to all my industry friends, pay special attention to what your customers are calling it. For they care only about one thing: can you please automate my (insert the actual name of the thing here)?!?