Documents, content, files, records, semi-structured or unstructured data: do labels really matter anymore?


In an AI-driven software industry that places supreme importance on consistent labeling, oddly enough, vendors and analysts can’t seem to settle on a consistent name for the “stuff” the software manages.

This is especially true in the intelligent document processing (IDP) market. I met with three leading IDP companies in London last week, and each one used a variety of labels to describe this stuff: documents, content, unstructured data, semi-structured data, files, and even that old standby, records.

The vendor marketing veers all over the place. Sometimes in the same slide, web page or white paper, a document morphs into content, and on the very next page it has transformed miraculously into unstructured data, with no glossary to help the uninitiated keep score.

This is an industry jargon problem that we created over 30+ years of managing our stuff on computers. The vendors are not the only ones: the analysts are just as much to blame. (Please don’t be too harsh on us; we must have taxonomies to order our world.) Unfortunately, the shifting jargon only confuses the very people who matter most in the end: paying customers.

FOMO?

I suspect this marketing word salad is a consequence of the massive FOMO (fear of missing out) infecting the C-suite. If you could be a fly on the wall of a software company’s exec team meeting, the fear might sound something like this:

  • Too Big to Fail Analyst is now calling it “content”; change all the messaging!
  • If we only say “documents”, will we miss being shortlisted for RFPs asking for a “content” solution?
  • We’re getting beat by a competitor whose website calls it “unstructured data”!
  • Are WhatsApp messages documents, content, or files? Let’s use all three so we don’t miss a sale!

Good news

I’m pleased to announce it’s OK: you can stop trying to square the circle and stop stressing over which label is best, because none of this will matter anymore.

My 95% confidence score is informed by two assumptions.

First, GPT and other foundational LLMs do not care what generic label we use for the “stuff” we’ve given them to understand and analyze. An AI model does not distinguish between a “structured”, “semi-structured”, or “unstructured” document/content/data/file; that’s a human way to categorize our stuff. Whatever you send it, the AI breaks it all down into machine-digestible components: text, layout, image, page count, etc.

Second, I think users are firmly on the side of AI, not vendors and analysts. Users don’t care what WE call it.

I love customer case studies because they transcend this fake labeling. On Deep Analysis briefings, I am infamous for my impatience: enough with the content-semi-document-structured pitch, just get to the damn document! Only then do we learn what the AI software is actually working with: a contract; a purchase order; a mortgage loan file; an MRI; an invoice; a bank statement; an MSA; on and on.

This is when it all becomes real. The end user knows exactly what stuff they are struggling to automate, and they do not call that “unstructured data”.

I know my little rant won’t change the industry’s buzzword problem any time soon, or even change my own tendency to flip-flop between those labels. And I’m not here to advocate for a unifying buzzword. I just feel better now that I know our buzzword labels do not matter to the AI or the end user.

So to all my industry friends, pay special attention to what your customers are calling it. For they care only about one thing: can you please automate my (insert the actual name of the thing here)?!?

4 thoughts on “Documents, content, files, records, semi-structured or unstructured data: do labels really matter anymore?”

  1. Hey Dan

    While I understand and agree with your points to a large extent, we do have to remember that in the end we have to communicate concepts to the people that buy stuff, to process their stuff. We can call everything stuff, but that might not be helpful (or it might!). We have dodged around this topic for 25 years (or more?), or at least that is how it feels to me – Enterprise Content Management versus Electronic Document & Records Management anyone… “well actually EDRM is a sub-set of the functionality of ECM don’t you know!”

    I used to have a deck on KM that referenced an academic paper that had 250 definitions of ‘knowledge management’. In the end I think it boils down to this – what does the organization call its stuff? If they call it content, then we call it content. If they call it unstructured data, then so do we. If they call their stuff ‘knowledge assets’ then that is what we shall call it too….. as consultants, or vendors we have to listen to the user / prospect / customer and use the language they are comfortable with. If they use ISO standard based definitions, then who are we to tell them to do it differently, we just go with the flow…. ???

    • Thanks for the thoughtful comments, Jed. And for supporting my conclusion: “pay special attention to what your customers are calling it. For they care only about one thing: can you please automate my (insert the actual name of the thing here)?!?”

  2. I love this article for a number of reasons. While at Parascript, we rarely had a prospective customer ask if we could process structured, semi-structured, or unstructured documents – it was always within context of a specific business process which indicated specific documents.

    Starting with information taxonomies means talking about a solution and not the customer problem.

    • Thanks Greg, you and I have talked about the jargon problem more than a few times in the past!
