At Deep Analysis we are closely monitoring the aftermath of the ChatGPT hype event of November 2022. ChatGPT has catapulted previously obscure AI technologies such as large language models (LLMs) and generative AI (GenAI) models into the mainstream of enterprise software. We tirelessly watch the smoke cloud of hype, hoping to spot the emergence of innovation that will bring real benefits to users. It’s a bit like entering the Wizard of Oz’s lair and fighting through the smoke cloud to discover what’s really behind the curtain. This is not easy; daily we are inundated with product news touting generative AI features.
I cover Hyperscience in our IDP research practice. I wrote the Deep Analysis vendor vignette in March, and in my May 4, 2023 analyst note I cited Hyperscience for its prudent, cautious approach to generative AI. So naturally I was curious to read its recent blog post titled Using Out-Of-The-Box Generative Document AI Models. I assumed the story would be similar to GenAI product announcements from competitors. I was wrong.
A sentence in the blog literally stopped me in my mouse tracks: “Building on the research titled OCR-free Document Understanding with Donut, our ML team began researching No-OCR Document Understanding models…”
Wait! OCR-free document understanding?
From the beginning of the First Wave of IDP, OCR has been THE fundamental technology underlying every product. If true, this would be a milestone innovation for IDP. I spoke with the Hyperscience product team to find out whether this was more hype smoke or something real. Here is what I learned from the people behind that curtain. (Spoiler alert: it’s not smoke.)
So what – who cares?
Before we go any further, why is this important? Why would any business process owner care whether or not the IDP solution uses classic OCR or is OCR-free? We’ve all had enough of hearing about new AI tech, for tech’s sake. Show me the money.
The ML engineering team at Hyperscience framed the existing IDP problem quite well:
The current state of IDP is that the processing pipeline relies on a cascade of machine learning models. First, you transcribe the content of a document from the image using a process called optical character recognition (OCR). Then, you classify the processed image by placing it in a category. Finally, you extract the relevant data for use in other systems. These three steps require multiple machine learning models to process a document end-to-end. Furthermore, current extraction models are usually trained for a specific set of fields and a specific type of document; the more document types there are, the more extraction models are required. As a result, the pipeline for processing multiple document types grows in complexity.
There are several drawbacks to using multiple models, including:
- Compounding model errors: If model A makes an error, that error is passed on to model B, which in turn is likely to introduce additional errors
- Model management overhead: Each model must be trained independently and then managed separately, which is a significant resource drain
- Processing times: Multiple models can mean more latency, equating to more work to optimize for processing speed
(source: Hyperscience hackathon blog)
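To make that multi-model cascade concrete, here is a minimal, self-contained Python sketch of the traditional pipeline. The three "models" are trivial stubs I invented so the example runs; they are not anyone's actual API. In a real deployment each step would be a separately trained ML model, which is exactly where the compounding errors and management overhead come from.

```python
def run_ocr(image_bytes: bytes) -> str:
    """Model A: transcribe pixels into text (stubbed for illustration)."""
    return "INVOICE\nTotal Due: $1,250.00"

def classify_document(text: str) -> str:
    """Model B: route the document to a category (stubbed)."""
    return "invoice" if "INVOICE" in text.upper() else "other"

def extract_invoice_fields(text: str) -> dict:
    """Model C (one per document type): pull out the relevant fields (stubbed)."""
    total = next((line.split(":", 1)[1].strip()
                  for line in text.splitlines() if "Total" in line), None)
    return {"total_due": total}

EXTRACTORS = {"invoice": extract_invoice_fields}

def process_document(image_bytes: bytes) -> dict:
    text = run_ocr(image_bytes)          # any OCR error propagates...
    doc_type = classify_document(text)   # ...into classification...
    fields = EXTRACTORS[doc_type](text)  # ...and into extraction.
    return {"type": doc_type, "fields": fields}

print(process_document(b"fake-image-bytes"))
# -> {'type': 'invoice', 'fields': {'total_due': '$1,250.00'}}
```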
While that may be a bit techie for a lot of readers, trust me, this is a very big thing for IDP. Before we look further into how Hyperscience is using this in the product, a brief background on this Donut thing.
What is a Donut (AI)?
In November 2021, a team of machine learning researchers at Naver Clova, the AI division of the Korean internet company Naver, published a paper titled OCR-free Document Understanding Transformer. Clova had also developed a mobile OCR AI for receipts, invoices, and other documents using the classic OCR approach bolstered with NLP and computer vision AI models.
To address the drawbacks of using multiple AI models, the Clova team created a novel OCR-free visual document understanding model they nicknamed “Donut,” which stands for DOcumeNt Understanding Transformer. (source: OCR-free Document Understanding Transformer paper, Geewook Kim et al.)
How does it work?
I read the paper and attempted my own summary, then decided that Hyperscience’s product team had already written a good explanation in that first blog. Here it is:
Donut is a visual document understanding (VDU) model that receives only an image of a document and a prompt with details of the task in order to generate a desired response. The model architecture itself (Fig. 1) is composed of a visual encoder called a Swin Transformer and a text decoder (BART) that has been pre-trained on a full-page transcription task. While multi-modal LLMs (such as GPT-4) can generate arbitrary text from an image, Donut generates a narrower response in a templated format, which can be transformed into a structured format such as JSON.
Fig 1. Donut architecture (courtesy of the Clova Donut research team)
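To give a feel for what that single-pass, prompt-driven flow looks like in practice, here is a rough sketch using the publicly released Donut checkpoint via the Hugging Face transformers library. To be clear, this is the open-source research model from the Clova team, not Hyperscience’s product; it assumes the transformers, torch, sentencepiece, and Pillow packages are installed and a document image saved locally as document.png.

```python
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Public research checkpoint fine-tuned on the CORD receipt-parsing dataset.
checkpoint = "naver-clova-ix/donut-base-finetuned-cord-v2"
processor = DonutProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

# The only input is the document image: no OCR step, no word boxes.
image = Image.open("document.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# The "prompt" is a task token that tells the decoder which schema to generate.
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
)

# The decoder emits a templated tag sequence, which the processor turns into JSON.
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()  # drop the task token
print(processor.token2json(sequence))
```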
How did Hyperscience come to love Donuts?
When I was a software product manager, sometimes I would bring a box of Krispy Kreme donuts to help fuel the developer sprint marathons. Who doesn’t like donuts? In 2022, the Hyperscience ML engineering team experimented with Donut during the company’s annual team hackathon event. Keep in mind, this is a company that developed its own OCR engine with leading handwriting recognition, a foundation of its product line today – and yet was willing to disrupt itself.
One of the team members shared this experience in a blog, where he concluded that “whereas the typical document processing pipeline involves many different machine learning models, Donut processes a document end-to-end in a single pass.”
Boom.
It really helps to look at the Before and After workflow. Using images provided by Hyperscience, you can visualize the dramatic reduction in processing steps.
Fig 2. Workflow before Donut
Fig 3. Workflow with Donut
For some use cases, users stand to gain several benefits from this slimmed-down implementation.
- Training and managing a single ML model is just easier. Donut reduces AI model management complexity by replacing the several models of an OCR-based pipeline with one.
- Compounding errors are practically eliminated. Processing raw documents with a single model reduces the potential for cascading recognition failures across multiple models.
- Everything happens faster. Software marketers like to call this TTV (Time to Value), the elapsed time between project kickoff and going into production. Donut behaves like a zero-shot learning model, which means far less sample training is needed.
Too good to be true?
Is this the end of classic OCR as we know it? I was skeptical, if only because donuts have holes in the middle. I asked the team to compare the “No OCR” Donut model with their current “OCR” extraction and classification models.
- Donut, like other GenAI models, can hallucinate. The Hyperscience team stressed the importance of pairing No OCR with its human-in-the-loop (HITL) supervision to keep a close eye on the output.
- Donut cannot replace every feature of OCR extraction and classification models. For example, with OCR you can retrieve the coordinates of an extracted text segment. These coordinates are used to draw bounding boxes around that section of the document for a richer user experience, such as highlighting the text or applying redaction. The Donut model alone cannot do this (see the sketch after this list).
- Donut also requires more computing power. The good news is that the Donut model has around 200 million parameters, a small fraction of the size of GPT-4, LLaMA, Bard, and other foundational LLMs, so its appetite for compute is far smaller. The model training process does require a GPU, prices of which are expected to drop quickly as the market ramps up for GenAI. The models themselves should be able to run in production on a CPU.
- How about performance? The Hyperscience team has run internal tests and found that, in general, No OCR outperforms the OCR extraction and classification models.
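On the bounding-box limitation, a quick illustration may help. A conventional OCR engine returns pixel coordinates alongside every recognized word, which is what drives highlighting and redaction in a reviewer UI. The sketch below uses the open-source Tesseract engine via pytesseract purely as an example of my own choosing; it says nothing about Hyperscience’s in-house OCR, and it assumes Tesseract plus the pytesseract and Pillow packages are installed.

```python
import pytesseract
from PIL import Image

image = Image.open("document.png")

# Classic OCR: every recognized word comes back with left/top/width/height
# pixel coordinates, ready to drive highlights or redaction boxes in a UI.
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

for text, left, top, width, height, conf in zip(
    data["text"], data["left"], data["top"], data["width"], data["height"], data["conf"]
):
    if text.strip():
        print(f"{text!r:<20} box=({left}, {top}, {width}, {height}) conf={conf}")

# Donut, by contrast, generates field values directly and emits no per-word
# coordinates, so it cannot drive this kind of overlay on its own.
```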
To summarize, the Hyperscience team envisions a need for both approaches, depending on the use case. As it further develops the use of Donut, Hyperscience will also continue to enhance and improve its current OCR AI models.
Donut compared to other VDU models
Some readers may be familiar with LayoutLM and AWS AI’s DocFormer, two other recent VDU methods. I asked the team to help compare them with Donut. The key difference is that both DocFormer and LayoutLM combine the raw image input with OCR results from an upstream OCR engine.
By contrast, Donut is an end-to-end model that takes the raw document image as input and returns plain text without relying on an upstream OCR engine. DocFormer and LayoutLM are examples of the “legacy” way of doing document processing, and the future is headed toward a single end-to-end model.
In terms of benchmarking, it’s hard to do an apples-to-apples comparison because LayoutLM and DocFormer performance also depends on the performance of the OCR engine. But so far, Donut seems to have the upper hand in some extraction benchmarks, while still lagging a little behind in pure document understanding (because of the way it’s trained).
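One way to see the difference in practice is to look at what each family of models actually consumes as input. The sketch below uses the Hugging Face processors for the public LayoutLMv3 and Donut checkpoints (my choice of examples, not the team’s); the LayoutLMv3 processor runs Tesseract OCR under the hood to produce words and bounding boxes, while the Donut processor needs nothing but the image pixels.

```python
from PIL import Image
from transformers import LayoutLMv3Processor, DonutProcessor

image = Image.open("document.png").convert("RGB")

# OCR-dependent VDU: the processor runs Tesseract (apply_ocr=True by default),
# so the model receives text tokens and word boxes as well as the image.
layout_processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
layout_inputs = layout_processor(image, return_tensors="pt")
print(sorted(layout_inputs.keys()))
# -> ['attention_mask', 'bbox', 'input_ids', 'pixel_values']

# OCR-free VDU: the image alone goes in; the decoder generates the output text.
donut_processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
donut_inputs = donut_processor(image, return_tensors="pt")
print(sorted(donut_inputs.keys()))
# -> ['pixel_values']
```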
How soon can I have a Donut?
Hyperscience has posted a demo video of No-OCR models for four document types: passports, checks, driver’s licenses, and invoices. You can find that in the blog I referenced above. The team says it has just begun a closed beta test with some existing customers and thinks this could be released to market in the near future.
After fifty years of OCR engines ruling the document processing roost, we are at the beginning of an exciting new phase for data extraction and classification. What will this mean for the legacy OCR engines, and how will it play out at the coalface, where end users are facing workaday document automation problems? Stay tuned; Deep Analysis will continue to track the progress of this innovation, along with other generative AI breakthroughs from the IDP industry.
Post-script: I would like to thank the following members of the Hyperscience product team for their patient explanations of complex AI concepts, and for graciously answering all my questions.
- Priya Chakravarthi, Director of Product (and author of the blog that kicked it all off)
- Ceena Modarres, Senior Engineering Manager, Machine Learning
- Jorge Ruiz, Director of Product Marketing