Atop the LLM Leaderboard
Indico Data tees up new LLM benchmarking site

With the PGA Championship teeing off tomorrow, I just couldn’t resist the metaphor. 

Last summer we recognized Indico Data as one of the “adults in the GPT room” for its considered, reasonable approach to deploying LLMs in enterprise document automation projects. Founded in 2014, Indico was part of the 3rd Wave of IDP (machine and deep learning), but at its core it is really a 4th Wave company that has worked on large language models (LLMs) from the start. Co-founder Alec Radford published a seminal generative AI paper back in 2015, so Indico is hardly a new kid on the LLM block.

So, it comes as no surprise to us that Indico would be one of the first to offer a benchmarking site that rates various LLMs for document understanding tasks. The vast majority of LLM benchmarking to date has focused on chatbot-related tasks. As a document automation specialist, Indico saw the need to analyze LLM performance on more deterministic tasks such as extraction and classification, and to examine performance and cost under different assumptions about context length and task complexity. The team briefed us today about their goals for the project.

Announced on May 13 and publicly available now on the company’s website, the LLM “Leaderboard” does exactly what the name implies: it ranks player scores across essential IDP tasks such as data extraction, document classification, and summarization. The Leaderboard can be sorted and analyzed in several ways. The LLMs tested include Llama and models from Azure OpenAI, Google, and AWS Bedrock. Indico has fully disclosed the list of datasets (CORD, CUAD, Kleister NDA, Kleister Charity Financial Reports, ContractNLI, and others) as well as the prompting styles used for testing. Indico has also cleverly included its own discriminative language models based on RoBERTa and DeBERTa.

The Old Pros are in the Lead

When Indico showed us the Leaderboard, what immediately stood out was that the top two models for data extraction accuracy were not LLMs at all, but Indico’s own trained models.

This data validates what we’ve been reporting since the beginning of the IDP LLM experiment last year: good old machine learning models fine-tuned for a document type are still the most accurate, fastest, and lowest-cost choice for a data extraction operation.

Data Extraction

We did a deeper dive into data extraction accuracy, always a central concern for an IDP project. We sorted the results by the prompt style titled “long context extraction with instructions first”, across all the datasets, and ranked them by F1 score (with 1.0 being the highest possible score).
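For readers who want a refresher on the metric: F1 is the harmonic mean of precision and recall. Here is a minimal sketch in Python, with made-up extraction counts for illustration (these are not numbers from Indico’s benchmark):

# Minimal F1 computation from field-level extraction counts.
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    precision = true_pos / (true_pos + false_pos)  # of fields extracted, how many were right
    recall = true_pos / (true_pos + false_neg)     # of fields present, how many were found
    return 2 * precision * recall / (precision + recall)

# e.g. a model that correctly extracts 80 fields, hallucinates 20,
# and misses 20 lands at an F1 of 0.8:
print(f1_score(80, 20, 20))  # 0.8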

Indico’s own discriminative models were the clear leaders. Take, for example, the NDA dataset consisting of thousands of long-form unstructured NDAs: the RoBERTa and DeBERTa models both post respectable F1 scores above 0.8, ready to be tuned for production deployment; the closest LLM, Anthropic’s Claude V2, sat far down the ranking at 0.38. Our takeaway: deploying an LLM for this task is not a brilliant idea.

Document Classification

When it comes to classifying documents using prompts without any description or rationale, the LLMs show their superiority. This operation resembles the zero-shot “magic” that dazzled us all in 2023, when we fed different document types into ChatGPT and watched it accurately classify each one with no prior training. For this dataset, we selected Resource Contracts, a repository of unstructured oil, gas, and mining contracts.

GPT-4 and Claude both achieved a perfect F1 score, while Gemini Pro clocked a respectable 0.867. The RoBERTa and DeBERTa models are not even on this list. But watch this: if we tweak the prompt style to “classification with a description”, the RoBERTa model rockets to the top with a perfect score and LLMs are far down the list. The moral of this story? A little prompt engineering goes a long way and could save a lot of money for your project.
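To make the distinction between the two prompt styles concrete, here is a minimal sketch of what they could look like; the wording is our own illustration, not Indico’s published test prompts:

# Illustrative prompt templates only; Indico has not disclosed its exact wording.

# Style 1: classification without any description or rationale.
prompt_bare = (
    "Classify this document as one of: oil contract, gas contract, "
    "mining contract.\n\n{document_text}"
)

# Style 2: "classification with a description" of each label.
prompt_described = (
    "Classify this document. Categories:\n"
    "- oil contract: governs oil exploration or production rights\n"
    "- gas contract: governs natural gas rights\n"
    "- mining contract: governs mineral extraction rights\n\n"
    "{document_text}"
)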

When it comes to document summarization, LLMs far outpace the discriminative models. You can check that result and more for yourself here: https://indicodata.ai/llm/. The team told us they plan to update the Leaderboard every 90 days.

Conclusion

Computers still struggle to achieve acceptable accuracy on many document types, and end users are overwhelmed by the number of AI options available to address the problem. The wrong choice can cost real money and waste valuable time. Indico’s LLM Leaderboard is a welcome tool that can help a corporate AI team narrow its selections based on specific use cases.

Chapeau to Indico and to any other software company that freely and openly contributes good data like this to the public conversation about LLMs. (We have also covered Rossum’s benchmark for transactional business documents.)
