How to Train Your LLM


Most of the data you need is trapped inside company files, emails, messages, chats, media, and more, scattered all over the enterprise. How do you get it out?

Custom Large Language Model (LLM) projects are proliferating at a dizzying rate, and training one requires an enormous amount of enterprise-specific data. For example, let’s say you’re the IRS and you want to create a chatbot to answer questions about the contents of your numerous and dense tax guides. We can all agree that hallucination would be very bad here. So you cannot just deploy GPT, Bard, or the other foundational LLMs off the shelf.

You ask your data science team to create an IRS LLM, trained on your specific tax domain knowledge, that gives taxpayers appropriate answers. But the natural language data you need is trapped inside 100,000-plus PDF files.

Problem: in its current state, that data is unusable for your LLM project. How best to bridge the gap?

I’m loath to puncture some recently floated marketing balloons, but here I go. The solution is not found with IDP, document management, ECM, content migration, or information governance software. True, each of these software products may contain some of the technology required to transform files in the wild into data. But in my estimation, they fall short of an end-to-end solution.

Data scientists and machine learning engineers live in another country and speak a different language. They need software that is fit for purpose. 

We recently spoke with a software vendor who is targeting this use case. Brian Raymond, CEO and founder of unstructured.io, the AI startup that just raised $25M, told us about the company’s founding mission. Brian and his co-founders come from ML engineering backgrounds and they speak Python and other data science lingo. 

The Unstructured platform is purpose-built to extract natural language data from many file types and transform it into purified text ready for the LLM project. I looked through the Unstructured API documentation (#productmanager4ever) and felt at home reading about PDF table extractors and OCR strategies. While the underlying technology may be similar to an IDP stack, what Unstructured does is much different from transactional, back-office IDP, which usually processes one document type at a time. Unstructured’s platform has to process EVERYTHING from ANYWHERE, whatever the enterprise throws into its pipeline. And that takes special skills.
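To make that extraction step concrete, here is a minimal sketch using Unstructured’s open-source Python library. The file name is hypothetical, and the exact strategy options and element categories may vary by library version; treat this as an illustration of the idea, not the vendor’s production pipeline.

```python
# Sketch: turn one dense PDF tax guide into clean text for an LLM corpus.
# Requires the open-source library: pip install "unstructured[pdf]"
from unstructured.partition.pdf import partition_pdf

# Partition the PDF into structured elements (titles, narrative text,
# list items, tables). The "hi_res" strategy applies layout analysis and
# OCR where needed; infer_table_structure keeps table cells intact.
elements = partition_pdf(
    filename="irs_publication_17.pdf",  # hypothetical file name
    strategy="hi_res",
    infer_table_structure=True,
)

# Keep the natural language content and drop page furniture such as
# headers, footers, and page numbers before handing it to the LLM team.
clean_text = "\n\n".join(
    el.text
    for el in elements
    if el.category not in {"Header", "Footer", "PageNumber"} and el.text.strip()
)

with open("irs_publication_17.txt", "w", encoding="utf-8") as f:
    f.write(clean_text)
```

Multiply that by 100,000-plus files, many sources, and many file types, and you can see why this is a pipeline problem rather than a one-off script.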

This is why Deep Analysis decided to add specialized data transformation companies such as Unstructured to our IDP coverage. LLM project managers may find themselves in the IDP pool after Googling subjects such as “OCR engines” and “document transformers,” and we’re happy to send them into the deep end, where the unstructured data transformation specialists swim. Unstructured.io is one to watch in this space.

How did that IRS story end? Check out the Unstructured blog for the rest of the story. 
