Deep Analysis | Analyst Firm | Vendor Vignettes

Haystac


Founded 2014 | HQ Boston, MA | 15 employees | >$1M annual revenue

Haystac is unique because it offers a unified platform for both content analytics and cognitive capture. The company could disrupt the cognitive capture market by applying content analytics AI methods to classify and extract data, and its “librarian robot” approach could help organizations maximize their ROI from current and new RPA investments.


The Company

Haystac is a content analytics software company whose mission is to organize its customers’ unstructured data so that it is easier to find and use. The company was founded by AI experts and data scientists, and its software platform makes extensive use of machine learning and deep learning models to process unstructured data and produce a clean, organized data lake. The data is then ready for consumption in several use cases, including information governance, compliance and risk management, data privacy, content migration, and intelligent process automation.

This matters because, by estimates dating back to the 1990s, up to 80% of the data in organizations is unstructured, meaning it is not stored and managed in database tables. Examples include documents, spreadsheets, presentations, scanned images, PDFs, emails, text messages, social media posts, web pages, and every other content type the average company produces and stores. This data lives on shared drives, cloud storage, local hard disks, and legacy document management systems; it is extremely hard to manage and control, and the problem is only getting worse as the amount of data grows exponentially every year. Left alone, unstructured data poses a serious threat to an organization’s reputation and deprives it of the value trapped within these files.

Over its first five years, Haystac invested heavily in research and development, producing today’s cloud platform, which can scale to manage petabytes of data and process over 500 file types, including scanned images. The platform is targeted at what Haystac calls “data shepherds”: the people within an organization who are tasked with classifying and labeling data, whatever their title (business analyst, subject matter expert, records manager, privacy manager, chief data officer, or other). No data science or AI training is required to use the product.

The Technology

Haystac’s flagship product is Indago for Enterprise, a content analytics platform that runs in the cloud or on-premises. Indago’s main purpose is to create, train, and manage AI models without requiring the customer to write code or possess any special AI knowledge. The models (described by Haystac as “librarian robots”) are deployed to crawl through an unstructured data space and create structure out of the file disorder. The system is scalable, with Haystac reporting some customers having processed over 20 petabytes. Unlike some content analytics products, Indago leaves the files in place and does not need to convert native files into a new format.

Training a model is straightforward: Indago uses a classic file/folder user interface and was designed to minimize training time and the number of samples required. It automatically organizes files into clusters based on visual and textual similarities. Indago includes several machine learning and deep learning algorithms and decides which is best for a given file.

To better suit the wide diversity of unstructured data types and sources found in a typical enterprise, Haystac developed multiple AI technologies. Its goal was to adapt AI methods to the native data, rather than forcing the expensive and sometimes inaccurate process of data transformation used by other vendors.

This process also has the net effect of finding samples to train classifiers for semi-structured and unstructured documents, which reduces time and hassle because hundreds or even thousands of samples are usually required.
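To make the idea of grouping files by textual similarity concrete, here is a minimal Python sketch. It is not Haystac’s method, which is proprietary and also uses visual similarity; it assumes plain text has already been obtained for each file and uses a simple Jaccard word-overlap score with a greedy single-pass grouping. The function names, the sample documents, and the 0.5 threshold are all hypothetical choices for illustration.

```python
import re


def tokens(text):
    """Return the set of lowercase alphanumeric words in a document's text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_by_text(docs, threshold=0.5):
    """Greedy single-pass clustering (an illustrative assumption).

    Each document joins the first cluster whose seed document is at least
    `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []  # list of (seed_token_set, [file names])
    for name, text in docs.items():
        toks = tokens(text)
        for seed, members in clusters:
            if jaccard(seed, toks) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((toks, [name]))
    return [members for _, members in clusters]


# Hypothetical sample data: two invoices and one employment agreement.
docs = {
    "a.txt": "Invoice number 1001 total amount due 250 USD",
    "b.txt": "Invoice number 1002 total amount due 975 USD",
    "c.txt": "Employment agreement between employer and employee",
}
clusters = cluster_by_text(docs)
```

In this toy data the two invoices share most of their words and land in one cluster, while the agreement forms its own. Each cluster could then be reviewed and labeled once by a person, which is one way such grouping can surface training samples.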

Depending on the use case, Indago then performs some or all of the following tasks:

  1. Classify the files into categories; e.g., this is an invoice, that is an employment agreement, etc. Haystac uses proprietary AI visual recognition methods to distinguish documents.
  2. Separate large consolidated files into their individual documents. For example, from a 200-page loan file PDF, separate out the uniform lending form, bank statements, appraisal, quitclaim deed, etc.
  3. Extract data from scanned documents. First, Indago uses Haystac’s proprietary visual anchoring method for data extraction (see Figure 1). This method does not rely on OCR and therefore may do better with poor quality images or documents with variable layouts. If needed, Indago then applies OCR in a second pass to improve results. Haystac uses a variety of open source OCR engines depending on the use case, or the customer can substitute another OCR engine.
  4. Identify redundant, obsolete, and trivial (ROT) files and set them aside for disposition. Haystac says some of its customers found that up to 50% of their unstructured data was ROT before migrating content to the cloud.
  5. Create metadata for use in retention schedules.
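
As an illustration of the “redundant” part of ROT identification in the list above, the following Python sketch groups byte-identical files by their SHA-256 digest. This is a generic technique, not a description of how Indago works; the function name and sample data are hypothetical, and the sketch finds exact duplicates only. Detecting near-duplicates, obsolete files, or trivial files would need additional rules or models.

```python
import hashlib
from collections import defaultdict


def find_redundant(files):
    """Group files whose content is byte-identical.

    `files` maps a file name to its raw bytes. Returns a list of groups
    (each a sorted list of names) that share the same SHA-256 digest.
    """
    by_digest = defaultdict(list)
    for name, data in files.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(name)
    # Only digests seen more than once indicate redundant copies.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical sample data: the same report saved in two folders.
sample = {
    "a/report.docx": b"Q3 results",
    "b/report.docx": b"Q3 results",
    "notes.txt": b"meeting notes",
}
duplicates = find_redundant(sample)
```

Here the two copies of the report are flagged as one redundant group, and the notes file is left alone. In practice one copy in each group would be kept and the rest set aside for disposition.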

Haystac was an early innovator in deep learning research for content analysis and continues to add or augment Indago functionality with pre-built deep learning models. The main goal is to continually reduce the time and number of samples required to deploy a model into production and to improve the model’s learning on the job. Here are some examples of how Haystac uses deep learning:

  • Pre-built deep learning model for document boundary detection. This is Haystac’s proprietary method to separate unstructured documents of variable length and composition. Using transfer learning, the model continues to learn as it accumulates domain-specific knowledge in the customer’s environment.
  • Transfer learning for text analytics. By identifying and learning text document types in one domain (e.g., financial contracts or company reports), the accumulated knowledge can be transferred for use in another domain such as real estate or manufacturing. This becomes the initial knowledge base to train new models and should substantially reduce training time and samples.
  • Deep learning classification model for scanned images. The goal again is to simplify the model training and move the accuracy rate closer and closer to 100%. Haystac showed us how Indago can train on as few as five samples for a complex mortgage document.
  • Pre-built deep learning model for information extraction from semi-structured documents (invoices, claims forms, tax forms, etc.). Out of the box, the model will work with many of these documents and will continue learning on the job.

Figure 1
Haystac’s Proprietary Visual Anchoring AI Method for Data Extraction

Our Opinion

Haystac is unique because it offers a unified platform for both content analytics – typically used on the risk management side of the enterprise for information governance and data privacy – and cognitive capture, used mainly on the business process side of the enterprise where intelligent automation lives.

The company could disrupt the cognitive capture market by applying content analytics AI methods to classify and extract data. Haystac told us about a customer that chose Indago over established capture vendors to classify almost one billion documents scanned from file boxes and to extract data for records retention policies. This was, in our estimation, one of the largest capture deals of 2020.

Some organizations that invested in RPA hit a roadblock after the pilots and after automating the low-hanging fruit; they are looking for ways to scale their sizable investments in RPA. Haystac’s “librarian robot” approach will be of interest and could be positioned to help them maximize their ROI from current and new RPA investments.

Advice to Buyers

Haystac is, arguably, a “category-buster” that does not fit easily into the usual technology buckets. We see Indago as viable for more than one use case. If you are looking for a platform to organize petabytes of unstructured data for use in information governance, compliance, or migration, then consider Haystac for your short list. For those with a cognitive capture or intelligent automation project that will ingest semi-structured or unstructured documents, take a close look at Haystac’s unique classification and extraction capabilities. The company’s customer references include BPOs, insurance companies, and banks.


SOAR Analysis

Strengths

  • Practical application of deep learning research and development
  • The data scientist team has worked together for more than 10 years

Aspirations

  • Automate the AI model training process
  • Expand quickly through an investment round or merger

Opportunities

  • Market Indago to RPA vendors and customers
  • Articulate the tangible, quantifiable benefits of using deep learning, a technology still largely misunderstood

Results

  • Created the first hybrid information governance/cognitive capture platform
  • Closed one of the largest cognitive capture deals of 2020