Finding value within mountains of unstructured data is both the challenge and the opportunity. Organizations have amassed millions, and in some instances billions, of files over the years. They pay heavily to store them because they believe that some of those files are valuable. There certainly is value hidden within the mass, but there is also a high likelihood that some of those files could cause organizational damage.
Separating the valuable files from the problematic and junk files is a significant challenge; most organizations have no idea what they have. Many tools and services are available in the market to help with this, yet most work by reading the text in the metadata or the file itself, and while they are quite effective, they struggle with very high volumes. Even in conjunction with good information management practices, these tools leave much room for improvement.
An AI Approach to Information Management:
DocAuthority is a startup that has developed a new, patented, AI-based approach called BusinessID™ to do just that. In our discussions with the firm and in viewing a demonstration of its product, it is clear that there is much promise in what it can do. At the same time, we left our discussions with some questions as to how it works. In short, the DocAuthority system is used to identify and protect sensitive data automatically, primarily as a search-style DLP (Data Loss Prevention) tool. The identified information is categorized based on an understanding of the data’s business meaning and context. DocAuthority can be used in many different industries to identify and protect sensitive information, forming the basis for designing a practical Information Governance model.
Real-World Challenges:
However, how it does this differently from traditional methods remains a bit of a head scratcher. What we do know (having sourced and read the original patent) is that it works through a form of pattern matching. The system extracts features from the documents, creates a vector based on those features, then matches future documents for similarity against the feature vectors. Where we are still somewhat in the dark is how to configure those vectors to meet specific requirements, as DocAuthority told us that the system does not need to be trained because it has already been pre-trained. What we see is an AI-driven faceted search engine, primarily for Word, Excel and PDF files held in file shares, SharePoint, and Office 365: one that does a good job of the initial analysis and tagging, and one that can run across multiple repositories and aggregate or federate the results. Combine that with some prebuilt taxonomies, and you have a tool well worth looking at if you are facing multiple silos of unidentified data, as it primarily builds an auto-classified information hub. From that hub, there is the potential to automate records management at scale.
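To make the pattern-matching description above a little more concrete, here is a minimal sketch of the general idea: extract features, build a vector, and compare new documents against known vectors. It is not DocAuthority's method, which we have not seen in detail. The word-count features, the cosine similarity measure, the 0.5 threshold, and the names `extract_features`, `cosine_similarity`, and `classify` are all our own assumptions, chosen purely for illustration.

```python
import math
import re
from collections import Counter


def extract_features(text):
    """Hypothetical feature extractor: lowercase word counts (an assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, labelled_vectors, threshold=0.5):
    """Return the best-matching label, or None if nothing reaches the threshold."""
    features = extract_features(text)
    best_label, best_score = None, 0.0
    for label, vector in labelled_vectors.items():
        score = cosine_similarity(features, vector)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# Illustrative "known" categories; real systems would use far richer features.
known = {
    "invoice": extract_features("invoice number total amount due payment terms"),
    "contract": extract_features("agreement between parties terms termination clause"),
}

label = classify("Invoice 1042: total amount due within payment terms", known)
unmatched = classify("lunch menu for friday", known)
```

In this toy example the first document lands in the `invoice` category and the second matches nothing. How a production system would choose its features and thresholds is exactly the part that remains unclear to us.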
Our issue is that it is all a bit generic and vague; this may be because how DocAuthority actually works is hard to explain. However, as the firm has already patented the underlying approach, it would, in our estimation, be of great value if it could articulate more clearly how it differs from traditional approaches. In future releases, we would also like to see more automation of the delivery of insights, alerts and actions to the end user, rather than just in the identification of data. All in all though, it’s a pretty good product, and a timely one as more data protection regulations roll out. Finally, it is worth noting that it is refreshing to see new approaches to AI being used for document and records management requirements, as traditional approaches have struggled to scale and truly deliver on their promises.