At the NYC leg of its constantly traveling caravan of product announcements – World Tour – this week, Salesforce announced upcoming enhancements to its AI product strategy, primarily concerning support for unstructured data. Mainly focused on its Data Cloud (formerly Genie) and Einstein1 (formerly AI Cloud) – existing reports on both are available to subscribers – the announcements will enable existing organizational knowledge, whether managed in applications or in storage, to be processed into a vector database. That database will form part of the grounding service behind its various Copilot-branded AI services, which become Einstein Copilot Search in the process.
Let’s break this down a bit. As we’ve mentioned in the past, AI without access to your organizational data isn’t anywhere near as useful as it could be. This need for context, especially around generative functionality, has meant that vendors – Salesforce included – have been concerned with building specific architectures that allow this flow to happen without the whole thing feeling unnecessarily complex, or frightening whoever looks after compliance half to death.
That balance has required a gentle reveal of the underlying architecture and the gradual onboarding of potentially huge sets of knowledge-rich data. In this set of announcements, Salesforce indicated that in select customer pilots starting in February 2024, unstructured data in the form of what it calls “Knowledge Sources” (applications) and “BLOB Stores” (cloud database stores of unstructured data) can be ingested into a new vector database within Data Cloud, where it will form part of a complete data graph alongside any structured data already onboard. Both are then available for the grounding/Copilot Search service to augment prompts with context (including running similarity queries) as a form of (and time to grind your teeth at this point, long-standing search tech people) semantic search.
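Salesforce hasn’t published the internals of its vector database or Copilot Search, but the pattern described above – embed unstructured text, store the vectors, then answer queries by similarity – can be sketched in a few lines. The toy hash-seeded embedding below stands in for a real trained embedding model, and the in-memory list stands in for Data Cloud’s vector store; both are illustrative assumptions, not Salesforce’s implementation.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy embedding: sum a deterministic pseudo-random vector per word.
    A real pipeline would use a trained embedding model instead."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        seed = int(hashlib.md5(word.encode()).hexdigest(), 16) % (2**32)
        vec += np.random.default_rng(seed).standard_normal(dim)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# "Ingest" unstructured snippets (our stand-in for Knowledge Sources / BLOB
# stores) into an in-memory vector store of (document, embedding) pairs.
documents = [
    "How to reset a customer password in the admin console",
    "Quarterly refund policy for enterprise subscriptions",
    "Escalation path for tier-two support tickets",
]
store = [(doc, embed(doc)) for doc in documents]

def similarity_search(query: str, k: int = 2) -> list[str]:
    """Return the k documents whose embeddings score highest (dot product
    of unit vectors, i.e. cosine similarity) against the query."""
    q = embed(query)
    scored = sorted(store, key=lambda item: -float(item[1] @ q))
    return [doc for doc, _ in scored[:k]]

print(similarity_search("password reset help"))
```

The grounding service would then hand the top-scoring passages to the prompt-augmentation step rather than printing them.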
If this feels like a full Retrieval-Augmented Generation (RAG) model, then you’re right; it is. That’s mostly what Salesforce’s grounding process – semantic search included – has been based upon from the get-go. And not just theirs: Microsoft’s too (whose explanation of RAG is a good one, should you wish to dig further into it and learn how it compares to fine-tuning). Don’t be surprised if the guts of these processes remain hidden behind some opaque glass labeled “grounding” for the foreseeable future, lest it tip that gentle balance too far toward the “unnecessarily complex.”
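The RAG step itself is mechanically simple, which is partly why vendors can tuck it behind that opaque glass: retrieve relevant passages, then splice them into the prompt ahead of the user’s question. A minimal sketch – the template wording is our own illustration, not Salesforce’s or Microsoft’s actual prompt format:

```python
def build_grounded_prompt(question: str, retrieved_passages: list[str]) -> str:
    """Assemble a RAG-style prompt: retrieved context first, then the
    question, with an instruction to stay within that context."""
    context = "\n".join(f"- {p}" for p in retrieved_passages)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )

# Passages would come from the vector store's similarity search;
# these examples are made up for illustration.
passages = [
    "Refunds for enterprise plans require approval within 30 days.",
    "Tier-two escalations are handled by the regional support lead.",
]
prompt = build_grounded_prompt("Who handles tier-two escalations?", passages)
print(prompt)
```

The assembled prompt then goes to the language model unmodified; the retrieval side, not this templating, is where the engineering effort lives.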
The unstructured data here – as those teeth-grinding search tech people will have been shouting for several paragraphs now – is vital to making the semantic search part work as intended. The relevancy of what is returned will determine how successful this entire process can become (if you’re unfamiliar and have a passing interest, it’s time to learn about Precision vs Recall). Turning on the taps to enterprise volumes of unstructured data makes sense in embracing the entirety of organizational knowledge; it’s all there. But (puts on an optimistic hat at this point) that “all there” is likely to include a great deal of less useful and out-of-date information, the auditing of which will need to form an active, engaged part of these pilot projects for them to be successful.
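For those reaching for the search-evaluation textbook, the two measures are quick to state: precision is the fraction of what you retrieved that was actually relevant, recall the fraction of the relevant material you managed to retrieve. A worked toy example (the document IDs are, of course, invented):

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: share of retrieved items that are relevant.
    Recall: share of relevant items that were retrieved."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"doc1", "doc2", "doc3", "doc4"}   # what the search returned
relevant = {"doc1", "doc2", "doc5"}            # what actually answers the query
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 2 of 4 retrieved are relevant; 2 of 3 relevant were retrieved
```

Flooding the index with stale or duplicate content tends to drag precision down even when recall looks healthy, which is exactly the auditing problem the pilots will need to stay on top of.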
In the preview slides shown to analysts before the formal announcement, third-party “Knowledge Sources” included logos for OpenText, Atlassian’s Jira, Box, Google Drive, Microsoft SharePoint, and – for reasons we can only assume are pure nostalgia – Jive. Let’s take those as a representative set of what could feasibly be present as knowledge sources in a single organization (hat here again worn at a suitably jaunty angle). It’s optimistic to think that you’d be able to treat all of them as immediately suitable for use. That same slide deck refers to wrangling these application data sets into Data Cloud as “harmonize,” suggesting that they buy their hats in the same place I do.
However, it provides an important insight into what Salesforce and its peers have already recognized as necessary territory for expansion within their customers’ and prospects’ application estates. The activation of unstructured data underpins the ultimate productivity of organizations’ investments in AI, and vendors will likely look to deliver it under the auspices of a single, standardized platform. For a moment – perhaps a brief one – might those knowledge sources and their big-suite partnerships become more than just logos on the page?
As a firm that shares a collective lifetime or two in the world of unstructured data, we naturally approve of the arrival of Salesforce in our backyard (even if we’re tempted to ask what took you so long). However, as a firm that also calls its podcast “We Love Ugly Data!” – we’re equally aware that what you find when you peer into what customers are paying lavishly to store in applications and/or cloud storage is very often not in any state to be the immediate aid to AI you might hope for.