Get ‘Messy’ Data In Order Before Implementing AI, Cautions Adeptia Exec

Adeptia offers a platform for organizations, including the midmarket, to get their unstructured data in order before integrating it with AI or other internal systems.

(Deepak Singh, founder of Adeptia)

As a midmarket IT leader, you have likely already deployed or are planning to deploy some form of AI in your infrastructure.

Before adopting any AI technology, it’s important to get your data in check, the founder of tech firm Adeptia says.

Adeptia helps businesses prepare their data before they launch AI initiatives or streamline business processes.

Adeptia offers a “first-mile data service” that converts unstructured data, such as PDFs, into structured formats, such as JSON files, for easier integration with AI and other systems.

The Chicago-based company was founded in 2000 by Deepak Singh, who is currently the chief innovation officer.

Adeptia recently raised $65 million in funding from equity firm PSG.

MES Computing spoke with Singh about how his company prepares organizations to adopt AI and achieve business transformation goals through data hygiene.

Can you speak about what Adeptia does?

We believe the biggest problem with making AI successful is not just the quality of the AI models, because the quality of the AI models now is very good, and you cannot even differentiate between OpenAI’s o2 and o3 models. What distinguishes the use of those models is the data that is used to augment the model to work with your business operations.

What we do is help companies access that data, what we call first-mile data, which is data coming from outside of your company, because that is all in different formats, different protocols. It is messy data, and it needs work to be prepared for use by AI.

We make this external data accessible for internal business operations and for AI.

Are you like a MongoDB, or a database company?

No. What they do is store data. We don’t store data. We move data. We take data that is outside of the company, or even inside, in different applications. It’s siloed, and we help companies access the data and then convert it, prepare it in a very easy way, so that business users, analysts, can do this themselves, without writing any code, and they can then use that data in their business processes.

I’ll give you an example. One of our customers handles loan-processing documents. They get loan applications for business loans, and that data used to come in, and still comes in, as PDFs and scanned documents. That is really hard to work with, because they need to type that data into their internal systems.

We have a solution called intelligent document processing that converts that data automatically into a structured format called JSON, which is used by APIs. That data can then be loaded into a vector database to work as a RAG solution with AI, or into internal applications such as SAP or QuickBooks or NetSuite, whatever applications they’re using.

It can take data that is basically scanned images or forms, extract all the important data elements, and turn it into structured data that can be used more efficiently.
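As a rough sketch of what that kind of conversion produces, the field names and values below are hypothetical, not Adeptia’s actual output schema, but they show the general idea of a scanned loan application becoming machine-readable JSON:

```python
import json

# Hypothetical fields extracted from a scanned loan application
# (illustrative only; not Adeptia's actual schema).
loan_application = {
    "applicant_name": "Acme Manufacturing LLC",
    "tax_id": "12-3456789",
    "requested_amount": 250000,
    "currency": "USD",
    "term_months": 60,
    "documents": ["balance_sheet.pdf", "tax_return_2023.pdf"],
}

# Serialize to JSON so the record can be passed to an API, loaded into a
# vector database for RAG, or pushed into an internal application such as
# SAP, QuickBooks or NetSuite.
print(json.dumps(loan_application, indent=2))
```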

You’re preparing data to be used by companies that want to create AI solutions.

Yeah, it could be AI solutions, or it could also be internal operations. For example, they’re getting purchase orders coming in as documents. Many companies get orders coming in [PDF], Word documents ... they track the data, like, who’s the customer? What is their customer ID? What product did they order? What’s the schema?

This information needs to be extracted from these documents so that it’s more consumable by applications and computers. We do that conversion.

There are apps out there that can take a PDF and make it into an editable text file. What do you bring that differentiates you from that?

Those are applications that are called OCR. But what they do is convert the data into text. They do not convert it into a structured format. Structured format [is] where you have a very well-defined structure for an order, [for] example ... the order ID is this way, the numbers are this way ... your pages of SKUs, line items.

Our system handles that in a better way. For example, in the manufacturing space, there are EDI [Electronic Data Interchange] standards. In health care, it’s HIPAA standards. These are standard formats. This data also needs to be processed to be usable by internal operations and with AI. [We] help make those data integrations much more efficient.
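A minimal sketch of the difference Singh is drawing, assuming a very simple, hypothetical order layout: OCR yields flat text, while structured extraction yields named fields and line items that downstream systems can consume directly.

```python
import re

# Flat text as an OCR tool might return it (hypothetical layout).
ocr_text = """ORDER ID: PO-20431
CUSTOMER: Northwind Traders
SKU-1001  Widget A   qty 12  unit 4.50
SKU-2002  Widget B   qty  3  unit 19.00"""

# Structured extraction: pull out well-defined fields and line items.
order = {
    "order_id": re.search(r"ORDER ID:\s*(\S+)", ocr_text).group(1),
    "customer": re.search(r"CUSTOMER:\s*(.+)", ocr_text).group(1).strip(),
    "line_items": [
        {"sku": m.group(1), "qty": int(m.group(2)), "unit_price": float(m.group(3))}
        for m in re.finditer(r"(SKU-\d+).*?qty\s+(\d+)\s+unit\s+([\d.]+)", ocr_text)
    ],
}

print(order)
```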

In your dealings with customers, especially the midmarket, what are some of the biggest problems you see with unstructured data?

The big issue that makes it really difficult for companies to leverage this data is that it is constantly changing.

Data is very fragile.

The other big problem is that every time they add a new source of data, they have to set up connections with that new source.

What we do, and what I think is really important, is to enable self-service, so that business users or analysts can themselves do this data ingestion.

They can themselves connect to a new customer or new broker or new partner and ingest the data.

Many of our customers have 3,000 data sources or partners or customers ... Fidelity has more than 10,000.

Your solution can extract data from any source, whether it’s a mobile device, cloud or on-premises. Any endpoint?

You’re right ... any source, meaning it could be any format, like standard format, or it could be a non-standard format like CSV files or Excel files, or it could be unstructured formats like PDFs or documents. We handle any format and protocol.

Does the solution handle multimedia files ... video and images?

Because we are an enterprise software company, usually the enterprise data is still text, so we don’t actually handle video or audio. That’s something that we should put on the roadmap.

What is your best advice for IT leaders before they jump into deploying AI solutions?

They need to look at their data integrity, their data quality and their ability to access the right type of data, especially what we call first-mile data, which is data coming from outside. Usually that’s the data you can’t control; it’s messy, it’s unstructured, or, you know, it’s in different formats. They need to have control of [their] data hygiene.

They need to empower business users.

There should be a lot of focus on merging data with AI, and there are different models for how to do that. One is called RAG (retrieval augmented generation). It allows any company to combine its own data with an LLM that’s already been trained on universal data.
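A minimal sketch of the RAG pattern he describes, with a toy embedding function standing in for a real embedding model (every name and document here is illustrative, not a specific product’s API): company documents are embedded, the most relevant ones are retrieved for a question, and only that context plus the question is sent to the LLM.

```python
import hashlib
import math

def embed(text: str) -> list[float]:
    """Toy stand-in for a real embedding model: hash words into a small vector."""
    vec = [0.0] * 16
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % 16] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Company documents: the "own data" combined with the pretrained LLM.
documents = [
    "Invoice terms for Northwind Traders are net 30 days.",
    "The Chicago warehouse ships EDI 856 advance ship notices.",
    "Purchase orders above $50,000 require CFO approval.",
]
index = [(doc, embed(doc)) for doc in documents]

question = "What approval is needed for large purchase orders?"
q_vec = embed(question)

# Retrieve the most relevant documents and build the augmented prompt.
top = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:2]
context = "\n".join(doc for doc, _ in top)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)  # This prompt would then be sent to the LLM.
```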

What are the security factors that you have in place?

Security is key, especially when it comes to data that is covered by HIPAA, you know, medical data, or PII, personally identifiable information.

When companies combine this data, whether it is in a RAG kind of model or they use their data to train their own LLM, some companies take an open-source model and do what’s called fine-tuning it with their own data.

So instead of using RAG, they use their own data to fine-tune the model. But you have to anonymize the data that you’re using for training the AI, and a lot of companies don’t know how to mask certain fields, like Social Security numbers.
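A minimal sketch of that kind of masking, assuming U.S.-style Social Security numbers and email addresses are the fields to anonymize before fine-tuning (the record and patterns are illustrative):

```python
import re

# Illustrative training record containing PII that must not reach the model.
record = "Applicant Jane Doe, SSN 123-45-6789, contact jane.doe@example.com, requested $250,000."

def mask_pii(text: str) -> str:
    """Replace Social Security numbers and email addresses with placeholder tokens."""
    text = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]", text)
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "[EMAIL]", text)
    return text

print(mask_pii(record))
# -> Applicant Jane Doe, SSN [SSN], contact [EMAIL], requested $250,000.
```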

There are data security best practices for just operational data, even without AI, so that you keep control of who has roles and permissions and access. But when it comes to AI, I think there’s a whole other element of data security that many companies are not really aware of.

They think, we can just take our data and put it all in the RAG vector database, and that should work. But actually it’s not a good idea to do that. They need to control who has access, because what if you put your HR data into the vector database? Now anybody in your company can ask, ‘How much does the CEO make?’ and it will be able to spit out the answer, because the AI doesn’t know; it accesses all the data that is there. There are security best practices that need to be applied, even for the data that goes into a RAG or to fine-tune the models.
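A minimal sketch of what that control can look like at retrieval time, assuming each chunk loaded into the vector store carries an allowed-roles tag (the metadata scheme and names are hypothetical, not a specific vendor’s feature):

```python
# Hypothetical documents as they might sit in a vector store, each tagged
# at ingestion time with the roles allowed to see it.
documents = [
    {"text": "Q3 sales playbook for the Midwest region.", "allowed_roles": {"sales", "hr", "finance"}},
    {"text": "Executive compensation: CEO base salary details.", "allowed_roles": {"hr"}},
]

def retrieve(query: str, user_roles: set[str]) -> list[str]:
    """Return only chunks the requesting user is entitled to see.

    A real system would also rank by vector similarity; the point here is
    that the permission filter is applied before anything reaches the LLM.
    """
    return [doc["text"] for doc in documents if doc["allowed_roles"] & user_roles]

# A sales analyst asking about CEO pay gets no compensation chunks back,
# so the model has nothing sensitive to spit out.
print(retrieve("how much does the CEO make", {"sales"}))
```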