AI Interview: Chunk Wisely To Avoid RAG Hell

DataStax's Ed Anuff on the finer points of AI app development

John Leonard

"Almost any developer worth their salt could build a RAG application with an LLM, once they understand the basics of it," said chief product officer at DataStax, Ed Anuff.

"And then chunking hits and your results start to get really wonky. Now you're in RAG hell, and you have to go off and Google."

Most of us, we suspect, would have gone off and Googled well before that point. RAG? Chunking? RAG hell? We asked Anuff to explain.

Retrieval augmented generation (RAG) has quickly risen to become a standard feature of many natural language genAI applications. It uses a vector database as a halfway house between the user's prompt and the large language model (LLM). The prompt goes first to the vector database, which might contain only internal information from the user's organization. Relevant matches from the database are then added as context to the prompt before it is sent on to the LLM to be processed. For example, the augmented prompt might say answers should be restricted to internal information only, or it could supply relevant qualifying facts.
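In code, that flow looks something like the sketch below. This is a minimal illustration rather than any particular library's API: embed(), VectorStore and llm_complete() are hypothetical stand-ins for whatever embedding model, vector database and LLM client an application actually uses.

```python
# Hedged sketch of the RAG flow described above. embed(), VectorStore
# and llm_complete() are hypothetical placeholders, not a real API.

def answer_with_rag(question: str, store: "VectorStore", k: int = 3) -> str:
    query_vector = embed(question)  # embed the prompt into vector space

    # Retrieve the k stored chunks whose vectors best match the prompt.
    chunks = store.similarity_search(query_vector, top_k=k)

    # Augment the prompt with the retrieved context before calling the LLM.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer using only the context below. If the answer is not "
        "in the context, say 'I don't know'.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm_complete(prompt)
```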

Thus enabled, LLMs produce far fewer hallucinations and answer with greater accuracy and contextual relevance. They can also say "I don't know". For example, imagine you have a genAI-powered ecommerce site that only sells TVs. You want the LLM to return information about the televisions you stock, not about other TVs or different electrical goods, and definitely not about people with the initials TV, so you restrict the answers using RAG.

Chunking strategy

Chunking is a process that happens at the ingestion stage, when documents are fed into the database. Models vary, but none has an infinite context window. Data must be broken into chunks before it is fed in, and the way you do that has a huge effect on what sort of answers will be returned.

Ideally, a chunk should be a discrete piece of information with minimal overlap between chunks. This is because the vector database takes a probabilistic approach when matching the information it holds against the user's input: the more closely a chunk's vector matches the prompt's, the better.

"The chunk should reduce down to the most accurate vector possible," said Anuff.

Even when using systems with huge context windows capable of swallowing dozens of documents in one go, it is usually still more efficient, quicker and more accurate to chunk the data. LLM response time and price both increase linearly with context length, and reasoning across large contexts is difficult.

The simplest approach to breaking up text is fixed-size chunking, splitting a document every few bytes or by character count, but as this takes no account of the semantic content it rarely works well. At the other end of the spectrum is automated agent-based chunking, where machine learning takes care of the process based on context and meaning. But really it's horses for courses, said Anuff.
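A minimal illustration of why fixed-size chunking falls short: the splitter cuts wherever the character count dictates, including mid-word.

```python
def fixed_size_chunks(text: str, size: int = 40, overlap: int = 0) -> list[str]:
    """Split text every `size` characters, ignoring semantic boundaries."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "The warranty covers panel defects for two years. Returns require proof of purchase."
for chunk in fixed_size_chunks(doc, size=40):
    print(repr(chunk))
# 'The warranty covers panel defects for tw'   <- cut mid-word
# 'o years. Returns require proof of purcha'
# 'se.'
```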

"If I'm working with a legal firm I might have a bunch of lawyers working with my programmers to define how to extract the structure of a legal contract and chunk it. They can identify all the different variants, and you can build a set of rules and get very good results from that. That's called a domain-specific chunking strategy, and there are a bunch of specialized software companies that do this."

On the other hand, for the majority of cases where documents are more varied and less structured than legal contracts, you want to automate it with agent-based chunking, "where the agent looks at the document and says 'Okay, in this particular case, we'll want to break it up this way, and then this way'."
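In sketch form, the agent-based approach amounts to asking a model to find the boundaries itself. The llm() call below is a hypothetical stand-in for a real LLM client.

```python
import json

def agent_chunk(document: str) -> list[str]:
    """Ask the model itself where the natural chunk boundaries fall."""
    instruction = (
        "Split the following document into self-contained chunks, each "
        "covering a single topic. Reply with a JSON list of strings.\n\n"
        + document
    )
    reply = llm(instruction)  # llm() is a hypothetical placeholder
    return json.loads(reply)
```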

Frameworks and toolkits are emerging to do this, including unstructured.io, LlamaIndex and LangChain, all of which chunk and index data in different ways, using techniques such as semantic chunking and recursive splitting.
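As one concrete example, LangChain's recursive splitter tries progressively finer separators (paragraph breaks, then line breaks, then spaces) before resorting to a hard cut, so chunks tend to end at natural boundaries. A short sketch, assuming the langchain-text-splitters package is installed:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

doc = (
    "Setting up your TV.\n\nUnpack the stand and attach it to the panel.\n\n"
    "Warranty.\n\nPanel defects are covered for two years from purchase."
)

# Splits on "\n\n" first, then "\n", then spaces, only falling back to
# a hard character cut when nothing else fits the size budget.
splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=0)
for chunk in splitter.split_text(doc):
    print(repr(chunk))
```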

RAG hell

Chunking strategy is an evolving field, and as such it has become a bit of a bottleneck for AI development, Anuff said. The law of garbage in, garbage out still applies, and a bad chunking strategy will lead to poor results in a way that's very hard to diagnose. A new inferno has been born, joining dependency hell, callback hell and scope hell to create fresh torment for developers: RAG hell.

"We're talking to projects and they're, like, we're in RAG hell right now. What they've done is a naive implementation. They've done fixed size chunking or something out of the box, loaded up their data, and they're getting very bad results."

Avoiding descent into RAG hell means thinking carefully about the chunking strategy at the start, not after the fact.

DataStax, which offers its own vector database Astra DB, is partnering with a number of players and projects, including some of those mentioned above, to build a stack that will allow users to chuck in a document or, more likely, thousands of documents, and have them optimally chunked with a minimum of fuss.

"We're doing this because we're in the business of selling databases," Anuff said. "The sooner they get chunking done the sooner they can build applications and the more they're going to use our database."
