The promise of RAG
Feeding the prompt of an LLM with query- and company-specific data located via a RAG (Retrieval-Augmented Generation) pattern is the go-to approach for many developers who want to build AI systems that understand the company context and respond based on specific documents rather than generic web or training data. Basic RAG systems are very straightforward to implement, either by assembling ready-made components or by integrating an off-the-shelf solution. For many product demonstrations and simple applications the result looks very convincing: the AI's answers clearly draw from documents stored in the RAG database.
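To make the pattern concrete, here is a minimal sketch of the basic RAG loop: embed the documents, embed the query, retrieve the closest matches, and stuff them into the prompt. The embed() function is a toy hashed bag-of-words stand-in for a real embedding model, and build_prompt() simply assembles the text that would be sent to an LLM; the names and the prompt wording are illustrative assumptions, not a specific library's API.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashed bag-of-words embedding (stand-in for a real embedding model)."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Return the k documents whose embeddings are most similar to the query."""
    doc_vecs = np.stack([embed(d) for d in docs])
    q = embed(query)
    # Cosine similarity between the query vector and every document vector.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]

def build_prompt(query: str, docs: list[str]) -> str:
    """Stuff the retrieved snippets into the prompt that goes to the LLM."""
    context = "\n\n".join(retrieve(query, docs))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

if __name__ == "__main__":
    docs = [
        "Our refund policy allows returns within 30 days.",
        "Support is available on weekdays from 9 to 17 CET.",
        "The Q3 report highlighted supply-chain delays.",
    ]
    print(build_prompt("When can customers return a product?", docs))
```

A production setup would swap the toy embedding for a real embedding model, store the vectors in a vector database, and pass the assembled prompt to an LLM, but the shape of the flow stays the same.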
In more complex applications you need to tune the RAG setup: how the data is chunked, how embeddings are calculated, how retrieved documents are reranked, and so on. This tuning makes the responses more relevant and helps the model surface better sources to quote from. Since RAG systems are a relatively new invention (the first paper was written in 2020), all parts of the pipeline are still evolving rapidly, and there are entire communities (e.g., Reddit's /rag community) dedicated to learning what works and what does not.
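Two of those tuning knobs can be sketched in a few lines. The chunk() function below splits text into overlapping character windows, where the window size and overlap are typical parameters to experiment with; the rerank() function is a deliberately naive word-overlap scorer standing in for the cross-encoder or LLM-based reranker a real system would use. Both function names and defaults are assumptions for illustration.

```python
def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows; size and overlap are tuning knobs."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Toy reranker: order retrieved chunks by word overlap with the query.
    A production system would use a cross-encoder or an LLM-based scorer here."""
    q_words = set(query.lower().split())
    scored = sorted(candidates,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return scored[:top_n]
```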
The limitations of RAG
However, what many developers are starting to realise is that the fundamental paradigm of a RAG system limits its usefulness in many applications. If you want to humanise what a RAG system does, think of an analyst sitting in a crowded office cubicle filled with thousands of two-sided post-it notes. The front side carries a short code (the embedding) describing the longer text on the back. Every time you ask a question, the analyst scurries around to find as many relevant post-it notes as possible, reads through the backs and then answers based on that. The analyst has a great scheme for identifying post-it notes at run time by matching what you asked against the short codes. However, every time you phrase something a bit differently, they look for different short codes and consequently come back with different notes. And you never know whether they read all the right notes or just picked a few obvious ones. Maybe there was a key insight on the back of one note, but it was never read, so the answer is based on mediocre sources only.
Now, you could try to overcome this by increasing the context length of the model and feeding it more tokens, being more generous about which notes the analyst retrieves. In some applications (say, analysing 100+ pages of interview notes) you could even skip the RAG part and just ask the analyst to read all the notes through each time you ask them something. In this case, what tends to happen is that all the unnecessary facts clog the analyst's brain as they try to work out which of the data is actually relevant. The quality of the answers degrades as the context grows, and many LLMs seem to overweight information near the beginning and end of the context. And the answer is still a black box, as you do not know whether the analyst used the most crucial insights to shape it.
