RAG is not the same as vector similarity search

March 26, 2024

Not all AI tools are created equal. Let’s take a closer look at what RAG can (and can’t) do for you.

Is it just me, or is everyone talking about RAG right now? RAG, or Retrieval Augmented Generation, is an approach to interacting with large language models (LLMs) that has gained significant attention among software developers in recent months. And for good reason! It offers a practical way to use LLMs in applications with specialized or private domain knowledge, and it allows for quick market deployment because you engineer prompts instead of fine-tuning the model.

One of the key reasons for the popularity of the RAG approach is the limitation imposed by relatively small context windows (<100k tokens) in most LLMs. This constraint necessitates a pragmatic approach towards crafting prompts, as only a few "relevant" documents from the knowledge base can be included.

Vector embeddings have emerged as a practical technique to represent the semantics of data, such as documents, as a collection of floating-point numbers: a “vector”. This allows users to calculate the proximity of different vectors to each other, which maps to the semantic similarity of the data represented by these vectors. This is not a new technique, but it has rapidly grown in popularity along with the growth of LLM-based applications.
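
To make the idea of "proximity" concrete, here is a minimal sketch of a similarity search using cosine similarity, one common proximity measure. The three-dimensional vectors and document names are made up for illustration; real embeddings come from an embedding model and typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, documents, k=2):
    """Return the names of the k documents whose vectors are closest to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in documents.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Hypothetical toy embeddings (assumption: not produced by any real model).
documents = {
    "return policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "gift cards": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # hypothetical embedding of the user's question

print(top_k(query, documents))
```

A vector database does essentially this ranking at scale, usually with approximate nearest-neighbor indexes instead of comparing the query against every stored vector.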
 
Many LLM applications provide natural language interfaces, which makes vector embeddings and similarity search-based data retrieval a natural fit. For basic applications like "chat with a knowledge base," RAG essentially boils down to “vector similarity search”.

As LLM technology evolves, however, context window sizes have been increasing dramatically. Google’s Gemini 1.5 will support one million tokens, enough to fit the entire text of War and Peace in a prompt. This raises the question of whether one can simply include the entire knowledge base in the prompt and skip RAG altogether. That question misunderstands the core of RAG, which extends beyond vector similarity search.

The R in RAG is about retrieval: not just injecting relevant long-term knowledge, but also acquiring and incorporating additional contextual information dynamically. For a knowledge base chat application, this additional context could include details like the user's recent purchases, physical location, or other personalized data, often requiring database lookups for real-time retrieval and injection into the prompt.
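
As a rough sketch of what that dynamic retrieval can look like, the example below looks up a user's recent purchases in a database and injects them into the prompt alongside the retrieved documents. Everything here is an assumption made for illustration: the `purchases` table, the sample rows, the helper names `fetch_recent_purchases` and `build_prompt`, and the prompt wording. An in-memory SQLite database stands in for whatever operational store an application actually uses.

```python
import sqlite3

# Hypothetical schema and sample data; a real application would query its own store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, item TEXT, purchased_at TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        (1, "trail running shoes", "2024-03-20"),
        (1, "water bottle", "2024-03-22"),
        (2, "yoga mat", "2024-03-21"),
    ],
)

def fetch_recent_purchases(conn, user_id, limit=3):
    """Look up the user's most recent purchases, newest first."""
    rows = conn.execute(
        "SELECT item FROM purchases WHERE user_id = ? "
        "ORDER BY purchased_at DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return [row[0] for row in rows]

def build_prompt(question, documents, purchases, location):
    """Combine retrieved documents with user-specific context into one prompt."""
    doc_block = "\n".join(f"- {doc}" for doc in documents)
    purchase_block = ", ".join(purchases) if purchases else "none on record"
    return (
        "Answer the question using the context below.\n\n"
        f"Knowledge base excerpts:\n{doc_block}\n\n"
        f"User's recent purchases: {purchase_block}\n"
        f"User's location: {location}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    question="Can I return my last order?",
    documents=["Items can be returned within 30 days of delivery."],  # made-up excerpt
    purchases=fetch_recent_purchases(conn, user_id=1),
    location="Berlin",
)
print(prompt)
```

Note that none of the user-specific context comes from a similarity search: it is an ordinary keyed lookup, which is the point the paragraph above makes about retrieval being broader than vector search.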

Larger context windows also come with a price tag for applications that use them extensively. The larger the prompt sent to the LLM, the more expensive and time-consuming the generation is. In highly interactive applications such as chat agents, there may be many LLM invocations, and these must complete in near-real time to provide an acceptable user experience. Consequently, efficiently retrieving only the relevant documents from a broader knowledge base, combined with user- and situation-specific context, will remain a critical element of such RAG solutions.
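
A back-of-the-envelope calculation illustrates the cost side of this argument. The numbers below are assumptions chosen for round arithmetic, not any vendor's actual pricing: a hypothetical rate of $1 per million input tokens, 20 LLM calls per conversation, and a curated prompt of 4,000 tokens versus a 1,000,000-token prompt containing the whole knowledge base.

```python
def input_cost(prompt_tokens, calls, price_per_million_tokens):
    """Estimated input-token cost of sending the same-sized prompt `calls` times."""
    return prompt_tokens * calls * price_per_million_tokens / 1_000_000

PRICE = 1.0   # hypothetical: dollars per million input tokens
CALLS = 20    # hypothetical: LLM invocations in one chat conversation

full_context = input_cost(1_000_000, CALLS, PRICE)  # whole knowledge base each call
curated = input_cost(4_000, CALLS, PRICE)           # only the retrieved documents

print(f"full context: ${full_context:.2f} per conversation")
print(f"curated:      ${curated:.2f} per conversation")
```

Under these assumed figures the full-context conversation costs 250 times as much as the curated one, before accounting for the extra latency of processing a much longer prompt.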

As database vendors increasingly promote vector databases and emphasize leveraging vector embeddings for similarity search, it is essential to recognize that while these techniques are valuable for document retrieval, they may not fully encompass the complexities and nuances of RAG. The ability of RAG to amalgamate diverse contextual information beyond simple textual similarity positions it as a crucial component for sophisticated language model applications.
