The limitations of LLMs, or why are we doing RAG?

June 17, 2024

This blog was co-authored by Phil Eaton, Bilge Ince, and Artjoms Iskovs.

Despite their powerful capabilities across many tasks, Large Language Models (LLMs) are not know-it-alls. If you've used ChatGPT or other models, you'll have experienced how they can’t reasonably answer questions about proprietary information. What’s worse, the problem isn't only that they don't know about proprietary information: they are unaware of their own limitations and, even if they were aware, they would have no way to access that information. That's where options like Retrieval Augmented Generation (RAG) come in, giving LLMs the ability to incorporate new and proprietary information into their answers.

Taking a step back

An LLM is an advanced form of artificial intelligence that understands and generates human-like text. It learns from extensive written material on various topics, enabling it to answer questions, write essays, summarize information, and engage in conversation. LLMs operate using a neural network, which is inspired by the human brain. This network consists of layers of nodes or "neurons" that process information. The connections between these neurons have "weights," which adjust as the model learns from the data it is trained on. This process enables the model to produce relevant and contextually appropriate text based on the input it receives.

Take ChatGPT for example. The models are GPT-4o, GPT-3.5 Turbo, etc. The context window holds the tokens produced by tokenizing the prompt text you type into ChatGPT combined with ChatGPT’s system prompt. The output is the text the model (GPT-4o, GPT-3.5 Turbo, etc.) produces. ChatGPT is only one of many applications that use LLMs, but it is helpful to use specific examples.

GPT-family models, and other general-purpose models, are excellent for general inquiries. However, they fall short when you need to ask specific questions about your company's internal data. Suppose a developer using ChatGPT with the GPT-4o model asks, "What were the key changes made in the last major release?" or "Are there any unresolved critical bugs related to the payment gateway?" The LLM won’t be able to provide accurate, up-to-date answers because it lacks access to the proprietary data stored in JIRA and git. So you need to get creative to address your needs.

Adapting to your domain

You could train a new model from scratch on your data. It may be a worthwhile investment if there is a specific task you expect the model to handle often. Otherwise, keep in mind that a general-purpose model can take months on expensive hardware to train. And since it takes months to train, the new model would not “know” about any data you had added or modified in the meantime.

Another approach involves fine-tuning a smaller model for tasks that demand specialized knowledge beyond the scope of general LLMs. Fine-tuning lets you customize a model for specific domains, tasks, or languages, yielding superior performance in these areas compared to general LLMs. It also allows you to address specific concerns and helps keep the model's outputs aligned with efforts to minimize bias.

However, fine-tuning a model requires access to high-quality data curated explicitly for the task and, more importantly, expertise in both machine learning and the domain, as well as ongoing work to maintain and scale the model.

An increasingly popular option is Retrieval Augmented Generation (RAG): automatically adding relevant (proprietary) data, by concatenating text, to a user’s prompt while running an LLM.

The context window

There are two challenges. First, LLM context windows can fit limited data. Second, you pay for the size of the context window. (Even if you run an LLM like llama3 on your own hardware, you still pay for context window size in terms of the amount of RAM you need.)

Context windows typically support a maximum of between 4,096 and 1M tokens. For a text prompt, a token is the same unit NLP has used for decades (a word, a piece of a word, or a punctuation mark), though different models tokenize input differently. For example, the quote “Diseased Nature often times breaks forth in strange eruptions.” has 9 words and a period. But GPT’s tokenizer produces 11 tokens.
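To make the gap between words and tokens concrete, here is a minimal Python sketch of a subword tokenizer. It is not GPT's tokenizer: the vocabulary `TOY_VOCAB` and the function `toy_tokenize` are made up for illustration, and a real tokenizer learns a far larger vocabulary from data, so its counts will differ from this one.

```python
import re

# Hypothetical subword vocabulary, chosen by hand for this one sentence.
# Real tokenizers learn tens of thousands of pieces from training data.
TOY_VOCAB = {
    "dis", "eased", "nature", "often", "times", "breaks",
    "forth", "in", "strange", "erupt", "ions", ".",
}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into subword tokens using greedy longest-match lookup.

    Characters that match no vocabulary piece become one-character tokens.
    """
    tokens = []
    # First split into runs of word characters and single punctuation marks.
    for chunk in re.findall(r"\w+|[^\w\s]", text.lower()):
        i = 0
        while i < len(chunk):
            # Try the longest vocabulary piece that starts at position i.
            for j in range(len(chunk), i, -1):
                if chunk[i:j] in vocab:
                    tokens.append(chunk[i:j])
                    i = j
                    break
            else:
                # No piece matched: fall back to a single character.
                tokens.append(chunk[i])
                i += 1
    return tokens


quote = "Diseased Nature often times breaks forth in strange eruptions."
tokens = toy_tokenize(quote)
print(len(quote.split()), "whitespace-separated words")
print(len(tokens), "toy tokens:", tokens)
```

With this toy vocabulary, "Diseased" and "eruptions" each split into two pieces, so the token count ends up higher than the word count. Swap in a different vocabulary and the count changes, which is why token counts are always specific to a model.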

So we rely on the base intelligence provided by a model, and we paste additional text as a prefix to the user's prompt to provide proprietary context.

The challenge, then, is deciding what text makes it into the prompt, because space in the context window is finite.
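One simple way to make that decision, sketched below in Python, is to walk through candidate documents in relevance order and keep each one that still fits in the remaining token budget. The helper names `estimate_tokens` and `pack_context` are hypothetical, and counting whitespace-separated words is only a crude stand-in for a model's real tokenizer.

```python
def estimate_tokens(text):
    """Crude token estimate. Assumption: one token per whitespace-separated word."""
    return len(text.split())


def pack_context(ranked_docs, user_prompt, max_tokens):
    """Pick documents, most relevant first, that fit in the context window.

    ranked_docs must already be sorted from most to least relevant.
    The user's prompt is always kept, so its tokens are reserved first.
    """
    budget = max_tokens - estimate_tokens(user_prompt)
    chosen = []
    for doc in ranked_docs:
        cost = estimate_tokens(doc)
        if cost > budget:
            # Too big for the remaining space; a shorter, lower-ranked
            # document may still fit, so keep looking.
            continue
        chosen.append(doc)
        budget -= cost
    return chosen


docs = [
    "release notes summary",
    "a much longer design document text",
    "bug list",
]
print(pack_context(docs, "what changed recently", max_tokens=9))
```

Skipping an oversized document instead of stopping is a design choice: it fills the window more completely, at the price of sometimes including a less relevant document ahead of a more relevant one that did not fit.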

Processing text in AI systems like ChatGPT incurs costs that directly correlate with the input size. Larger texts demand increased memory, processing power, and time for handling, inevitably leading to higher operational expenses. 

Retrieving relevant information

With RAG we split answering a user prompt into two stages. First, we decide on the most relevant, limited proprietary information for the user’s prompt. Second, we concatenate the most relevant information with the user’s initial prompt and feed it to a generative model to receive a response in return.
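The two stages can be sketched as one small Python function. This is an illustration under assumptions, not a reference implementation: `retrieve` and `generate` are placeholder callables standing in for whatever search method and generative model you use, and `build_prompt` and `answer_with_rag` are names invented for this example.

```python
def build_prompt(context_docs, user_prompt):
    """Concatenate retrieved text ahead of the user's original prompt."""
    return "\n\n".join(list(context_docs) + [user_prompt])


def answer_with_rag(user_prompt, retrieve, generate, k=3):
    """Run both RAG stages.

    retrieve(prompt, k) -> list of the k most relevant document strings
    generate(prompt)    -> the model's text response
    Both are supplied by the caller; they are assumptions of this sketch.
    """
    # Stage 1: pick a limited amount of relevant proprietary information.
    context_docs = retrieve(user_prompt, k)
    # Stage 2: prepend it to the prompt and hand the result to the model.
    return generate(build_prompt(context_docs, user_prompt))


# Stand-ins so the sketch runs without a database or a model.
def fake_retrieve(prompt, k):
    return ["Release 2.0 changed the billing API."][:k]


def fake_generate(prompt):
    return "MODEL SAW:\n" + prompt


print(answer_with_rag("What changed in the last release?", fake_retrieve, fake_generate))
```

Keeping retrieval and generation behind separate callables mirrors the split described above: either stage can be replaced without touching the other.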

The first step of RAG is Retrieval. Vector search is one option for retrieving the relevant proprietary documents. It finds semantically similar content by calculating how close vectors are to each other, so a search for “food” might pull up relevant proprietary documents about “bananas”. But any search method could work, including Lucene-style full-text search. The ideal search algorithm for RAG depends on your workload, but the popular choice today is vector similarity search.
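Here is a small Python sketch of what "proximity" can mean, using cosine similarity as the measure. The three-dimensional vectors are hand-picked, hypothetical values chosen so that “food” lands near “bananas”; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical document embeddings, invented for this example.
DOC_EMBEDDINGS = {
    "bananas": [0.9, 0.1, 0.0],
    "release notes": [0.1, 0.9, 0.2],
    "payment gateway": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of the query "food".
FOOD_QUERY = [1.0, 0.0, 0.1]


def nearest(query, embeddings, k=1):
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(
        embeddings,
        key=lambda name: cosine_similarity(query, embeddings[name]),
        reverse=True,
    )
    return ranked[:k]


print(nearest(FOOD_QUERY, DOC_EMBEDDINGS))
```

This version compares the query against every document, which is fine for a handful of vectors. Vector databases add indexes so the same kind of search stays fast over millions of rows.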

Vector similarity search requires two things: a database for storing, indexing, and searching vectors, and a method for transforming documents into vectors (called embeddings) that can be put into the database. LLMs themselves can be used to convert text to embeddings to be stored in a vector database. In this scenario you use the LLM only for generating embeddings, not for going all the way to generating text. Furthermore, the LLM you use to generate embeddings for the Retrieval part of a RAG application does not need to be the same LLM you use in the text generation stage.

For Postgres users, the pgvector extension lets Postgres itself serve as the vector database. We must set up a process to generate embeddings from our documents and store them with pgvector. Then, when given a user’s prompt, we can use pgvector’s vector similarity search to retrieve the most relevant document text and prepend that text to the user’s prompt.
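As a rough sketch of the retrieval half in Python, the snippet below keeps the SQL in strings and runs the search through a database connection. The `documents` table, its columns, the three-dimensional `vector(3)` column, and the helper names are all assumptions made for illustration. `conn` is expected to be a DB-API style connection such as one from psycopg, and the query embedding must come from the same model that embedded the stored documents.

```python
# Assumed schema, run once (for example through psql). The dimension in
# vector(3) is a placeholder and must match your embedding model's output.
SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id bigserial PRIMARY KEY,
    body text NOT NULL,
    embedding vector(3)
);
"""

# <=> is pgvector's cosine distance operator: smaller means more similar.
SEARCH_SQL = """
SELECT body
FROM documents
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(values):
    """Format numbers as pgvector's text form, for example '[1.0,0.5]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


def retrieve(conn, query_embedding, k=3):
    """Return the body text of the k documents closest to the query embedding."""
    cur = conn.cursor()
    try:
        cur.execute(SEARCH_SQL, (to_vector_literal(query_embedding), k))
        return [row[0] for row in cur.fetchall()]
    finally:
        cur.close()


print(to_vector_literal([1, 0.5, 0]))
```

The rows returned by `retrieve` are the text you would prepend to the user's prompt. Passing the embedding as a text literal and casting it with `::vector` avoids needing any pgvector-specific client library in this sketch.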

You can build RAG applications on top of LLMs with entirely open-source components and host them yourself. Or you can use a platform like EDB Postgres AI to manage the Postgres and pgvector parts, and to automatically generate (and update) embeddings on chosen columns of existing tables.
