Retrieval-augmented generation (RAG) is a method for adding a new data source to a large language model (LLM) without retraining it.
Retrieval-augmented generation (RAG) is the process of optimizing a large language model (LLM) for use in a specific context without completely retraining it, by giving it access to a knowledge base relevant to that context. RAG is a cost-effective way to quickly adapt an LLM to a specialized use case. It is one of the methods developers can use to fine-tune an artificial intelligence (AI) model.
LLMs are trained on massive amounts of data, but they may not be able to generate the specific information needed in some settings. They have a general understanding of how human language works, but do not always have specific expertise in a given topic area. RAG is one way to correct for this.
Imagine a car mechanic goes to work on a 1964 Chevy. While the mechanic may have lots of general expertise in car maintenance after thousands of hours of practice, they still need the owner's manual for the car to service it effectively. But the mechanic does not need to get re-certified as a mechanic; they can simply consult the manual, then apply the general knowledge of cars they already have.
RAG is similar. It gives an LLM a "manual" — or a knowledge base — so that the LLM can adapt its general knowledge to the specific use case. For instance, a general LLM may be able to generate content on how API queries work, but not necessarily be able to tell a user how to query one specific application's API. But with RAG, that LLM could be adapted for use with that application by linking it to the application's technical documentation.
Ordinarily when an LLM receives a query, it processes that query according to its preexisting parameters and training.
RAG enables the LLM to reference "external data" — a database not included in the LLM's training data set. So in contrast to the normal way an LLM functions, an LLM with RAG uses external data to enhance its responses — hence the name retrieval (retrieving external data first) augmented (improving responses with this data) generation (creating a response).
Suppose the car mechanic from the example above wanted to use an LLM instead of consulting owners' manuals directly. Using RAG, the LLM could incorporate consultation of owners' manuals into its own processes. Even though the LLM was likely not trained on classic car manuals (or, if such information was included in its training data, it made up only a tiny fraction), with RAG the LLM can produce relevant and accurate responses.
For the LLM to be able to use and query the external data source (such as car manuals), the data is first converted into vectors and then stored in a vector database. This process uses machine learning models to generate mathematical representations, or embeddings, of the items in a data set. (Each "vector" is an array of numbers, like [-0.41522345,0.97685323...].)
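A minimal sketch of this indexing step is shown below, assuming the open-source sentence-transformers library as the embedding model and a plain in-memory list standing in for the vector database. The model name and the example manual text are illustrative assumptions, not part of any particular RAG product.

```python
from sentence_transformers import SentenceTransformer

# Assumed embedding model; any model that maps text to vectors would work here.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Chunks of the external data source (e.g., pages of a car owner's manual).
manual_chunks = [
    "1964 Chevy: the recommended engine oil is SAE 10W-30.",
    "1964 Chevy: the spark plug gap is 0.035 inches.",
    "1964 Chevy: check the ignition timing with the vacuum advance disconnected.",
]

# Each chunk is converted into a vector (an array of numbers) and stored
# alongside its original text. A real deployment would use a vector database
# instead of this in-memory list.
vector_index = [
    {"text": chunk, "vector": vec}
    for chunk, vec in zip(manual_chunks, embedder.encode(manual_chunks))
]
```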
When a user sends a prompt, the LLM converts that prompt into a vector, then searches the vector database created from the external data source to find relevant information. That information is added to the prompt as additional context, and the augmented prompt is then run through the model as usual to generate a response.
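Continuing the sketch above, the retrieval and augmentation steps might look like the following. The similarity function, the number of retrieved chunks, and the `call_llm` placeholder are assumptions for illustration; a production system would typically rely on a vector database's own search and a real model API.

```python
import numpy as np

def cosine_similarity(a, b):
    # Standard cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query, index, top_k=2):
    # Embed the user's prompt with the same model used to build the index,
    # then rank stored chunks by how similar they are to the prompt.
    query_vec = embedder.encode([query])[0]
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item["vector"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:top_k]]

user_prompt = "What spark plug gap should I use on a 1964 Chevy?"
context_chunks = retrieve(user_prompt, vector_index)

# The retrieved chunks are added to the prompt as extra context before
# the model generates a response. call_llm is a hypothetical stand-in for
# whatever LLM API is actually in use.
augmented_prompt = (
    "Answer the question using the context below.\n\n"
    "Context:\n" + "\n".join(context_chunks) +
    "\n\nQuestion: " + user_prompt
)
# response = call_llm(augmented_prompt)
```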
The pros of using RAG for fine-tuning an LLM include:

- Cost-effectiveness: connecting an LLM to an external knowledge base is far cheaper and faster than retraining the model.
- Up-to-date responses: the knowledge base can be updated at any time without changing the model itself.
- More grounded answers: drawing on retrieved documents reduces the chance that the model fabricates ("hallucinates") information.

Some of the potential downsides are:

- Added infrastructure: the external data must be converted into vectors and kept in a vector database, which has to be maintained and kept in sync with the source data.
- Added latency: every query involves a retrieval step before the model generates a response.
- Dependence on data quality: if the knowledge base is incomplete or outdated, the LLM's responses will be too.
A RAG chatbot is an LLM-based chatbot that has been specialized for specific use cases through RAG — being connected to one or more sources of external data that are relevant to the context in which the chatbot operates. A RAG chatbot for use in an auto garage would have access to automobile documentation; this would be more useful for the mechanics in the garage than asking questions of a general-use LLM chatbot.
Low-rank adaptation (LoRA) is another way to fine-tune a model — meaning, adapt it to a specific context without completely retraining it. LoRA, however, does involve adjusting the model's parameters, whereas RAG does not alter the model's parameters at all. Learn more about LoRA here.
AutoRAG sets up and manages RAG pipelines for developers. It connects the tools needed for indexing, retrieval, and generation, and keeps everything up to date by syncing data with the index regularly. Once set up, AutoRAG indexes content in the background and responds to queries in real time. Learn how AutoRAG works.