Wednesday, 18 March 2026
Building RAG Systems with Transformers
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the capabilities of large language models. By combining the strengths of retrieval systems with generative models, RAG systems can produce more accurate, factual, and contextually relevant responses. This approach is particularly valuable when dealing with domain-specific knowledge or when up-to-date information is required.
In this post, you will explore how to build a basic RAG system using models from the Hugging Face library. You’ll build each system component, from document indexing to retrieval and generation, and implement a complete end-to-end solution. Specifically, you will learn:
The RAG architecture and its components
How to build a document indexing and retrieval system
Photo by Tina Nord. Some rights reserved.
Overview
This post is divided into five parts:
Understanding the RAG architecture
Building the Document Indexing System
Implementing the Retrieval System
Implementing the Generator
Building the Complete RAG System
Understanding the RAG Architecture
A RAG system consists of two main components:
Retriever: Responsible for finding relevant documents or passages from a knowledge base given a query.
Generator: Uses the retrieved documents and the original query to generate a coherent and informative response.
Each of these components has many fine details. You need RAG because the generator alone (i.e., the language model) can produce fluent but factually wrong responses, a failure mode known as hallucination. The retriever supplies grounding context that helps the generator stay accurate.
This approach combines generative models’ broad language understanding capabilities with the ability to access specific information from a knowledge base. This results in responses that are both fluent and factually accurate.
Let’s implement each component of a RAG system step by step.
Building the Document Indexing System
The first step in creating a RAG system is to build a document indexing system. This system encodes documents into dense vector representations and stores them in a database, so that documents can later be retrieved by contextual similarity. This means you need to search by vector similarity metrics rather than exact matches. This is a key point: not all database systems support this kind of search, so not all of them can back a document indexing system.
Of course, you could collect documents, encode them into vector representations, and keep them in memory. When retrieval is requested, you could compute the similarity one by one to find the closest match. However, checking each vector in a loop is inefficient and not scalable. FAISS is a library that is optimized for this task. To install FAISS, you can compile it from source or use the pre-compiled version from PyPI:
pip install faiss-cpu
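For contrast, the naive in-memory approach described above can be sketched with NumPy; the 2-D vectors here are toy stand-ins for real embeddings:

```python
import numpy as np

def naive_search(query_vec, doc_vecs, k=3):
    """Brute-force nearest neighbors by cosine similarity.

    Scores every document vector against the query in one O(n) pass;
    fine for a handful of documents, too slow at scale.
    """
    # Normalize so the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                   # one similarity score per document
    top = np.argsort(-sims)[:k]    # indices of the k best matches
    return top, sims[top]

# Toy "embeddings": two orthogonal documents and one in between
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx, scores = naive_search(np.array([1.0, 0.2]), vecs, k=2)
```

This works, but every query touches every vector, which is exactly the inefficiency FAISS is built to avoid.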
In the following, you’ll create a language model to encode documents into dense vector representations and store them in a FAISS index for efficient retrieval:
dimension = document_embeddings.shape[1]  # Dimension of the embeddings
index = faiss.IndexFlatL2(dimension)      # Using L2 (Euclidean) distance
index.add(document_embeddings)            # Add embeddings to the index
print(f"Created index with {index.ntotal} documents")
The key part of this code is the generate_embedding() function. It takes a list of documents, encodes them through the model, and returns a dense vector representation using mean pooling over all token embeddings from each document. The documents do not need to be long and complete; a sentence or a paragraph is expected, because the models have a context window limit. Moreover, as you will see later in another example, a very long document is not ideal for RAG.
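As a sketch, generate_embedding() might look like the following, assuming the sentence-transformers/all-MiniLM-L6-v2 checkpoint used in this post and a mean-pooling strategy; the example documents are illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch of the generate_embedding() step described above.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def generate_embedding(texts):
    """Encode a list of short texts into dense vectors via mean pooling."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool token embeddings, ignoring padding positions
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return (summed / counts).numpy()

documents = ["BERT is a transformer-based model.",
             "FAISS is a vector search library."]
document_embeddings = generate_embedding(documents)  # shape (2, 384)
```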
You used a pre-trained Sentence Transformer model, sentence-transformers/all-MiniLM-L6-v2, which is specifically designed for generating sentence embeddings. You do not keep the original document in the FAISS index; you only keep the embedding vectors. You pre-build the L2 distance index among these vectors for efficient similarity search.
You may modify this code for different implementations of the RAG system. For example, the dense vector representation here is obtained by mean pooling, but you could instead use only the first token's embedding, since the tokenizer prepends the [CLS] token to each sentence and the model is trained to produce a context embedding at that special position. Moreover, L2 distance is used here because you declared the FAISS index with the L2 metric. FAISS does not offer a cosine metric directly, but L2 and cosine distance are closely related. Note that, with normalized vectors a and b,

||a − b||² = ||a||² + ||b||² − 2(a · b) = 2 − 2 cos θ

Therefore, L2 distance is equivalent to cosine distance when the vectors are normalized (as long as you remember that as dissimilarity increases, the squared L2 distance runs from 0 to 4, while the cosine similarity decreases from +1 to −1). If you intend to use cosine similarity instead, modify the code to normalize the embeddings and use an inner-product index.
If you run this code, you will see the following output:
Query: What is BERT?
Document 1 (Distance: 23.7060):
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based
model designed to understand the context of a word based on its surroundings.
Document 2 (Distance: 28.0794):
RoBERTa is an optimized version of BERT with improved training methodology and
more training data.
Document 3 (Distance: 29.5908):
DistilBERT is a smaller, faster version of BERT that retains 97% of its language
understanding capabilities.
In the function retrieve_documents(), you provide the query string, the FAISS index, and the document collection. You then generate the embedding for the query just like you did for the documents. Then, you leverage the search() method of the FAISS index to find the k most similar documents to the query embedding. The search() method returns two arrays:
distances: The distances between the query embedding and the indexed embeddings. Since this is how you defined the index, these are the L2 distances.
indices: The indices of the indexed embeddings that are most similar to the query embedding, matching the distances array.
You can use these arrays to retrieve the most similar documents from the original collection. Here, you use the indices to get the documents from the list. Afterward, you print the retrieved documents along with their distances from the query in the embedding space in descending order of relevance or increasing distance.
Note that the document’s context vector is supposed to represent the entire document. Therefore, the distance between the query and the document may be large if the document contains a lot of information. Ideally, you want the documents to be focused and concise. If you have a long text, you may want to split it into multiple documents to make the RAG system more accurate.
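One simple way to split a long text is a word-count window, sketched below; real systems often split on sentences or tokens instead, and the parameters here are arbitrary choices:

```python
def split_into_chunks(text, max_words=50, overlap=10):
    """Split a long text into overlapping word-window chunks.

    The overlap keeps context that straddles a chunk boundary retrievable
    from both neighboring chunks.
    """
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

chunks = split_into_chunks("word " * 120, max_words=50, overlap=10)
```

Each chunk is then indexed as its own document, so retrieval can home in on the relevant passage rather than a whole article.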
This retrieval system forms the first component of our RAG architecture. Given a user query, it allows us to find relevant information from our knowledge base. There are many other ways to implement the same functionality, but this highlights the key idea of vector search.
Implementing the Generator
Next, let’s implement the generator component of our RAG system.
It is a prompt engineering problem. While the user provides a query, you first retrieve the most relevant documents from the retriever and create a new prompt that includes the user’s query and the retrieved documents as context. Then, you use a pre-trained language model to generate a response based on the new prompt.
This is the generator component of our RAG system. You instantiate a pre-trained T5 model (the small version here, but you can pick a larger one, or a different model that fits your system). T5 is a sequence-to-sequence model that generates a new sequence from a given sequence. If you use a different kind of model, such as a causal LM, you may need to change the prompt format for it to work well.
In the generate_response() function, you combine the query and the retrieved documents into a single prompt. Then, you use the T5 model to generate a response. You can adjust the generation parameters to make it work better. In the above, only beam search is used for simplicity. The model’s output is then decoded to a text string as the response. Since you combined multiple documents into a single prompt, you need to be careful that the prompt does not exceed the context window of the model.
The generator leverages the information from the retrieved documents to produce a fluent and factually accurate response. The model behaves vastly differently when you just pose the query without context.
Building the Complete RAG System
That’s all you need to build a basic RAG system. Let’s create a function to wrap up the retrieval and generation components:
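A minimal sketch of that wrapper; the retriever and generator are passed in as plain callables, and the toy stand-ins below are hypothetical placeholders so the wiring can be exercised without loading any models:

```python
def rag_pipeline(query, index, documents, retrieve_fn, generate_fn, k=3):
    """Retrieve the top-k documents for the query, then generate an answer."""
    retrieved = retrieve_fn(query, index, documents, k=k)
    docs_only = [doc for doc, _distance in retrieved]
    return generate_fn(query, docs_only)

# Toy stand-ins for the real retriever and generator
def fake_retrieve(query, index, documents, k=3):
    return [(doc, 0.0) for doc in documents[:k]]

def fake_generate(query, docs):
    return f"Answer to '{query}' using {len(docs)} document(s)."

answer = rag_pipeline("What is BERT?", None, ["BERT is a model."],
                      fake_retrieve, fake_generate, k=1)
```

In a real run, you would pass in the FAISS-backed retriever and the T5-backed generator built earlier instead of the stand-ins.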
You can see that the queries are answered one by one in a loop. The set of documents, however, is prepared in advance and reused for all queries. This is how a RAG system typically works.
This code is self-contained. All the documents and queries are defined in the code. This is a starting point, and you may extend it for new features, such as saving the indexed documents in a file that you can load later without re-indexing every time.
This post explored building a Retrieval-Augmented Generation (RAG) system using transformer models from the Hugging Face library. We’ve implemented each system component, from document indexing to retrieval and generation, and combined them into a complete end-to-end solution.
RAG systems represent a powerful approach to enhancing the capabilities of language models by grounding them in external knowledge. RAG systems can produce more accurate, factual, and contextually relevant responses by retrieving relevant information and incorporating it into the generation process.