Wednesday, 18 March 2026
Building RAG Systems with Transformers
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the capabilities of large language models. By combining the strengths of retrieval systems with generative models, RAG systems can produce more accurate, factual, and contextually relevant responses. This approach is particularly valuable when dealing with domain-specific knowledge or when up-to-date information is required.
In this post, you will explore how to build a basic RAG system using models from the Hugging Face library. You’ll build each system component, from document indexing to retrieval and generation, and implement a complete end-to-end solution. Specifically, you will learn:
The RAG architecture and its components
How to build a document indexing and retrieval system
Photo by Tina Nord. Some rights reserved.
Overview
This post is divided into five parts:
Understanding the RAG architecture
Building the Document Indexing System
Implementing the Retrieval System
Implementing the Generator
Building the Complete RAG System
Understanding the RAG Architecture
A RAG system consists of two main components:
Retriever: Responsible for finding relevant documents or passages from a knowledge base given a query.
Generator: Uses the retrieved documents and the original query to generate a coherent and informative response.
Each of these components has many fine details. You need RAG because the generator alone (i.e., the language model) can produce fluent but factually wrong responses, a failure mode known as hallucination. The retriever supplies grounding context that helps the generator stay accurate.
This approach combines generative models’ broad language understanding capabilities with the ability to access specific information from a knowledge base. This results in responses that are both fluent and factually accurate.
Let’s implement each component of a RAG system step by step.
Building the Document Indexing System
The first step in creating a RAG system is to build a document indexing system. This system encodes documents into dense vector representations and stores them in a database, so that documents can later be retrieved by contextual similarity. This means you need to search by vector similarity metrics rather than exact matches. This is a key point: not all database systems support this kind of search, so not all of them can back a document indexing system.
Of course, you could collect documents, encode them into vector representations, and keep them in memory. When retrieval is requested, you could compute the similarity one by one to find the closest match. However, checking each vector in a loop is inefficient and not scalable. FAISS is a library that is optimized for this task. To install FAISS, you can compile it from source or use the pre-compiled version from PyPI:
pip install faiss-cpu
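For contrast, the naive in-memory approach described above can be sketched with NumPy; the 2-D vectors here are toy stand-ins for real embeddings:

```python
import numpy as np

def naive_search(query_vec, doc_vecs, k=3):
    """Brute-force nearest neighbors by cosine similarity.

    Scores every document vector against the query in one O(n) pass;
    fine for a handful of documents, too slow at scale.
    """
    # Normalize so the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                   # one similarity score per document
    top = np.argsort(-sims)[:k]    # indices of the k best matches
    return top, sims[top]

# Toy "embeddings": two orthogonal documents and one in between
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
idx, scores = naive_search(np.array([1.0, 0.2]), vecs, k=2)
```

This works, but every query touches every vector, which is exactly the inefficiency FAISS is built to avoid.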
In the following, you’ll create a language model to encode documents into dense vector representations and store them in a FAISS index for efficient retrieval:
dimension = document_embeddings.shape[1]  # Dimension of the embeddings
index = faiss.IndexFlatL2(dimension)      # Using L2 (Euclidean) distance
index.add(document_embeddings)            # Add embeddings to the index
print(f"Created index with {index.ntotal} documents")
The key part of this code is the generate_embedding() function. It takes a list of documents, encodes them through the model, and returns a dense vector representation using mean pooling over all token embeddings from each document. The documents do not need to be long and complete; a sentence or a paragraph is expected, because the models have a context window limit. Moreover, as you will see later in another example, a very long document is not ideal for RAG.
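As a sketch, generate_embedding() might look like the following, assuming the sentence-transformers/all-MiniLM-L6-v2 checkpoint used in this post and a mean-pooling strategy; the example documents are illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Sketch of the generate_embedding() step described above.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def generate_embedding(texts):
    """Encode a list of short texts into dense vectors via mean pooling."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool token embeddings, ignoring padding positions
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return (summed / counts).numpy()

documents = ["BERT is a transformer-based model.",
             "FAISS is a vector search library."]
document_embeddings = generate_embedding(documents)  # shape (2, 384)
```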
You used a pre-trained Sentence Transformer model, sentence-transformers/all-MiniLM-L6-v2, which is specifically designed for generating sentence embeddings. You do not keep the original document in the FAISS index; you only keep the embedding vectors. You pre-build the L2 distance index among these vectors for efficient similarity search.
You may modify this code for different implementations of the RAG system. For example, the dense vector representation here is obtained by mean pooling, but you could instead use only the first token's embedding, since the tokenizer prepends the [CLS] token to each sentence and the model is trained to produce a context embedding at that special position. Moreover, L2 distance is used here because you declared the FAISS index with the L2 metric. FAISS does not offer a cosine metric directly, but L2 and cosine distance are closely related. Note that, with normalized vectors a and b,

||a − b||² = ||a||² + ||b||² − 2(a · b) = 2 − 2 cos θ

Therefore, L2 distance is equivalent to cosine distance when the vectors are normalized (as long as you remember that as dissimilarity increases, the squared L2 distance runs from 0 to 4, while the cosine similarity decreases from +1 to −1). If you intend to use cosine similarity instead, modify the code to normalize the embeddings and use an inner-product index.
If you run this code, you will see the following output:
Query: What is BERT?
Document 1 (Distance: 23.7060):
BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based
model designed to understand the context of a word based on its surroundings.
Document 2 (Distance: 28.0794):
RoBERTa is an optimized version of BERT with improved training methodology and
more training data.
Document 3 (Distance: 29.5908):
DistilBERT is a smaller, faster version of BERT that retains 97% of its language
understanding capabilities.
In the function retrieve_documents(), you provide the query string, the FAISS index, and the document collection. You then generate the embedding for the query just like you did for the documents. Then, you leverage the search() method of the FAISS index to find the k most similar documents to the query embedding. The search() method returns two arrays:
distances: The distances between the query embedding and the indexed embeddings. Since this is how you defined the index, these are the L2 distances.
indices: The indices of the indexed embeddings that are most similar to the query embedding, matching the distances array.
You can use these arrays to retrieve the most similar documents from the original collection. Here, you use the indices to get the documents from the list. Afterward, you print the retrieved documents along with their distances from the query in the embedding space in descending order of relevance or increasing distance.
Note that the document’s context vector is supposed to represent the entire document. Therefore, the distance between the query and the document may be large if the document contains a lot of information. Ideally, you want the documents to be focused and concise. If you have a long text, you may want to split it into multiple documents to make the RAG system more accurate.
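One simple way to split a long text is a word-count window, sketched below; real systems often split on sentences or tokens instead, and the parameters here are arbitrary choices:

```python
def split_into_chunks(text, max_words=50, overlap=10):
    """Split a long text into overlapping word-window chunks.

    The overlap keeps context that straddles a chunk boundary retrievable
    from both neighboring chunks.
    """
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

chunks = split_into_chunks("word " * 120, max_words=50, overlap=10)
```

Each chunk is then indexed as its own document, so retrieval can home in on the relevant passage rather than a whole article.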
This retrieval system forms the first component of our RAG architecture. Given a user query, it allows us to find relevant information from our knowledge base. There are many other ways to implement the same functionality, but this highlights the key idea of vector search.
Implementing the Generator
Next, let’s implement the generator component of our RAG system.
It is a prompt engineering problem. While the user provides a query, you first retrieve the most relevant documents from the retriever and create a new prompt that includes the user’s query and the retrieved documents as context. Then, you use a pre-trained language model to generate a response based on the new prompt.
This is the generator component of our RAG system. You instantiate a pre-trained T5 model (the small version here, but you can pick a larger one, or a different model that fits your system). T5 is a sequence-to-sequence model that generates a new sequence from a given sequence. If you use a different kind of model, such as a causal LM, you may need to change the prompt format for it to work well.
In the generate_response() function, you combine the query and the retrieved documents into a single prompt. Then, you use the T5 model to generate a response. You can adjust the generation parameters to make it work better. In the above, only beam search is used for simplicity. The model’s output is then decoded to a text string as the response. Since you combined multiple documents into a single prompt, you need to be careful that the prompt does not exceed the context window of the model.
The generator leverages the information from the retrieved documents to produce a fluent and factually accurate response. The model behaves vastly differently when you just pose the query without context.
Building the Complete RAG System
That’s all you need to build a basic RAG system. Let’s create a function to wrap up the retrieval and generation components:
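A minimal sketch of that wrapper; the retriever and generator are passed in as plain callables, and the toy stand-ins below are hypothetical placeholders so the wiring can be exercised without loading any models:

```python
def rag_pipeline(query, index, documents, retrieve_fn, generate_fn, k=3):
    """Retrieve the top-k documents for the query, then generate an answer."""
    retrieved = retrieve_fn(query, index, documents, k=k)
    docs_only = [doc for doc, _distance in retrieved]
    return generate_fn(query, docs_only)

# Toy stand-ins for the real retriever and generator
def fake_retrieve(query, index, documents, k=3):
    return [(doc, 0.0) for doc in documents[:k]]

def fake_generate(query, docs):
    return f"Answer to '{query}' using {len(docs)} document(s)."

answer = rag_pipeline("What is BERT?", None, ["BERT is a model."],
                      fake_retrieve, fake_generate, k=1)
```

In a real run, you would pass in the FAISS-backed retriever and the T5-backed generator built earlier instead of the stand-ins.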
You can see that the queries are answered one by one in a loop. The set of documents, however, is prepared in advance and reused for all queries. This is how a RAG system typically works.
This code is self-contained. All the documents and queries are defined in the code. This is a starting point, and you may extend it for new features, such as saving the indexed documents in a file that you can load later without re-indexing every time.
This post explored building a Retrieval-Augmented Generation (RAG) system using transformer models from the Hugging Face library. We’ve implemented each system component, from document indexing to retrieval and generation, and combined them into a complete end-to-end solution.
RAG systems represent a powerful approach to enhancing the capabilities of language models by grounding them in external knowledge. RAG systems can produce more accurate, factual, and contextually relevant responses by retrieving relevant information and incorporating it into the generation process.