
Saturday, 21 March 2026

Build an Inference Cache to Save Costs in High-Traffic LLM Apps

 

In this article, you will learn how to add both exact-match and semantic inference caching to large language model applications to reduce latency and API costs at scale.

Topics we will cover include:

  • Why repeated queries in high-traffic apps waste time and money.
  • How to build a minimal exact-match cache and measure the impact.
  • How to implement a semantic cache with embeddings and cosine similarity.

Alright, let’s get to it.


Introduction

Large language models (LLMs) are widely used in applications like chatbots, customer support, code assistants, and more. These applications often serve millions of queries per day, and in high-traffic apps it's very common for many users to ask the same or similar questions. Is it really smart to call the LLM every single time, when each call costs money and adds latency to responses? Logically, no.

Take a customer service bot as an example. Thousands of users might ask questions every day, and many of those questions are repeated:

  • “What’s your refund policy?”
  • “How do I reset my password?”
  • “What’s the delivery time?”

If every single query is sent to the LLM, you’re just burning through your API budget unnecessarily. Each repeated request costs the same, even though the model has already generated that answer before. That’s where inference caching comes in. You can think of it as memory where you store the most common questions and reuse the results. In this article, I’ll walk you through a high-level overview with code. We’ll start with a single LLM call, simulate what high-traffic apps look like, build a simple cache, and then take a look at a more advanced version you’d want in production. Let’s get started.

Setup

Install dependencies. I am using Google Colab for this demo. We’ll use the OpenAI Python client:

Set your OpenAI API key:
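After installing the client with `pip install openai`, a minimal sketch of the key setup (read the key from the environment rather than hard-coding it; `"sk-placeholder"` below is just an illustrative fallback, not a real key):

```python
import os

# The OpenAI client reads OPENAI_API_KEY from the environment by default.
# Export it in your shell (or a Colab secret) before running the demo.
api_key = os.environ.get("OPENAI_API_KEY", "sk-placeholder")
```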

Step 1: A Simple LLM Call

This function sends a prompt to the model and prints how long it takes:
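A sketch of that function. So the snippet runs offline, a `fake_llm` stand-in (which just sleeps to mimic network and inference latency) replaces the real `client.chat.completions.create(...)` call; the model name in the comment is an assumption:

```python
import time

def fake_llm(prompt):
    # Stand-in for a real call such as
    # client.chat.completions.create(model="gpt-4o-mini", messages=[...]).
    time.sleep(0.2)  # mimic network + inference latency
    return f"Answer to: {prompt}"

def ask_llm(prompt, llm_fn=fake_llm):
    start = time.time()
    answer = llm_fn(prompt)
    elapsed = time.time() - start
    print(f"[{elapsed:.2f}s] {answer}")
    return answer, elapsed
```

Swapping `llm_fn` for a function that calls the real API gives you the exact behavior described in the article.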

Output:

This works fine for one call. But what if the same question is asked over and over?

Step 2: Simulating Repeated Questions

Let’s create a small list of user queries. Some are repeated, some are new:

Let’s see what happens if we call the LLM for each:
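A self-contained sketch of the simulation (the `fake_llm` stand-in replaces a paid API call; the query list mirrors the examples above):

```python
import time

def fake_llm(prompt):
    time.sleep(0.1)  # stand-in for a paid, slow API call
    return f"Answer to: {prompt}"

queries = [
    "What is your refund policy?",
    "How do I reset my password?",
    "What is your refund policy?",   # exact repeat
    "What's the delivery time?",
]

calls = 0
for q in queries:
    fake_llm(q)
    calls += 1
print(f"LLM was called {calls} times for {len(queries)} queries")
```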

Output:

Every time, the LLM is called again. Even though two queries are identical, we’re paying for both. With thousands of users, these costs can skyrocket.

Step 3: Adding an Inference Cache (Exact Match)

We can fix this with a dictionary-based cache as a naive solution:
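A minimal sketch of the dictionary cache (again using an offline stand-in for the API call, with a counter so you can see how many paid calls actually happen):

```python
import time

cache = {}
llm_calls = 0

def fake_llm(prompt):
    # Stand-in for the real API call; the counter tracks paid calls.
    global llm_calls
    llm_calls += 1
    time.sleep(0.1)
    return f"Answer to: {prompt}"

def ask_llm_cached(prompt):
    if prompt in cache:           # cache hit: free and near-instant
        return cache[prompt]
    answer = fake_llm(prompt)     # cache miss: pay for one call
    cache[prompt] = answer
    return answer

ask_llm_cached("What is your refund policy?")
ask_llm_cached("What is your refund policy?")  # served from cache
```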

Output:

Now:

  • The first time “What is your refund policy?” is asked, it calls the LLM.
  • The second time, it instantly retrieves from cache.

This saves cost and reduces latency dramatically.

Step 4: The Problem with Exact Matching

Exact matching works only when the query text is identical. Let’s see an example:
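To see the miss concretely, here is a tiny self-contained sketch (a counter stands in for the paid API call):

```python
cache = {}
llm_calls = 0

def ask_llm_cached(prompt):
    global llm_calls
    if prompt in cache:
        return cache[prompt]
    llm_calls += 1                # stand-in for a paid LLM call
    cache[prompt] = f"Answer to: {prompt}"
    return cache[prompt]

ask_llm_cached("What is your refund policy?")
ask_llm_cached("What's your refund policy?")   # same meaning, different text
print(f"LLM calls: {llm_calls}")               # both queries hit the API
```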

Output:

Both queries ask about refunds, but since the text is slightly different, our cache misses. That means we still pay for the LLM. This is a big problem in the real world because users phrase questions differently.

Step 5: Semantic Caching with Embeddings

To fix this, we can use semantic caching. Instead of checking if text is identical, we check if queries are similar in meaning. We can use embeddings for this:
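A sketch of a semantic cache. To keep it runnable offline, a toy bag-of-words `embed()` stands in for a real embedding model (e.g. OpenAI's `text-embedding-3-small`); the 0.6 threshold is an illustrative choice you would tune:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding" standing in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

semantic_cache = []   # list of (embedding, answer) pairs
THRESHOLD = 0.6       # similarity needed to count as a hit (tune this)

def lookup(query):
    q_emb = embed(query)
    for emb, answer in semantic_cache:
        if cosine(q_emb, emb) >= THRESHOLD:
            return answer
    return None       # miss: call the LLM, then store() the result

def store(query, answer):
    semantic_cache.append((embed(query), answer))

store("What is your refund policy?", "Refunds within 30 days.")
hit = lookup("What's your refund policy?")   # reworded, still a hit
```

With real embeddings the structure is identical; only `embed()` changes, and the linear scan becomes a vector-index lookup at scale.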

Output:

Even though the second query is worded differently, the semantic cache recognizes its similarity and reuses the answer.

Conclusion

If you’re building customer support bots, AI agents, or any high-traffic LLM app, caching should be one of the first optimizations you put in place.

  • Exact cache saves cost for identical queries.
  • Semantic cache saves cost for meaningfully similar queries.
  • Together, they can massively reduce API calls in high-traffic apps.

In real-world production apps, you’d store embeddings in a vector database like FAISS, Pinecone, or Weaviate for fast similarity search. But even this small demo shows how much cost and time you can save.

The Machine Learning Practitioner’s Guide to Agentic AI Systems

 

In this article, you will learn how practitioners can evolve from traditional machine learning workflows to designing, building, and shipping production-ready agentic AI systems.

Topics we will cover include:

  • What makes an AI system “agentic” and why that matters for practitioners.
  • The core architectural patterns (ReAct, Plan-and-Execute, Reflexion) and when to use each.
  • Practical frameworks, projects, and resources to develop portfolio-ready agent skills.

Let’s not waste any more time.


Introduction

Agentic artificial intelligence (AI) represents the most significant shift in machine learning since deep learning transformed the field. Rather than building reactive tools that respond to prompts, practitioners now design autonomous systems that plan, reason, and act independently to achieve complex goals. This transformation is reshaping how we approach machine learning problems, from simple classification tasks to sophisticated multi-step workflows that require strategic thinking and tool use.

For machine learning and data science practitioners, this evolution builds naturally on your existing foundation. The core skills you’ve developed — prompt engineering, working with large language models (LLMs), building retrieval-augmented generation (RAG) systems — are now the building blocks for creating agentic systems. The transition requires learning new architectural patterns and frameworks, but you’re starting from a position of strength.

In this guide, you’ll discover a step-by-step approach to transition from traditional machine learning to agentic AI. You’ll learn the core concepts, explore the most effective frameworks, access the best learning resources, and understand how to build production-ready agents that solve real problems. This guide is designed for practitioners who want results, not just theory.

Grounding Yourself In The Basics

Before diving into agent frameworks, you need to understand what makes AI “agentic” and why it matters.

Agentic AI refers to autonomous systems that pursue goals independently through planning, reasoning, tool use, and memory, rather than simply responding to prompts. While traditional LLMs are reactive (you ask, they answer), agentic systems proactively break down complex tasks, make decisions, use tools, learn from feedback, and adapt their approach without constant human guidance.

If you’re already working with LLMs, you have exactly the foundation you need. Agentic AI builds directly on prompt engineering, RAG systems, and LLM applications. If you need a refresher, check out our guides on prompt engineering, our RAG series, and LLM applications.

Start here (FREE): Agentic AI with Andrew Ng. This is your best first step. It’s free during the beta period and teaches core design patterns from a leading expert.

Learning The Core Architectural Patterns

The key to building effective agents is understanding how they think and act. There are three foundational architectures every practitioner should know.

ReAct (Reasoning and Acting) is the most common starting pattern. The agent alternates between reasoning about what to do, taking an action with a tool, observing the result, and repeating until the task is complete. It’s simple to implement and works well for straightforward tasks, but it can be expensive because it requires an LLM call for each step.

Plan-and-Execute separates planning from execution. The agent first creates a complete multi-step plan, then executes each step (often with smaller, cheaper models), and adjusts the plan if needed. This approach is frequently faster and cheaper than ReAct for complex workflows, making it a go-to choice for production systems in 2025.

Reflexion adds self-improvement through linguistic feedback. The agent explicitly critiques its own responses, maintains memory of past attempts, and refines its approach based on failures. It’s especially valuable for research-intensive and high-stakes applications where correctness matters more than speed.

Understanding these patterns helps you choose the right architecture for your use case. Simple customer service queries? ReAct works great. Complex multi-step workflows like data analysis pipelines? Plan-and-Execute. Research agents that need accuracy? Reflexion.
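The ReAct loop described above can be sketched in a few lines of Python. This is a toy illustration of the control flow only, not any framework's API: the `policy` callable stands in for the LLM's reasoning step, and the function names are hypothetical:

```python
def react_agent(question, tools, policy, max_steps=5):
    """Minimal ReAct loop: the policy (an LLM in practice) either
    picks a tool action or finishes with an answer."""
    history = []
    for _ in range(max_steps):
        decision = policy(question, history)       # reason
        if decision[0] == "finish":
            return decision[1]
        _, tool, tool_input = decision             # act
        observation = tools[tool](tool_input)      # observe
        history.append((tool, tool_input, observation))
    return None

# Scripted policy standing in for the LLM, plus one calculator tool.
def policy(question, history):
    if not history:
        return ("act", "calc", "6 * 7")
    return ("finish", f"The result is {history[-1][2]}")

tools = {"calc": lambda expr: eval(expr)}  # toy tool; never eval untrusted input
answer = react_agent("What is 6 * 7?", tools, policy)
```

Each pass through the loop is one LLM call in a real system, which is exactly why ReAct gets expensive on long tasks.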

Learn more (FREE): Take the AI Agentic Design Patterns with AutoGen course on DeepLearning.AI to see these patterns implemented hands-on.

Choosing Your Framework And Learning It Deeply

This is where theory meets practice. You need to pick a framework and build real systems with it. The space has three dominant players in 2025: LangGraph, CrewAI, and AutoGen. Each framework serves different needs.

LangGraph is a standard for production systems. It provides fine-grained control through graph-based workflows, built-in state management, and excellent observability through LangGraph Studio and LangSmith. If you need complex, stateful workflows with detailed monitoring, this is your framework. The learning curve is steeper, but it’s worth it for professional deployment.

CrewAI is the fastest way to get started with multi-agent systems. Its role-based design makes it intuitive. You define agents with specific personas and responsibilities, assign tasks, and let them collaborate. It’s an excellent fit for content creation, research pipelines, and any scenario where you can think in terms of “team roles.”

AutoGen (now part of Microsoft’s agent framework) excels at conversational multi-agent patterns. It’s ideal for complex agent collaboration and enterprise Microsoft environments. The March 2025 update introduced a unified SDK, an Agent-to-Agent protocol, and seamless Azure AI Foundry integration.

Pick one framework to start. Don’t try to learn all three at once. For most practitioners, start with CrewAI for rapid prototyping, then learn LangGraph when you need production-grade control.

Building Practical Projects That Demonstrate Skills

Theory without practice won’t land you opportunities. You need portfolio projects that prove you can build production-ready agents.

Start simple: Build a research agent that takes a question, searches multiple sources, synthesizes information, and provides a cited answer. This project teaches you tool integration (web search), memory management (tracking sources), and response generation.

Next level: Create a multi-agent content creation system. Define agents with specific roles: researcher, writer, editor, fact-checker. Then orchestrate them to produce polished articles. This demonstrates understanding of agent coordination and task delegation. Our tutorial on Building Your First Multi-Agent System: A Beginner’s Guide walks through this with CrewAI.

Advanced: Build an autonomous data analysis agent that connects to your databases, explores data based on natural language queries, generates insights, creates visualizations, and flags anomalies — all without step-by-step human guidance. This showcases RAG techniques, tool use, and planning capabilities.

Hands-on resources:

Learning Memory Systems And Advanced Patterns

What separates junior agent developers from experts is an understanding of memory systems and advanced reasoning patterns.

Memory systems are essential for agents that need context across conversations. Short-term memory (session state) handles current interactions using tools like Redis or LangGraph’s built-in checkpointer. Long-term memory requires more sophistication: vector stores for semantic retrieval, knowledge graphs for structured facts with temporal tracking, and summarization strategies to prevent memory bloat.

The 2025 best practice is a hybrid approach: vector search for semantic retrieval, knowledge graphs for factual accuracy and updates, and decay strategies to manage growth. LangGraph’s LangMem module and the Redis Agent Memory Server are production-proven solutions.

Advanced patterns to learn include agentic RAG (where agents decide when to retrieve information and generate targeted queries), multi-agent orchestration (the “puppeteer” pattern where a trained orchestrator dynamically directs specialist agents), and human-in-the-loop workflows (escalating important decisions while maintaining autonomy for routine tasks).

The Model Context Protocol (MCP), adopted broadly in 2025, is transforming agent connectivity. Learning MCP now future-proofs your skills as it becomes a standard for connecting agents to tools and data sources.

Deep-dive resources:

Putting Your Learning Into Practice

You now have a comprehensive roadmap from foundations to applications. As you develop these skills, you’ll find opportunities across a range of roles: AI Engineer, Machine Learning Engineer (with an agent focus), AI Architect, MLOps Engineer, and the emerging Agent Orchestrator position. These roles span entry-level through senior positions across industries, all requiring the foundational knowledge you’ve gained from this guide.

The agentic AI field is growing rapidly, with the market expanding from an estimated $5–7 billion in 2025 to a projected $50–200 billion by 2030–2034. Organizations across financial services, healthcare, retail, and professional services are actively deploying agent systems. This growth creates opportunities for practitioners who understand both the technical foundations and practical implementation of agentic systems. The practitioners developing these skills now are positioning themselves at the forefront of a rapidly evolving field.

Building Transformer Models from Scratch with PyTorch (10-day Mini-Course)

 

You’ve likely used ChatGPT, Gemini, or Grok, which demonstrate how large language models can exhibit human-like intelligence. While creating a clone of these large language models at home is unrealistic and unnecessary, understanding how they work helps demystify their capabilities and recognize their limitations.

All these modern large language models are decoder-only transformers. Surprisingly, their architecture is not overly complex. While you may not have extensive computational power and memory, you can still create a smaller language model that mimics some capabilities of the larger ones. By designing, building, and training such a scaled-down version, you’ll better understand what the model is doing, rather than simply viewing it as a black box labeled “AI.”

In this 10-part crash course, you’ll learn through examples how to build and train a transformer model from scratch using PyTorch. The mini-course focuses on model architecture, while advanced optimization techniques, though important, are beyond our scope. We’ll guide you from data collection through to running your trained model. Each lesson covers a specific transformer component, explaining its role, design parameters, and PyTorch implementation. By the end, you’ll have explored every aspect of the model and gained a comprehensive understanding of how transformer models work.

Let’s get started.

 


Photo by Caleb Jack. Some rights reserved.

Who Is This Mini-Course For?

Before we begin, let’s make sure you’re in the right place. The list below provides general guidelines on whom this course is designed for. Don’t worry if you don’t match these points exactly—you might just need to brush up on certain areas to keep up.

  • Developers with some coding experience. You should be comfortable writing Python code and setting up your development environment (a prerequisite). You don’t need to be an expert coder, but you should be able to install packages and write scripts without hesitation.
  • Developers with basic machine learning knowledge. You should have a general understanding of machine learning models and feel comfortable using them. You don’t need to be an expert, but you should not be afraid to learn more about them.
  • Developers familiar with PyTorch. This project is based on PyTorch. To keep it concise, we will not cover the basics of PyTorch. You are not required to be a PyTorch expert, but you are expected to be able to read and understand PyTorch code and, more importantly, know how to consult the PyTorch documentation in case you encounter any functions you are not familiar with.

This mini-course is not a textbook on transformers or LLMs. Instead, it serves as a project-based guide that takes you step by step from a developer with minimal experience to one who can confidently demonstrate how a transformer model is created.

Mini-Course Overview

This mini-course is divided into 10 parts.

Each lesson is designed to take about 30 minutes for the average developer. While some lessons may be completed more quickly, others might require more time if you choose to explore them in depth.
You can progress at your own pace. We recommend following a comfortable schedule of one lesson per day over ten days to allow for proper absorption of the material.

The topics you will cover over the next 10 lessons are as follows:

  • Lesson 1: Getting the Data
  • Lesson 2: Train a Tokenizer for Your Language Model
  • Lesson 3: Positional Encoding
  • Lesson 4: Grouped Query Attention
  • Lesson 5: Causal Mask
  • Lesson 6: Mixture of Expert Models
  • Lesson 7: RMS Norm and Skip Connection
  • Lesson 8: The Complete Transformer Model
  • Lesson 9: Training the Model
  • Lesson 10: Using the Model

This journey will be both challenging and rewarding.
While it requires dedication through reading, research, and programming, the hands-on experience you’ll gain in building a transformer model will be invaluable.

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

You can download the code of this post here.

Lesson 01: Getting the Data

We are building a language model using transformer architecture. A language model is a probabilistic representation of human language that predicts the likelihood of words appearing in a sequence. Rather than being manually constructed, these probabilities are learned from data. Therefore, the first step in building a language model is to collect a large corpus of text that captures the natural patterns of language use.

There are numerous sources of text data available. Project Gutenberg is an excellent source of free text data, offering a wide variety of books across different genres. Here’s how you can download text data from Project Gutenberg to your local directory:

This code downloads each book as a separate text file. Since Project Gutenberg provides pre-cleaned text, we only need to extract the book contents and store them as a list of strings in Python:

The preprocess_gutenberg() function removes the Project Gutenberg header and footer from each book and joins the lines into a single string. The get_dataset_text() function applies this preprocessing to all books and returns a list of strings, where each string represents a complete book.
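A sketch of both steps under stated assumptions: the book IDs and URLs in `DATASOURCE` are illustrative (any Project Gutenberg plain-text URLs work), and the header/footer marker text varies slightly between books, so the prefixes below may need adjusting:

```python
import os
import urllib.request

# Illustrative book selection; swap in any Gutenberg plain-text URLs.
DATASOURCE = {
    "pride_and_prejudice": "https://www.gutenberg.org/ebooks/1342.txt.utf-8",
    "frankenstein": "https://www.gutenberg.org/ebooks/84.txt.utf-8",
}

def download_books(datasource, directory="books"):
    os.makedirs(directory, exist_ok=True)
    for name, url in datasource.items():
        path = os.path.join(directory, name + ".txt")
        if not os.path.exists(path):
            urllib.request.urlretrieve(url, path)

START = "*** START OF THE PROJECT GUTENBERG EBOOK"
END = "*** END OF THE PROJECT GUTENBERG EBOOK"

def preprocess_gutenberg(text):
    # Keep only the lines between the Gutenberg header and footer.
    keep, kept = False, []
    for line in text.splitlines():
        if line.startswith(START):
            keep = True
            continue
        if line.startswith(END):
            break
        if keep:
            kept.append(line)
    return "\n".join(kept).strip()

def get_dataset_text(directory="books"):
    books = []
    for fname in sorted(os.listdir(directory)):
        with open(os.path.join(directory, fname), encoding="utf-8") as f:
            books.append(preprocess_gutenberg(f.read()))
    return books
```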

Your Task

Try running the code above! While this small collection of books would typically be insufficient for training a production-ready language model, it serves as an excellent starting point for learning. Notice that the books in the DATASOURCE dictionary span various genres. Can you think about why having diverse genres is important when building a language model?

In the next lesson, you will learn how to convert the textual data into numbers.

Lesson 02: Train a Tokenizer for Your Language Model

Computers operate on numbers, so text must be converted into numerical form for processing. In a language model, we assign numbers to “tokens,” and these thousands of distinct tokens form the model’s vocabulary.

A simple approach would be to open a dictionary and assign a number to each word. However, this naive method cannot handle unseen words effectively. A better approach is to train an algorithm that processes input text and breaks it down into tokens. This algorithm, called a tokenizer, splits text efficiently and can handle unseen words.

There are several approaches to training a tokenizer. Byte-pair encoding (BPE) is one of the most popular methods used in modern LLMs. Let’s use the tokenizer library to train a BPE tokenizer using the text we collected in the previous lesson:
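A sketch using the `tokenizers` library (assumed installed via `pip install tokenizers`). A tiny repeated sentence stands in for the Gutenberg corpus from Lesson 1, and the special-token list is an illustrative choice:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=10000,
    special_tokens=["[UNK]", "[PAD]", "[EOS]"],
)

# Stand-in corpus; in the mini-course this is the list of book
# strings returned by get_dataset_text() in Lesson 1.
corpus = ["It is a truth universally acknowledged."] * 100
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("tokenizer.json")

# Reload later without retraining:
reloaded = Tokenizer.from_file("tokenizer.json")
ids = reloaded.encode("It is a truth").ids
```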

This example creates a small BPE tokenizer with a vocabulary size of 10,000. Production LLMs typically use vocabularies that are orders of magnitude larger for better language coverage. Even for this toy project, training a tokenizer takes time as it analyzes character collocations to form words. It’s recommended to save the tokenizer as a JSON file, as shown above, so you can easily reload it later:

Your Task

Besides BPE, WordPiece is another common tokenization algorithm. Try creating a WordPiece version of the tokenizer above.

Why is a vocabulary size of 10,000 insufficient for a good language model? Research the number of words in a typical English dictionary and explain the implications for language modeling.

In the next lesson, you’ll learn about positional encoding.

Lesson 03: Positional Encoding

Unlike recurrent neural networks, transformer models process entire sequences simultaneously. However, this parallel processing means they lack inherent understanding of token order. Since token position is crucial for understanding context, transformer models incorporate positional encodings into their input processing to capture this sequential information.

While several positional encoding methods exist, Rotary Positional Encoding (RoPE) has emerged as the most widely used approach. RoPE operates by applying rotational transformations to the embedded token vectors. Each token is represented as a vector, and the encoding process involves multiplying pairs of vector elements by a 2×2 rotation matrix:

$$\hat{\mathbf{x}}_m = \mathbf{R}_m \mathbf{x}_m = \begin{bmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{bmatrix} \mathbf{x}_m$$

To implement RoPE, you can use the following PyTorch code:
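A sketch of such a module, assuming an input layout of (batch, seq_len, num_heads, head_dim); the base frequency 10000 is the conventional choice:

```python
import torch
import torch.nn as nn

class RotaryPositionalEncoding(nn.Module):
    def __init__(self, dim, max_seq_len=4096, base=10000.0):
        super().__init__()
        # Pre-compute sin/cos for every position and rotation frequency.
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        pos = torch.arange(max_seq_len).float()
        freqs = torch.outer(pos, inv_freq)            # (max_seq_len, dim/2)
        # register_buffer: non-trainable state, moved/saved with the model.
        self.register_buffer("cos", freqs.cos())
        self.register_buffer("sin", freqs.sin())

    def forward(self, x):
        # x: (batch, seq_len, num_heads, head_dim) -- assumed layout
        seq_len = x.shape[1]
        cos = self.cos[:seq_len].unsqueeze(0).unsqueeze(2)  # (1, seq, 1, dim/2)
        sin = self.sin[:seq_len].unsqueeze(0).unsqueeze(2)
        x1, x2 = x[..., 0::2], x[..., 1::2]   # paired elements to rotate
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out
```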

The RotaryPositionalEncoding module implements the positional encoding mechanism for input sequences. Its __init__ function pre-computes sine and cosine values for all possible positions and dimensions, while the forward function applies the rotation matrix to transform the input.

An important implementation detail is the use of register_buffer in the __init__ function to store sine and cosine values. This tells PyTorch to treat these tensors as non-trainable model parameters, ensuring proper management across different computing devices (e.g., GPU) and during model serialization.

Your Task

Experiment with the code provided above. Earlier, we learned that RoPE applies to embedded token vectors in a sequence. Take a closer look at the input tensor sequence used to test the RotaryPositionalEncoding module: why is it a 4D tensor? While the last dimension (128) represents the embedding size, can you identify what the first three dimensions (1, 10, 4) represent in the context of transformer architecture?

In the next lesson, you will learn about the attention block.

Lesson 04: Grouped Query Attention

The signature component of a transformer model is its attention mechanism. When processing a sequence of tokens, the attention mechanism builds connections between tokens to understand their context.

The attention mechanism predates transformer models, and several variants have evolved over time. In this lesson, you will learn to implement Grouped Query Attention (GQA).

A transformer model begins with a sequence of embedded tokens, which are essentially vectors. The modern attention mechanism computes an output sequence based on three input sequences: query, key, and value. These three sequences are derived from the input sequence through different projections:
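A sketch of those projections with illustrative sizes (64-dimensional hidden state, 8 query heads, 2 key-value heads):

```python
import torch
import torch.nn as nn

batch, seq_len, hidden_dim = 2, 10, 64
num_heads, num_kv_heads = 8, 2
head_dim = hidden_dim // num_heads

q_proj = nn.Linear(hidden_dim, num_heads * head_dim)
k_proj = nn.Linear(hidden_dim, num_kv_heads * head_dim)
v_proj = nn.Linear(hidden_dim, num_kv_heads * head_dim)

x = torch.randn(batch, seq_len, hidden_dim)
# Project, split the last dimension into (heads, head_dim), then move
# the head dimension ahead of the sequence dimension.
q = q_proj(x).view(batch, seq_len, num_heads, head_dim).transpose(1, 2)
k = k_proj(x).view(batch, seq_len, num_kv_heads, head_dim).transpose(1, 2)
v = v_proj(x).view(batch, seq_len, num_kv_heads, head_dim).transpose(1, 2)
```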

The projection is performed by a fully-connected neural network layer that operates on the input tensor’s last dimension. As shown above, the projection’s output is reshaped using view() and then transposed. The input tensor x is 3D, and the view() function transforms it into a 4D tensor by splitting the last dimension into two: the attention heads and the head dimension. The transpose() function then swaps the sequence length dimension with the attention head dimension.

The resulting 4D tensor has attention operations that only involve the last two dimensions. The actual attention computation is performed using PyTorch’s built-in scaled_dot_product_attention() function. The result is then reshaped back into a 3D tensor and projected to the original dimension.

This architecture is called grouped query attention because it uses different numbers of heads for queries versus keys and values. Typically, the number of query heads is a multiple of the number of key-value heads.

Since we will use this attention mechanism a lot, let's create a class for it:
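A sketch of such a class. One detail is an implementation choice rather than the course's exact code: the key-value heads are expanded with `repeat_interleave` so each group of query heads shares one key-value head before calling `scaled_dot_product_attention()`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQA(nn.Module):
    def __init__(self, hidden_dim, num_heads, num_kv_heads, dropout=0.0):
        super().__init__()
        assert hidden_dim % num_heads == 0
        assert num_heads % num_kv_heads == 0
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.head_dim = hidden_dim // num_heads
        self.q_proj = nn.Linear(hidden_dim, num_heads * self.head_dim)
        self.k_proj = nn.Linear(hidden_dim, num_kv_heads * self.head_dim)
        self.v_proj = nn.Linear(hidden_dim, num_kv_heads * self.head_dim)
        self.out_proj = nn.Linear(hidden_dim, hidden_dim)
        self.dropout = dropout

    def forward(self, x, mask=None, rope=None):
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.num_heads, self.head_dim)
        k = self.k_proj(x).view(b, s, self.num_kv_heads, self.head_dim)
        v = self.v_proj(x).view(b, s, self.num_kv_heads, self.head_dim)
        if rope is not None:                    # rotary encoding from Lesson 3
            q, k = rope(q), rope(k)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))
        # Expand kv heads so each group of query heads shares one kv head.
        g = self.num_heads // self.num_kv_heads
        k, v = k.repeat_interleave(g, dim=1), v.repeat_interleave(g, dim=1)
        out = F.scaled_dot_product_attention(
            q, k, v, attn_mask=mask,
            dropout_p=self.dropout if self.training else 0.0)
        out = out.transpose(1, 2).reshape(b, s, -1)
        return self.out_proj(out)
```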

The forward function includes two optional arguments: mask and rope. The rope argument expects a module that applies rotary positional encoding, which was covered in the previous lesson. The mask argument will be explained in the next lesson.

Your Task

Consider why this implementation is called grouped query attention. The original transformer architecture uses multihead attention. How would you modify this grouped query attention implementation to create a multihead attention mechanism?

In the next lesson, you’ll learn about masking in attention operations.

Lesson 05: Causal Mask

A key characteristic of decoder-only transformer models is the use of causal masks in their attention layers. A causal mask is a matrix applied during attention score calculation to prevent the model from attending to future tokens. Specifically, a query token i can only attend to key tokens j where j ≤ i.

With query and key sequences of length N, the causal mask is a square matrix of shape (N, N). The element (i, j) indicates whether query token i can attend to key token j.

In a boolean mask matrix, the element (i, j) is True for j ≤ i, making all elements on and below the diagonal True. However, we typically use a floating-point matrix instead, because it can simply be added to the attention score matrix before softmax normalization. In that case, elements where j ≤ i are set to 0, and all other elements are set to −∞.

Creating such a causal mask is straightforward in PyTorch:
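A minimal sketch:

```python
import torch

N = 5
# 0 on and below the diagonal (allowed), -inf strictly above (blocked).
causal_mask = torch.full((N, N), float("-inf")).triu(diagonal=1)
```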

This creates a matrix of shape (N, N) filled with −∞, then uses the triu() function (with diagonal=1) to zero out all elements on and below the diagonal, leaving −∞ only in the strictly upper triangle.

Applying the mask in attention is straightforward:
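A sketch with illustrative tensor sizes; note that scaled_dot_product_attention() can also build the causal mask itself via is_causal=True, which should give the same result:

```python
import torch
import torch.nn.functional as F

B, H, N, D = 2, 4, 6, 16       # batch, heads, seq length, head dim
q = torch.randn(B, H, N, D)
k = torch.randn(B, H, N, D)
v = torch.randn(B, H, N, D)

mask = torch.full((N, N), float("-inf")).triu(diagonal=1)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# Equivalent shortcut: let SDPA construct the causal mask internally.
out2 = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```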

In some cases, you might need to mask additional elements, such as padding tokens in the sequence. This can be done by setting the corresponding elements to −∞ in the mask tensor. While the example above shows a 2D tensor, when using both causal and padding masks, you’ll need to create a 3D tensor. In this case, each element in the batch has its own mask, and the first dimension of the mask tensor should match the batch dimension of the input tensors q, k, and v.

Your Task

Given the scaled_dot_product_attention() call above and a tensor q of shape (B, H, N, D) containing some padding tokens, how would you create a mask tensor of shape (B, N, N) that combines both causal and padding masks to: (1) prevent attention to future tokens, and (2) mask all attention operations involving padding tokens?

In the next lesson, you will learn about the MLP sublayer.

Lesson 06: Mixture of Expert Models

Transformer models consist of stacked transformer blocks, where each block contains an attention sublayer and an MLP sublayer. The attention sublayer implements a multi-head attention mechanism, while the MLP sublayer is a feed-forward network.

The MLP sublayer introduces non-linearity to the model and is where much of the model’s “intelligence” resides. To enhance the model’s capabilities, you can either increase the size of the feed-forward network or employ a more sophisticated architecture like Mixture of Experts (MoE).

MoE is a recent innovation in transformer models. It consists of multiple parallel MLP sublayers with a router that selects a subset of them to process the input. The final output is a weighted sum of the outputs from the selected MLP sublayers. Many modern large language models use SwiGLU as their MLP sublayer, which combines three linear transformations with a SiLU activation function. Here’s how to implement it:
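A sketch of a SwiGLU sublayer (the layer names `gate`, `up`, and `down` are conventional, not mandated):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Three linear transformations combined with a SiLU activation."""
    def __init__(self, hidden_dim, intermediate_dim):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, intermediate_dim)
        self.up = nn.Linear(hidden_dim, intermediate_dim)
        self.down = nn.Linear(intermediate_dim, hidden_dim)

    def forward(self, x):
        # SiLU-activated gate multiplies the "up" projection elementwise.
        return self.down(F.silu(self.gate(x)) * self.up(x))
```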

For example, in a system with 8 MLP sublayers, the router processes each input token using a linear layer to produce 8 scores. The top 2 scoring sublayers are selected to process the input, and their outputs are combined using weighted summation.

Since PyTorch doesn’t yet provide a built-in MoE layer, you need to implement it yourself. Here’s an implementation:
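A sketch of such a layer. It loops over experts and gathers the tokens routed to each one, which is simple rather than fast; a SwiGLU expert is defined inline so the block is self-contained:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, hidden_dim, intermediate_dim):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, intermediate_dim)
        self.up = nn.Linear(hidden_dim, intermediate_dim)
        self.down = nn.Linear(intermediate_dim, hidden_dim)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class MoE(nn.Module):
    def __init__(self, hidden_dim, intermediate_dim, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts)
        self.experts = nn.ModuleList(
            SwiGLU(hidden_dim, intermediate_dim) for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x):
        b, s, d = x.shape
        flat = x.reshape(-1, d)                       # one row per token
        probs = self.router(flat).softmax(dim=-1)
        top_k_probs, top_k_indices = probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(flat)
        for i, expert in enumerate(self.experts):
            # Which (token, slot) pairs were routed to expert i?
            token_idx, slot = (top_k_indices == i).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            weight = top_k_probs[token_idx, slot].unsqueeze(-1)
            out[token_idx] += weight * expert(flat[token_idx])
        return out.reshape(b, s, d)
```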

The forward() method first uses the router to generate top_k_indices and top_k_probs. Based on these indices, it selects and applies the corresponding experts to process the input. The results are combined using weighted summation with top_k_probs. The input is a 3D tensor of shape (batch_size, seq_len, hidden_dim), and since each token in a sequence can be processed by different experts, the method uses masking to correctly apply the weighted sum.

Your Task

Models like DeepSeek V2 incorporate a shared expert in their MoE architecture. It is an expert that processes every input regardless of routing. Can you modify the code above to include a shared expert?

In the next lesson, you will learn about normalization layers.

Lesson 07: RMS Norm and Skip Connections

A transformer is a typical deep learning model that can easily stack hundreds of transformer blocks, with each block containing multiple operations. Such deep models are sensitive to the vanishing gradient problem, so normalization layers are added to mitigate this issue and stabilize training.

The two most common normalization layers in transformer models are Layer Norm and RMS Norm. We will use RMS Norm because it has fewer parameters. Using the built-in RMS Norm layer in PyTorch is straightforward:

There are two ways to use RMS Norm in a transformer model: pre-norm and post-norm. In pre-norm, you apply RMS Norm before the attention and feed-forward sublayers, while in post-norm, you apply it after. This difference becomes clear when considering the skip connections. Here’s an example of a decoder-only transformer block with pre-norm:

Each transformer block contains an attention sublayer (implemented using the GQA class from lesson 4) and a feed-forward sublayer (implemented using the MoE class from lesson 6), along with two RMS Norm layers.

In the forward() method, we first normalize the input before applying the attention sublayer. Then, for the skip connection, we add the original unnormalized input to the attention sublayer’s output. In a post-norm approach, we would instead apply attention to the unnormalized input and then normalize the tensor after the skip connection. Research has shown that the pre-norm approach provides more stable training.

Your Task

From the description above, how would you modify the code to make it a post-norm transformer block?

In the next lesson, you will learn to create the complete transformer model.

Lesson 08: The Complete Transformer Model

So far, you have created all the building blocks of the transformer model. You can build a complete transformer model by stacking these blocks together. Before doing that, let’s list out the design parameters by creating a dictionary for the model configuration:

The number of transformer blocks and the hidden dimension directly determine the model size. You can think of them as the “depth” and “width” of the model respectively. For each transformer block, you need to specify the number of attention heads (and in GQA, the number of key-value heads). Since we’re using the MoE model, you also need to define the total number of experts and the top-k value. Note that the MLP sublayer (implemented as SwiGLU) typically sets the intermediate dimension to 4 times the hidden dimension, so you don’t need to specify this separately.

The remaining hyperparameters don’t affect the model size: the maximum sequence length (which the rotary positional encoding depends on), the vocabulary size (which determines the embedding matrix dimensions), and the dropout rate used during training.

With these, you can create a transformer model. Let’s call it TextGenerationModel:
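As a simplified, runnable sketch (PyTorch's built-in pre-norm encoder layers stand in for the custom GQA/MoE blocks, a LayerNorm stands in for the final RMS Norm, and the rotary encoding is omitted for brevity):

```python
import torch
import torch.nn as nn

class TextGenerationModel(nn.Module):
    """Embedding -> stack of transformer blocks -> final norm -> vocab logits."""
    def __init__(self, vocab_size, hidden_dim, num_blocks, num_heads, dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=hidden_dim, nhead=num_heads,
                dim_feedforward=4 * hidden_dim,
                dropout=dropout, batch_first=True, norm_first=True,
            )
            for _ in range(num_blocks)
        ])
        self.norm = nn.LayerNorm(hidden_dim)
        self.head = nn.Linear(hidden_dim, vocab_size, bias=False)

    def forward(self, token_ids, mask=None):
        x = self.embedding(token_ids)        # (batch, seq) -> (batch, seq, dim)
        for block in self.blocks:
            x = block(x, src_mask=mask)
        x = self.norm(x)
        return self.head(x)                  # logits over the vocabulary
```

For causal language modeling, you would pass a causal mask such as `nn.Transformer.generate_square_subsequent_mask(seq_len)` to `forward()`.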

In this model, we create a single rotary position encoding module that’s reused across all transformer blocks. Since it’s a constant module, we only need one instance. The model begins with an embedding layer that converts token IDs into embedding vectors. These vectors are then processed through a series of transformer blocks. The output from the final transformer block remains a sequence of embedding vectors, which we normalize and project to vocabulary-sized logits using a linear layer. These logits represent probability distributions for predicting the next token in the sequence.

Your Task

The model is now complete. However, consider this question: Why does the forward() method accept a mask as an optional argument? If we’re using a causal mask, wouldn’t it make more sense to generate it internally within the model?

In the next lesson, you will learn to train the model.

Lesson 09: Training the Model

Now that you’ve built a model, let’s learn how to train it. In lesson 1, you prepared the dataset for training. The next step is to wrap the dataset as a PyTorch Dataset object:
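A minimal version of such a dataset, assuming the token IDs from lesson 1 are available as a flat Python list; here y is x shifted one position ahead, i.e. the next token at every step:

```python
import torch
from torch.utils.data import Dataset

class NextTokenDataset(Dataset):
    """Wraps a flat list of token IDs into fixed-length (x, y) training pairs."""
    def __init__(self, token_ids, seq_len):
        self.tokens = torch.tensor(token_ids, dtype=torch.long)
        self.seq_len = seq_len

    def __len__(self):
        # Each sample needs seq_len inputs plus one extra token for the target
        return len(self.tokens) - self.seq_len

    def __getitem__(self, idx):
        x = self.tokens[idx : idx + self.seq_len]
        y = self.tokens[idx + 1 : idx + self.seq_len + 1]
        return x, y
```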

This dataset is designed for model pre-training, where the task is to predict the next token in a sequence. The dataset object is a Python iterable that produces pairs of (x,y), where x is a sequence of token IDs with fixed length, and y is the corresponding next token. Since the training targets (y) are derived from the input data itself, this approach is called self-supervised learning.

Depending on your hardware, you can optimize the training speed and memory usage. If you have a GPU with limited memory, you can load the model onto the GPU and use half-precision (bfloat16) to reduce memory consumption. Here’s how:
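For example (a small linear layer stands in for the full transformer model here):

```python
import torch

# Pick the GPU when available; bfloat16 halves the memory of each parameter
# while keeping the same exponent range as float32.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(512, 512)          # stand-in for the transformer model
model = model.to(device=device, dtype=torch.bfloat16)
```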

If you still encounter an out-of-memory error, reduce the model size or the batch size.

You need to write a training loop to train the model. In PyTorch, you may do as follows:
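A training loop along these lines (the hyperparameter values are illustrative):

```python
import math
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
try:
    from tqdm import tqdm
except ImportError:                  # fall back to a plain iterator if needed
    tqdm = lambda it, **kw: it

def train(model, dataset, epochs=2, batch_size=32, lr=1e-3,
          warmup_steps=100, checkpoint="model.pt"):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    total_steps = epochs * max(1, len(loader))

    def lr_lambda(step):
        # Linear warm-up, then cosine decay toward zero
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    loss_fn = nn.CrossEntropyLoss()
    best_loss = float("inf")

    for epoch in range(epochs):
        epoch_loss = 0.0
        for x, y in tqdm(loader, desc=f"epoch {epoch}"):
            optimizer.zero_grad()
            logits = model(x)                         # (batch, seq, vocab)
            # Flatten so cross-entropy sees one prediction per position
            loss = loss_fn(logits.flatten(0, 1), y.flatten())
            loss.backward()
            # Gradient clipping guards against gradient explosion
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            epoch_loss += loss.item()
        epoch_loss /= len(loader)
        if epoch_loss < best_loss:                    # keep the best weights
            best_loss = epoch_loss
            torch.save(model.state_dict(), checkpoint)
    return best_loss
```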

While this training loop might differ from what you’ve used for other models, it follows best practices for training transformers. The code uses a cosine learning rate scheduler with a warm-up period—the learning rate gradually increases during warm-up and then decreases following a cosine curve.

To prevent gradient explosion, we implement gradient clipping, which stabilizes training by limiting drastic changes in model parameters.

The model functions as a next-token predictor, outputting a probability distribution over the entire vocabulary. Since this is essentially a classification task (predicting which token comes next), we use cross-entropy loss for training.

The training progress is monitored using tqdm, which displays the loss for each epoch. The model’s parameters are saved whenever the loss improves, ensuring we keep the best performing version.

Your Task

The training loop above runs for only two epochs. Consider why this number is relatively small, and what factors might make additional epochs unnecessary for this particular task.

In the next lesson, you will learn to use the model.

Lesson 10: Using the Model

After training the model, you can use it to generate text. To optimize performance, disable gradient computation in PyTorch. Additionally, since some modules like dropout behave differently during training and inference, switch the model to evaluation mode before use.

Let’s create a function for text generation that can be called multiple times to generate different samples:
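A sketch of such a function; a trivial character-level encoder/decoder stands in for the tokenizer built in lesson 1:

```python
import torch

# Character-level stand-in for the real tokenizer
CHARS = sorted(set("abcdefghijklmnopqrstuvwxyz .,!?'\n"))
STOI = {c: i for i, c in enumerate(CHARS)}

def encode(text):
    return [STOI[c] for c in text if c in STOI]

def decode(ids):
    return "".join(CHARS[i] for i in ids)

@torch.no_grad()
def generate_text(model, prompt, max_new_tokens=50, temperature=1.0):
    # The model may live on GPU or CPU, so place the input on the same device
    device = next(model.parameters()).device
    model.eval()
    ids = torch.tensor([encode(prompt)], dtype=torch.long, device=device)  # batch of 1
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]                 # logits at the last position
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # probabilistic sampling
        ids = torch.cat([ids, next_id], dim=1)
    return decode(ids[0].tolist())
```

Lower `temperature` values sharpen the distribution toward the most likely tokens; higher values flatten it, producing more varied output.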

The generate_text() function implements probabilistic sampling for token generation. Although the model outputs logits representing a probability distribution over the vocabulary, it doesn’t always select the most probable token. Instead, it uses the softmax function to convert logits to probabilities. The temperature parameter controls the sampling distribution: lower values make the model more conservative by emphasizing likely tokens, while higher values make it more creative by reducing the probability differences between tokens.

The function takes a partial sentence as a prompt string and generates a sequence of tokens using the model. Although the model is trained with batches, this function uses a batch size of 1 for simplicity. The final output is returned as a decoded string.

Your Task

Look at the code above: Why does the function need to determine the model’s device at the beginning?

The current implementation uses a simple sampling approach. An advanced technique called nucleus sampling (or top-p sampling) considers only the most likely tokens whose cumulative probability exceeds a threshold p. How would you modify the code to implement nucleus sampling?

This is the last lesson.

The End! (Look How Far You Have Come)

You made it. Well done!

Take a moment and look back at how far you have come.

  • You discovered what transformer models are and how their architecture works.
  • You learned how to build a transformer model from scratch.
  • You learned how to train and use a transformer model.

Don’t make light of this; you have come a long way in a short time. This is just the beginning of your transformer model journey. Keep practicing and developing your skills.

Summary

How did you do with the mini-course? Did you enjoy it?

Do you have any questions? Were there any sticking points?
Let me know. Leave a comment below.

