You’ve likely used ChatGPT, Gemini, or Grok, which demonstrate how large language models
can exhibit human-like intelligence. While creating a clone of these
large language models at home is unrealistic and unnecessary,
understanding how they work helps demystify their capabilities and
recognize their limitations.
All these modern large language models are decoder-only transformers.
Surprisingly, their architecture is not overly complex. While you may
not have extensive computational power and memory, you can still create a
smaller language model that mimics some capabilities of the
larger ones. By designing, building, and training such a scaled-down
version, you’ll better understand what the model is doing, rather than
simply viewing it as a black box labeled “AI.”
In this 10-part crash course, you’ll learn through examples how to
build and train a transformer model from scratch using PyTorch. The
mini-course focuses on model architecture, while advanced optimization
techniques, though important, are beyond our scope. We’ll guide you from
data collection through to running your trained model. Each lesson
covers a specific transformer component, explaining its role, design
parameters, and PyTorch implementation. By the end, you’ll have explored
every aspect of the model and gained a comprehensive understanding of
how transformer models work.
Let’s get started.

Building Transformer Models from Scratch with PyTorch (10-day Mini-Course)
Photo by Caleb Jack. Some rights reserved.
Who Is This Mini-Course For?
Before we begin, let’s make sure you’re in the right place. The list
below provides general guidelines on whom this course is designed for.
Don’t worry if you don’t match these points exactly—you might just need
to brush up on certain areas to keep up.
- Developers with some coding experience. You should
be comfortable writing Python code and setting up your development
environment (a prerequisite). You don’t need to be an expert coder, but
you should be able to install packages and write scripts without
hesitation.
- Developers with basic machine learning knowledge.
You should have a general understanding of machine learning models and
feel comfortable using them. You don’t need to be an expert, but you
should not be afraid to learn more about them.
- Developers familiar with PyTorch. This project is
based on PyTorch. To keep it concise, we will not cover the basics of
PyTorch. You are not required to be a PyTorch expert, but you are
expected to be able to read and understand PyTorch code, and more
importantly, know how to read the documentation of PyTorch in case you
encountered any functions that you are not familiar with.
This mini-course is not a textbook on transformer or LLM. Instead, it
serves as a project-based guide that takes you step by step from a
developer with minimal experience to one who can confidently demonstrate
how a transformer model is created.
Mini-Course Overview
This mini-course is divided into 10 parts.
Each lesson is designed to take about 30 minutes for the average
developer. While some lessons may be completed more quickly, others
might require more time if you choose to explore them in depth.
You can progress at your own pace. We recommend following a comfortable
schedule of one lesson per day over ten days to allow for proper
absorption of the material.
The topics you will cover over the next 10 lessons are as follows:
- Lesson 1: Getting the Data
- Lesson 2: Train a Tokenizer for Your Language Model
- Lesson 3: Positional Encoding
- Lesson 4: Grouped Query Attention
- Lesson 5: Causal Mask
- Lesson 6: Mixture of Expert Models
- Lesson 7: RMS Norm and Skip Connection
- Lesson 8: The Complete Transformer Model
- Lesson 9: Training the Model
- Lesson 10: Using the Model
This journey will be both challenging and rewarding.
While it requires dedication through reading, research, and programming,
the hands-on experience you’ll gain in building a transformer model
will be invaluable.
Post your results in the comments; I’ll cheer you on!
Hang in there; don’t give up.
You can download the code of this post here.
Lesson 01: Getting the Data
We are building a language model using transformer architecture. A
language model is a probabilistic representation of human language that
predicts the likelihood of words appearing in a sequence. Rather than
being manually constructed, these probabilities are learned from data.
Therefore, the first step in building a language model is to collect a
large corpus of text that captures the natural patterns of language use.
There are numerous sources of text data available. Project Gutenberg
is an excellent source of free text data, offering a wide variety of
books across different genres. Here’s how you can download text data
from Project Gutenberg to your local directory:
import os import requests DATASOURCE = { "memoirs_of_grant": "https://www.gutenberg.org/ebooks/4367.txt.utf-8", "frankenstein": "https://www.gutenberg.org/ebooks/84.txt.utf-8", "sleepy_hollow": "https://www.gutenberg.org/ebooks/41.txt.utf-8", "origin_of_species": "https://www.gutenberg.org/ebooks/2009.txt.utf-8", "makers_of_many_things": "https://www.gutenberg.org/ebooks/28569.txt.utf-8", "common_sense": "https://www.gutenberg.org/ebooks/147.txt.utf-8", "economic_peace": "https://www.gutenberg.org/ebooks/15776.txt.utf-8", "the_great_war_3": "https://www.gutenberg.org/ebooks/29265.txt.utf-8", "elements_of_style": "https://www.gutenberg.org/ebooks/37134.txt.utf-8", "problem_of_philosophy": "https://www.gutenberg.org/ebooks/5827.txt.utf-8", "nights_in_london": "https://www.gutenberg.org/ebooks/23605.txt.utf-8", } for filename, url in DATASOURCE.items(): if not os.path.exists(f"{filename}.txt"): response = requests.get(url) with open(f"{filename}.txt", "wb") as f: f.write(response.content) |
This code downloads each book as a separate text file. Since
Project Gutenberg provides pre-cleaned text, we only need to extract the
book contents and store them as a list of strings in Python:
# Read and preprocess the text def preprocess_gutenberg(filename): with open(filename, "r", encoding="utf-8") as f: text = f.read() # Find the start and end of the actual content start = text.find("*** START OF THE PROJECT GUTENBERG EBOOK") start = text.find("\n", start) + 1 end = text.find("*** END OF THE PROJECT GUTENBERG EBOOK") # Extract the main content text = text[start:end].strip() # Basic preprocessing # Remove multiple newlines and spaces text = "\n".join(line.strip() for line in text.split("\n") if line.strip()) return text def get_dataset_text(): all_text = [] for filename in DATASOURCE: text = preprocess_gutenberg(f"{filename}.txt") all_text.append(text) return all_text text = get_dataset_text() |
The preprocess_gutenberg() function removes the Project Gutenberg header and footer from each book and joins the lines into a single string. The get_dataset_text() function applies this preprocessing to all books and returns a list of strings, where each string represents a complete book.
Your Task
Try running the code above! While this small collection of books
would typically be insufficient for training a production-ready language
model, it serves as an excellent starting point for learning. Notice
that the books in the DATASOURCE dictionary span various genres. Can you think about why having diverse genres is important when building a language model?
In the next lesson, you will learn how to convert the textual data into numbers.
Lesson 02: Train a Tokenizer for Your Language Model
Computers operate on numbers, so text must be converted into
numerical form for processing. In a language model, we assign numbers to
“tokens,” and these thousands of distinct tokens form the model’s
vocabulary.
A simple approach would be to open a dictionary and assign a number
to each word. However, this naive method cannot handle unseen words
effectively. A better approach is to train an algorithm that processes
input text and breaks it down into tokens. This algorithm, called a tokenizer, splits text efficiently and can handle unseen words.
There are several approaches to training a tokenizer. Byte-pair
encoding (BPE) is one of the most popular methods used in modern LLMs.
Let’s use the tokenizer library to train a BPE tokenizer using the text we collected in the previous lesson:
tokenizer = tokenizers.Tokenizer(tokenizers.models.BPE()) tokenizer.pre_tokenizer = tokenizers.pre_tokenizers.ByteLevel(add_prefix_space=True) tokenizer.decoder = tokenizers.decoders.ByteLevel() VOCAB_SIZE = 10000 trainer = tokenizers.trainers.BpeTrainer( vocab_size=VOCAB_SIZE, special_tokens=["[pad]", "[eos]"], show_progress=True ) text = get_dataset_text() tokenizer.train_from_iterator(text, trainer=trainer) tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[pad]"), pad_token="[pad]") # Save the trained tokenizer tokenizer.save("gutenberg_tokenizer.json", pretty=True) |
This example creates a small BPE tokenizer with a vocabulary
size of 10,000. Production LLMs typically use vocabularies that are
orders of magnitude larger for better language coverage. Even for this
toy project, training a tokenizer takes time as it analyzes character
collocations to form words. It’s recommended to save the tokenizer as a
JSON file, as shown above, so you can easily reload it later:
tokenizer = tokenizers.Tokenizer.from_file("gutenberg_tokenizer.json") |
Your Task
Besides BPE, WordPiece is another common tokenization algorithm. Try creating a WordPiece version of the tokenizer above.
Why is a vocabulary size of 10,000 insufficient for a good language
model? Research the number of words in a typical English dictionary and
explain the implications for language modeling.
In the next lesson, you’ll learn about positional encoding.
Lesson 03: Positional Encoding
Unlike recurrent neural networks, transformer models process entire
sequences simultaneously. However, this parallel processing means they
lack inherent understanding of token order. Since token position is
crucial for understanding context, transformer models incorporate
positional encodings into their input processing to capture this
sequential information.
While several positional encoding methods exist, Rotary Positional
Encoding (RoPE) has emerged as the most widely used approach. RoPE
operates by applying rotational transformations to the embedded token
vectors. Each token is represented as a vector, and the encoding process
involves multiplying pairs of vector elements by a 2 ×2 rotation matrix:
ˆ𝐱𝑚=𝐑𝑚𝐱𝑚=[cos(𝑚𝜃𝑖)−sin(𝑚𝜃𝑖)sin(𝑚𝜃𝑖)cos(𝑚𝜃𝑖)]𝐱𝑚
To implement RoPE, you can use the following PyTorch code:
def rotate_half(x): x1, x2 = x.chunk(2, dim=-1) return torch.cat((-x2, x1), dim=-1) def apply_rotary_pos_emb(x, cos, sin): return (x * cos) + (rotate_half(x) * sin) class RotaryPositionalEncoding(nn.Module): def __init__(self, dim, max_seq_len=1024): super().__init__() N = 10000 inv_freq = 1. / (N ** (torch.arange(0, dim, 2).float() / dim)) position = torch.arange(max_seq_len).float() inv_freq = torch.cat((inv_freq, inv_freq), dim=-1) sinusoid_inp = torch.outer(position, inv_freq) self.register_buffer("cos", sinusoid_inp.cos()) self.register_buffer("sin", sinusoid_inp.sin()) def forward(self, x, seq_len=None): if seq_len is None: seq_len = x.size(1) cos = self.cos[:seq_len].view(1, seq_len, 1, -1) sin = self.sin[:seq_len].view(1, seq_len, 1, -1) return apply_rotary_pos_emb(x, cos, sin) sequence = torch.randn(1, 10, 4, 128) rope = RotaryPositionalEncoding(128) new_sequence = rope(sequence) |
The RotaryPositionalEncoding module implements the positional encoding mechanism for input sequences. Its __init__ function pre-computes sine and cosine values for all possible positions and dimensions, while the forward function applies the rotation matrix to transform the input.
An important implementation detail is the use of register_buffer in the __init__
function to store sine and cosine values. This tells PyTorch to treat
these tensors as non-trainable model parameters, ensuring proper
management across different computing devices (e.g., GPU) and during
model serialization.
Your Task
Experiment with the code provided above. Earlier, we learned that
RoPE applies to embedded token vectors in a sequence. Take a closer look
at the input tensor sequence used to test the RotaryPositionalEncoding
module: why is it a 4D tensor? While the last dimension (128)
represents the embedding size, can you identify what the first three
dimensions (1, 10, 4) represent in the context of transformer
architecture?
In the next lesson, you will learn about the attention block.
Lesson 04: Grouped Query Attention
The signature component of a transformer model is its attention
mechanism. When processing a sequence of tokens, the attention mechanism
builds connections between tokens to understand their context.
The attention mechanism predates transformer models, and several
variants have evolved over time. In this lesson, you will learn to
implement Grouped Query Attention (GQA).
A transformer model begins with a sequence of embedded tokens, which
are essentially vectors. The modern attention mechanism computes an
output sequence based on three input sequences: query, key, and value.
These three sequences are derived from the input sequence through
different projections:
batch_size, seq_len, hidden_dim = x.shape q_proj = nn.Linear(hidden_dim, num_heads * head_dim) k_proj = nn.Linear(hidden_dim, num_kv_heads * head_dim) v_proj = nn.Linear(hidden_dim, num_kv_heads * head_dim) out_proj = nn.Linear(num_heads * head_dim, hidden_dim) q = q_proj(x).view(batch_size, seq_len, num_heads, head_dim).transpose(1, 2) k = k_proj(x).view(batch_size, seq_len, num_kv_heads, head_dim).transpose(1, 2) v = v_proj(x).view(batch_size, seq_len, num_kv_heads, head_dim).transpose(1, 2) output = F.scaled_dot_product_attention(q, k, v, enable_gqa=True) output = output.transpose(1, 2).reshape(batch_size, seq_len, hidden_dim).contiguous() output = out_proj(q) |
The projection is performed by a fully-connected neural
network layer that operates on the input tensor’s last dimension. As
shown above, the projection’s output is reshaped using view() and then transposed. The input tensor x is 3D, and the view()
function transforms it into a 4D tensor by splitting the last dimension
into two: the attention heads and the head dimension. The transpose() function then swaps the sequence length dimension with the attention head dimension.
The resulting 4D tensor has attention operations that only involve
the last two dimensions. The actual attention computation is performed
using PyTorch’s built-in scaled_dot_product_attention() function. The result is then reshaped back into a 3D tensor and projected to the original dimension.
This architecture is called grouped query attention because it uses
different numbers of heads for queries versus keys and values.
Typically, the number of query heads is a multiple of the number of
key-value heads.
Since we will use such attention mechanism a lot, let’s create a class for it:
class GQA(nn.Module): def __init__(self, hidden_dim, num_heads, num_kv_heads, dropout=0.1): super().__init__() self.num_heads = num_heads self.num_kv_heads = num_kv_heads self.head_dim = hidden_dim // num_heads self.num_groups = num_heads // num_kv_heads self.dropout = dropout self.q_proj = nn.Linear(hidden_dim, self.num_heads * self.head_dim) self.k_proj = nn.Linear(hidden_dim, self.num_kv_heads * self.head_dim) self.v_proj = nn.Linear(hidden_dim, self.num_kv_heads * self.head_dim) self.out_proj = nn.Linear(self.num_heads * self.head_dim, hidden_dim) def forward(self, q, k, v, mask=None, rope=None): q_batch_size, q_seq_len, hidden_dim = q.shape k_batch_size, k_seq_len, hidden_dim = k.shape v_batch_size, v_seq_len, hidden_dim = v.shape # projection q = self.q_proj(q).view(q_batch_size, q_seq_len, -1, self.head_dim).transpose(1, 2) k = self.k_proj(k).view(k_batch_size, k_seq_len, -1, self.head_dim).transpose(1, 2) v = self.v_proj(v).view(v_batch_size, v_seq_len, -1, self.head_dim).transpose(1, 2) # apply rotary positional encoding if rope: q = rope(q) k = rope(k) # compute grouped query attention q = q.contiguous() k = k.contiguous() v = v.contiguous() output = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=self.dropout, enable_gqa=True) output = output.transpose(1, 2).reshape(q_batch_size, q_seq_len, hidden_dim).contiguous() output = self.out_proj(output) return output |
The forward function includes two optional arguments: mask and rope. The rope argument expects a module that applies rotary positional encoding, which was covered in the previous lesson. The mask argument will be explained in the next lesson.
Your Task
Consider why this implementation is called grouped query attention. The original transformer architecture uses multihead attention. How would you modify this grouped query attention implementation to create a multihead attention mechanism?
In the next lesson, you’ll learn about masking in attention operations.
Lesson 05: Causal Mask
A key characteristic of decoder-only transformer models is the use of
causal masks in their attention layers. A causal mask is a matrix
applied during attention score calculation to prevent the model from attending to future tokens. Specifically, a query token 𝑖 can only attend to key tokens 𝑗 where 𝑗 ≤𝑖.
With query and key sequences of length 𝑁, the causal mask is a square matrix of shape (𝑁,𝑁). The element (𝑖,𝑗) indicates whether query token 𝑖 can attend to the key token 𝑗.
In a boolean mask matrix, the element (𝑖,𝑗) is True for 𝑖 ≤𝑗, making all elements on and below the diagonal True. However, we typically use a floating-point matrix because we can simply add it to the attention score matrix before applying softmax normalization. In this case, elements where 𝑖 ≤𝑗 are set to 0, and all other elements are set to −∞.
Creating such a causal mask is straightforward in PyTorch:
mask = torch.triu(torch.full((N, N), float('-inf')), diagonal=1) |
This creates a matrix of shape (𝑁,𝑁) filled with −∞, then uses the triu() function to zero out all elements on and below the diagonal, creating an upper-triangular matrix.
Applying the mask in attention is straightforward:
output = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, enable_gqa=True) |
In some cases, you might need to mask additional elements,
such as padding tokens in the sequence. This can be done by setting the
corresponding elements to −∞
in the mask tensor. While the example above shows a 2D tensor, when
using both causal and padding masks, you’ll need to create a 3D tensor.
In this case, each element in the batch has its own mask, and the first
dimension of the mask tensor should match the batch dimension of the
input tensors q, k, and v.
Your Task
Given the scaled_dot_product_attention() call above and a tensor q of shape (𝐵,𝐻,𝑁,𝐷) containing some padding tokens, how would you create a mask tensor of shape (𝐵,𝑁,𝑁)
that combines both causal and padding masks to: (1) prevent attention
to future tokens and (2) mask all attention operations involving padding
tokens?
In the next lesson, you will learn about MLP sublayer.
Lesson 06: Mixture of Expert Models
Transformer models consist of stacked transformer blocks, where each
block contains an attention sublayer and an MLP sublayer. The attention
sublayer implements a multi-head attention mechanism, while the MLP
sublayer is a feed-forward network.
The MLP sublayer introduces non-linearity to the model and is where
much of the model’s “intelligence” resides. To enhance the model’s
capabilities, you can either increase the size of the feed-forward
network or employ a more sophisticated architecture like Mixture of
Experts (MoE).
MoE is a recent innovation in transformer models. It consists of
multiple parallel MLP sublayers with a router that selects a subset of
them to process the input. The final output is a weighted sum of the
outputs from the selected MLP sublayers. Many modern large language
models use SwiGLU as their MLP sublayer, which combines three linear
transformations with a SiLU activation function. Here’s how to implement
it:
class SwiGLU(nn.Module): def __init__(self, hidden_dim, intermediate_dim): super().__init__() self.gate = nn.Linear(hidden_dim, intermediate_dim) self.up = nn.Linear(hidden_dim, intermediate_dim) self.down = nn.Linear(intermediate_dim, hidden_dim) self.act = nn.SiLU() def forward(self, x): x = self.act(self.gate(x)) * self.up(x) x = self.down(x) return x |
For example, in a system with 8 MLP sublayers, the router
processes each input token using a linear layer to produce 8 scores. The
top 2 scoring sublayers are selected to process the input, and their
outputs are combined using weighted summation.
Since PyTorch doesn’t yet provide a built-in MoE layer, you need to implement it yourself. Here’s an implementation:
class MoELayer(nn.Module): def __init__(self, hidden_dim, intermediate_dim, num_experts, top_k=2): super().__init__() self.num_experts = num_experts self.top_k = top_k # Create expert networks self.experts = nn.ModuleList([ SwiGLU(hidden_dim, intermediate_dim) for _ in range(num_experts) ]) self.router = nn.Linear(hidden_dim, num_experts) def forward(self, hidden_states): batch_size, seq_len, hidden_dim = hidden_states.shape # Reshape for expert processing, then compute routing probabilities hidden_states_reshaped = hidden_states.view(-1, hidden_dim) # shape of router_logits: (batch_size * seq_len, num_experts) router_logits = self.router(hidden_states_reshaped) # Select top-k experts, then softmax output probabilities will sum to 1 # output shape: (batch_size * seq_len, k) top_k_logits, top_k_indices = torch.topk(router_logits, self.top_k, dim=-1) top_k_probs = F.softmax(top_k_logits, dim=-1) # Allocate output tensor output = torch.zeros(batch_size * seq_len, hidden_dim, device=hidden_states.device, dtype=hidden_states.dtype) # Process through selected experts unique_experts = torch.unique(top_k_indices) for i in unique_experts: expert_id = int(i) # token_mask (boolean tensor) = which token of the input should use this expert # token_mask shape: (batch_size * seq_len,) mask = (top_k_indices == expert_id) token_mask = mask.any(dim=1) assert token_mask.any(), f"Expecting some tokens using expert {expert_id}" # select tokens, apply the expert, then add to the output expert_input = hidden_states_reshaped[token_mask] expert_weight = top_k_probs[mask].unsqueeze(-1) # shape: (N, 1) expert_output = self.experts[expert_id](expert_input) # shape: (N, hidden_dim) output[token_mask] += expert_output * expert_weight # Reshape back to original shape output = output.view(batch_size, seq_len, hidden_dim) return output |
The forward() method first uses the router to generate top_k_indices and top_k_probs.
Based on these indices, it selects and applies the corresponding
experts to process the input. The results are combined using weighted
summation with top_k_probs. The input is a 3D tensor of shape (batch_size, seq_len, hidden_dim),
and since each token in a sequence can be processed by different
experts, the method uses masking to correctly apply the weighted sum.
Your Task
Models like DeepSeek V2 incorporate a shared expert in their
MoE architecture. It is an expert that processes every input regardless
of routing. Can you modify the code above to include a shared expert?
In the next lesson, you will learn about normalization layers.
Lesson 07: RMS Norm and Skip Connections
A Transformer is a typical deep learning model that can easily stack
hundreds of transformer blocks, with each block containing multiple
operations.
Such deep models are sensitive to the vanishing gradient problem.
Normalization layers are added to mitigate this issue and stabilize the
training.
The two most common normalization layers in transformer models are
Layer Norm and RMS Norm. We will use RMS Norm because it has fewer
parameters. Using the built-in RMS Norm layer in PyTorch is
straightforward:
rms_norm = nn.RMSNorm(hidden_dim) output_rms = rms_norm(x) |
There are two ways to use RMS Norm in a transformer model:
pre-norm and post-norm. In pre-norm, you apply RMS Norm before the
attention and feed-forward sublayers, while in post-norm, you apply it
after. This difference becomes clear when considering the skip
connections. Here’s an example of a decoder-only transformer block with
pre-norm:
class DecoderLayer(nn.Module): def __init__(self, hidden_dim, num_heads, num_kv_heads, moe_experts, moe_topk, dropout=0.1): super().__init__() self.self_attn = GQA(hidden_dim, num_heads, num_kv_heads, dropout) self.mlp = MoELayer(hidden_dim, 4 * hidden_dim, moe_experts, moe_topk) self.norm1 = nn.RMSNorm(hidden_dim) self.norm2 = nn.RMSNorm(hidden_dim) def forward(self, x, mask=None, rope=None): # self-attention sublayer out = self.norm1(x) out = self.self_attn(out, out, out, mask, rope) x = out + x # MLP sublayer out = self.norm2(x) out = self.mlp(out) return out + x |
Each transformer block contains an attention sublayer
(implemented using the GQA class from lesson 4) and a feed-forward
sublayer (implemented using the MoE class from lesson 6), along with two
RMS Norm layers.
In the forward() method, we first normalize the input
before applying the attention sublayer. Then, for the skip connection,
we add the original unnormalized input to the attention sublayer’s
output. In a post-norm approach, we would instead apply attention to the
unnormalized input and then normalize the tensor after the skip
connection. Research has shown that the pre-norm approach provides more
stable training.
Your Task
From the description above, how would you modify the code to make it a post-norm transformer block?
In the next lesson, you will learn to create the complete transformer model.
Lesson 08: The Complete Transformer Model
So far, you have created all the building blocks of the transformer
model. You can build a complete transformer model by stacking these
blocks together. Before doing that, let’s list out the design parameters
by creating a dictionary for the model configuration:
model_config = { "num_layers": 8, "num_heads": 8, "num_kv_heads": 4, "hidden_dim": 768, "moe_experts": 8, "moe_topk": 3, "max_seq_len": 512, "vocab_size": len(tokenizer.get_vocab()), "dropout": 0.1, } |
The number of transformer blocks and the hidden dimension
directly determine the model size. You can think of them as the “depth”
and “width” of the model respectively. For each transformer block, you
need to specify the number of attention heads (and in GQA, the number of
key-value heads). Since we’re using the MoE model, you also need to
define the total number of experts and the top-k value. Note that the
MLP sublayer (implemented as SwiGLU) typically sets the intermediate
dimension to 4 times the hidden dimension, so you don’t need to specify
this separately.
The remaining hyperparameters don’t affect the model size: the
maximum sequence length (which the rotary positional encoding depends
on), the vocabulary size (which determines the embedding matrix
dimensions), and the dropout rate used during training.
With these, you can create a transformer model. Let’s call it TextGenerationModel:
class TextGenerationModel(nn.Module): def __init__(self, num_layers, num_heads, num_kv_heads, hidden_dim, moe_experts, moe_topk, max_seq_len, vocab_size, dropout=0.1): super().__init__() self.rope = RotaryPositionalEncoding(hidden_dim // num_heads, max_seq_len) self.embedding = nn.Embedding(vocab_size, hidden_dim) self.decoders = nn.ModuleList([ DecoderLayer(hidden_dim, num_heads, num_kv_heads, moe_experts, moe_topk, dropout) for _ in range(num_layers) ]) self.norm = nn.RMSNorm(hidden_dim) self.out = nn.Linear(hidden_dim, vocab_size) def forward(self, ids, mask=None): x = self.embedding(ids) for decoder in self.decoders: x = decoder(x, mask, self.rope) x = self.norm(x) return self.out(x) model = TextGenerationModel(**model_config) |
In this model, we create a single rotary position encoding
module that’s reused across all transformer blocks. Since it’s a
constant module, we only need one instance. The model begins with an
embedding layer that converts token IDs into embedding vectors. These
vectors are then processed through a series of transformer blocks. The
output from the final transformer block remains a sequence of embedding
vectors, which we normalize and project to vocabulary-sized logits using
a linear layer. These logits represent probability distributions for
predicting the next token in the sequence.
Your Task
The model is now complete. However, consider this question: Why does the forward()
method accept a mask as an optional argument? If we’re using a causal
mask, wouldn’t it make more sense to generate it internally within the
model?
In the next lesson, you will learn to train the model.
Lesson 09: Training the Model
Now that you’ve built a model, let’s learn how to train it. In lesson
1, you prepared the dataset for training. The next step is to wrap the
dataset as a PyTorch Dataset object:
class GutenbergDataset(torch.utils.data.Dataset): def __init__(self, text, tokenizer, seq_len=512): self.seq_len = seq_len # Encode the entire text self.encoded = tokenizer.encode(text).ids def __len__(self): return len(self.encoded) - self.seq_len def __getitem__(self, idx): chunk = self.encoded[idx:idx + self.seq_len + 1] # +1 for target x = torch.tensor(chunk[:-1]) y = torch.tensor(chunk[1:]) return x, y BATCH_SIZE = 32 text = "\n".join(get_dataset_text()) dataset = GutenbergDataset(text, tokenizer, seq_len=model_config["max_seq_len"]) dataloader = torch.utils.data.DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True) |
This dataset is designed for model pre-training, where the task is to predict the next token in a sequence. The dataset
object is a Python iterable that produces pairs of (x,y), where x is a
sequence of token IDs with fixed length, and y is the corresponding next
token. Since the training targets (y) are derived from the input data
itself, this approach is called self-supervised learning.
Depending on your hardware, you can optimize the training speed and
memory usage. If you have a GPU with limited memory, you can load the
model onto the GPU and use half-precision (bfloat16) to reduce memory
consumption. Here’s how:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') model = model.to(device).to(torch.bfloat16) |
If you still encounter out of memory error, you may want to reduce the model size or batch size.
You need to write a training loop to train the model. In PyTorch, you may do as follows:
N_EPOCHS = 2 LR = 0.0005 WARMUP_STEPS = 2000 CLIP_NORM = 6.0 optimizer = optim.AdamW(model.parameters(), lr=LR) loss_fn = nn.CrossEntropyLoss(ignore_index=tokenizer.token_to_id("[pad]")) # Learning rate scheduling warmup_scheduler = optim.lr_scheduler.LinearLR( optimizer, start_factor=0.01, end_factor=1.0, total_iters=WARMUP_STEPS) cosine_scheduler = optim.lr_scheduler.CosineAnnealingLR( optimizer, T_max=N_EPOCHS * len(dataloader) - WARMUP_STEPS, eta_min=0) scheduler = optim.lr_scheduler.SequentialLR( optimizer, schedulers=[warmup_scheduler, cosine_scheduler], milestones=[WARMUP_STEPS]) print(f"Training for {N_EPOCHS} epochs with {len(dataloader)} steps per epoch") best_loss = float('inf') for epoch in range(N_EPOCHS): model.train() epoch_loss = 0 progress_bar = tqdm.tqdm(dataloader, desc=f"Epoch {epoch+1}/{N_EPOCHS}") for x, y in progress_bar: x = x.to(device) y = y.to(device) # Create causal mask mask = create_causal_mask(x.shape[1], device, torch.bfloat16) # Forward pass optimizer.zero_grad() outputs = model(x, mask.unsqueeze(0)) # Compute loss loss = loss_fn(outputs.view(-1, outputs.shape[-1]), y.view(-1)) # Backward pass loss.backward() torch.nn.utils.clip_grad_norm_( model.parameters(), CLIP_NORM, error_if_nonfinite=True ) optimizer.step() scheduler.step() epoch_loss += loss.item() # Show loss in tqdm progress_bar.set_postfix(loss=loss.item()) avg_loss = epoch_loss / len(dataloader) print(f"Epoch {epoch+1}/{N_EPOCHS}; Avg loss: {avg_loss:.4f}") # Save checkpoint if loss improved if avg_loss < best_loss: best_loss = avg_loss torch.save(model.state_dict(), "textgen_model.pth") |
While this training loop might differ from what you’ve used
for other models, it follows best practices for training transformers.
The code uses a cosine learning rate scheduler with a warm-up period—the
learning rate gradually increases during warm-up and then decreases
following a cosine curve.
To prevent gradient explosion, we implement gradient clipping, which
stabilizes training by limiting drastic changes in model parameters.
The model functions as a next-token predictor, outputting a
probability distribution over the entire vocabulary. Since this is
essentially a classification task (predicting which token comes next),
we use cross-entropy loss for training.
The training progress is monitored using tqdm, which displays the
loss for each epoch. The model’s parameters are saved whenever the loss
improves, ensuring we keep the best performing version.
Your Task
The training loop above runs for only two epochs. Consider why this
number is relatively small, and what factors might make additional
epochs unnecessary for this particular task.
In the next lesson, you will learn to use the model.
Lesson 10: Using the Model
After training the model, you can use it to generate text. To
optimize performance, disable gradient computation in PyTorch.
Additionally, since some modules like dropout behave differently during
training and inference, switch the model to evaluation mode before use.
Let’s create a function for text generation that can be called multiple times to generate different samples:
def generate_text(model, tokenizer, prompt, max_length=100, temperature=1.0): model.eval() device = next(model.parameters()).device # Encode the prompt, set tensor to batch size of 1 input_ids = torch.tensor(tokenizer.encode(prompt).ids).unsqueeze(0).to(device) with torch.no_grad(): for _ in range(max_length): # Get model predictions for the next token as the last element of the output outputs = model(input_ids) next_token_logits = outputs[:, -1, :] / temperature # Sample from the distribution probs = F.softmax(next_token_logits, dim=-1) next_token = torch.multinomial(probs, num_samples=1) # Append to input_ids input_ids = torch.cat([input_ids, next_token], dim=1) # Stop if we predict the end token if next_token[0].item() == tokenizer.token_to_id("[eos]"): break return tokenizer.decode(input_ids[0].tolist()) # Test the model with some prompts test_prompts = [ "Once upon a time,", "We the people of the", "In the beginning was the", ] print("\nGenerating sample texts:") for prompt in test_prompts: generated = generate_text(model, tokenizer, prompt) print(f"\nPrompt: {prompt}") print(f"Generated: {generated}") print("-" * 80) |
The generate_text() function implements
probabilistic sampling for token generation. Although the model outputs
logits representing a probability distribution over the vocabulary, it
doesn’t always select the most probable token. Instead, it uses the
softmax function to convert logits to probabilities. The temperature
parameter controls the sampling distribution: lower values make the
model more conservative by emphasizing likely tokens, while higher
values make it more creative by reducing the probability differences
between tokens.
The function takes a partial sentence as a prompt string
and generates a sequence of tokens using the model. Although the model
is trained with batches, this function uses a batch size of 1 for
simplicity. The final output is returned as a decoded string.
Your Task
Look at the code above: Why does the function need to determine the model’s device at the beginning?
The current implementation uses a simple sampling approach. An advanced technique called nucleus sampling (or top-p sampling) considers only the most likely tokens whose cumulative probability exceeds a threshold 𝑝. How would you modify the code to implement nucleus sampling?
This is the last lesson.
The End! (Look How Far You Have Come)
You made it. Well done!
Take a moment and look back at how far you have come.
- You discovered what are transformer models and their architecture.
- You learned how to build a transformer model from scratch.
- You learned how to train and use a transformer model.
Don’t make light of this; you have come a long way in a short time.
This is just the beginning of your transformer model journey. Keep
practicing and developing your skills.
Summary
How did you do with the mini-course?
Did you enjoy this crash course?
Do you have any questions? Were there any sticking points?
Let me know. Leave a comment below.