Today's large language models are a simplified form of the transformer model. They are called decoder-only models because their role is similar to the transformer's decoder, which generates an output sequence given a partial sequence as input. Architecturally, however, they are closer to the transformer's encoder. In this post, you will build a decoder-only transformer model for text generation, following the same architecture as Meta's Llama-2 or Llama-3. Specifically, you will learn:
- How to build a decoder-only model
- The variations in the architecture design of the decoder-only model
- How to train the model
Kick-start your project with my book Building Transformer Models From Scratch with PyTorch. It provides self-study tutorials with working code.
Let’s get started.

Building a Decoder-Only Transformer Model for Text Generation
Photo by Jay. Some rights reserved.
Overview
This post is divided into five parts; they are:
- From a Full Transformer to a Decoder-Only Model
- Building a Decoder-Only Model
- Data Preparation for Self-Supervised Learning
- Training the Model
- Extensions
From a Full Transformer to a Decoder-Only Model
The transformer model originated as a sequence-to-sequence (seq2seq) model that converts an input sequence into a context vector, which is then used to generate a new sequence. In this architecture, the encoder part is responsible for converting the input sequence into a context vector, while the decoder part generates the new sequence from this context vector.
Instead of using the context vector to generate an entirely new sequence, can we project it into a vector of logits representing probabilities for each token in the vocabulary? This way, given a partial sequence as input, the model can predict the next most likely token. By iteratively feeding the sequence back into the model, we can generate coherent text one token at a time, much like auto-complete functions in text editors. This simplified architecture, which focuses solely on predicting the next token, is called a decoder-only model.
Building a Decoder-Only Model
A decoder-only model has a simpler architecture than a full transformer model. Starting with the full transformer architecture discussed in the previous post, you can create a decoder-only model by removing the encoder component entirely and adapting the decoder for standalone operation.
The implementation reuses a significant portion of the code from the full transformer model. The DecoderLayer class shares the same structure as the EncoderLayer from the previous implementation. The TextGenerationModel class features a simplified forward() method, since it no longer needs to handle encoder-decoder interactions: it simply converts input token IDs into embeddings, processes them through the stacked decoder layers, and projects the output into logits representing probabilities for each token in the vocabulary.
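As an illustrative sketch, here is a minimal decoder-only model in PyTorch. For brevity it uses standard nn.MultiheadAttention, LayerNorm, and learned positional embeddings rather than the RMSNorm, rotary embeddings, and grouped-query attention of the Llama design; the class names DecoderLayer and TextGenerationModel mirror the ones described above:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder block: causal self-attention plus a feed-forward network (pre-norm)."""
    def __init__(self, dim, num_heads, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, attn_mask):
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + h                       # residual connection around attention
        x = x + self.ff(self.norm2(x))  # residual connection around feed-forward
        return x

class TextGenerationModel(nn.Module):
    """Token embeddings -> stacked decoder layers -> logits over the vocabulary."""
    def __init__(self, vocab_size, dim=768, num_layers=8, num_heads=8, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)  # learned positions; Llama uses rotary instead
        self.layers = nn.ModuleList(DecoderLayer(dim, num_heads) for _ in range(num_layers))
        self.norm = nn.LayerNorm(dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, ids):
        seq_len = ids.size(1)
        # Causal mask: position i may only attend to positions <= i
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=ids.device), 1)
        x = self.embed(ids) + self.pos(torch.arange(seq_len, device=ids.device))
        for layer in self.layers:
            x = layer(x, mask)
        return self.out(self.norm(x))
```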
As a picture, the model looks like the following. It shares the same architectural design as the Llama-2/Llama-3 models proposed by Meta:

Decoder-Only Model Following the Architecture of Llama-2/Llama-3
Data Preparation for Self-Supervised Learning
Our goal is to create a model that can generate coherent paragraphs of text from a given prompt, even if that prompt is just a single word. To train such a model effectively, we need to consider our training approach and data requirements carefully.
The training technique we’ll use is called self-supervised learning. Unlike traditional supervised learning, which requires manually labeled data, self-supervised learning leverages the inherent structure of the text itself. When we input a sequence of text, the model learns to predict the next token, and the actual next token in the text serves as the ground truth. This eliminates the need for manual labeling.
The size of the training dataset is crucial. With a vocabulary size of 𝑁 tokens and a dataset containing 𝑀 words, each token appears approximately 𝑀/𝑁 times on average. To ensure the model learns meaningful representations for all tokens, this ratio needs to be sufficiently large.
In this post, you will download some novels from Project Gutenberg and use them as the dataset to train the model.
These public domain novels, written by various authors across different genres, provide a diverse dataset that will help our model learn a wide range of vocabulary and writing styles.
With these novels downloaded, you can extract the main context as a string and keep these strings as a list:
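A possible sketch of that extraction step is below. The gutenberg/ directory name and the regex for the Gutenberg header/footer markers are assumptions about how you saved the files:

```python
import glob
import re

def get_dataset_text(pattern="gutenberg/*.txt"):
    """Read each downloaded novel and strip the Project Gutenberg header and footer."""
    texts = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            raw = f.read()
        # Gutenberg files wrap the body between "*** START OF ... ***" and "*** END OF ... ***"
        match = re.search(r"\*\*\* START OF.*?\*\*\*(.*?)\*\*\* END OF", raw, re.S)
        texts.append(match.group(1).strip() if match else raw)
    return texts
```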
The next step is to create a tokenizer. You can build a naive tokenizer by splitting the text into words. You can also use the Byte-Pair Encoding (BPE) algorithm to create a more sophisticated tokenizer, as follows:
This uses the tokenizers library to train a BPE tokenizer. It calls get_dataset_text() to gather the text of all the novels, then trains the tokenizer on it. You also need two special tokens: [pad] and [eos]. Most importantly, the [eos] token indicates the end of a sequence: if your model generates this token, you know generation can stop.
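A minimal sketch of that tokenizer training with the tokenizers library; the vocab_size value and the [unk] token are illustrative choices:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_tokenizer(texts, vocab_size=10000):
    """Train a BPE tokenizer on the novel texts, reserving the special tokens."""
    tokenizer = Tokenizer(models.BPE(unk_token="[unk]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[pad]", "[eos]", "[unk]"],  # [pad] for batching, [eos] to end generation
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer
```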
Training the Model
With the tokenizer and the dataset ready, you can now train the model.
First, you need to create a Dataset object that can be used to train the model. PyTorch provides a framework for this.
This Dataset object is used to create a DataLoader object that can be used to train the model. The DataLoader object will automatically batch the data and shuffle it.
The Dataset object produces a pair of input and output sequences in its __getitem__() method. They have the same length but are offset by one token. When the input sequence is passed to the model, the model generates the next token for each position in the sequence; hence, the ground truth output comes from the same source, offset by one. This is how you set up self-supervised training.
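A minimal sketch of such a Dataset, assuming the whole corpus has been tokenized into one long list of token IDs. The class name and the non-overlapping windowing scheme are illustrative choices:

```python
import torch
from torch.utils.data import Dataset

class GutenbergDataset(Dataset):
    """Slices one long stream of token IDs into fixed-length training windows."""
    def __init__(self, token_ids, seq_len=512):
        self.ids = token_ids
        self.seq_len = seq_len

    def __len__(self):
        # Each window needs seq_len + 1 tokens: the input plus the one-token-shifted target
        return (len(self.ids) - 1) // self.seq_len

    def __getitem__(self, idx):
        start = idx * self.seq_len
        chunk = self.ids[start : start + self.seq_len + 1]
        x = torch.tensor(chunk[:-1], dtype=torch.long)  # input sequence
        y = torch.tensor(chunk[1:], dtype=torch.long)   # same tokens, offset by one
        return x, y
```

A DataLoader can then wrap it, e.g. `DataLoader(dataset, batch_size=32, shuffle=True)`, to handle batching and shuffling.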
Now you can create the model and train it. You could use this code to create a very large model, but if you do not need the model to be very powerful, you can design a smaller one. Let's make one with:
- 8 layers
- Attention with 8 query heads and 4 key-value heads
- Hidden dimension of 768
- Maximum sequence length of 512
- Dropout of 0.1 in the attention layers
- AdamW optimizer with an initial learning rate of 0.0005
- A learning rate schedule with 2,000 warmup steps followed by cosine annealing
- Training for 2 epochs with batch size 32, clipping the gradient norm at 6.0
These settings are typical. Training a decoder-only model usually requires a very large dataset, so the number of epochs may be as low as 1; what matters is the number of training steps. The linear warmup gradually increases the learning rate at the beginning of training, which reduces sensitivity to how the model was initialized. Cosine annealing then gradually decreases the learning rate so that, near the end of training when the model is almost converged, it stays at a very small value to stabilize the result.
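That warmup-then-cosine schedule can be sketched as a LambdaLR multiplier. The total_steps value here is an assumption; in practice you would derive it from the dataset size, batch size, and number of epochs:

```python
import math

def lr_lambda(step, warmup_steps=2000, total_steps=50000):
    """Multiplier on the base learning rate: linear warmup, then cosine decay toward zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

# Typical usage with PyTorch:
# optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```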
The code for model creation and training is as follows:
The training loop performs the usual forward and backward passes. The model is saved whenever the loss improves. For simplicity, no evaluation is implemented; you should evaluate the model regularly (not necessarily after every epoch) to monitor progress.
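A sketch of such a loop, under the assumption that model(x) returns logits of shape (batch, seq_len, vocab) and using a hypothetical checkpoint filename best_model.pt:

```python
import torch
import torch.nn as nn

def train(model, loader, optimizer, scheduler, epochs=2, clip_norm=6.0, device="cpu"):
    """Next-token training loop; checkpoints whenever the batch loss improves."""
    loss_fn = nn.CrossEntropyLoss()
    best_loss = float("inf")
    model.to(device).train()
    for epoch in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)  # (batch, seq_len, vocab)
            # Flatten batch and sequence dims so each position is one classification
            loss = loss_fn(logits.flatten(0, 1), y.flatten())
            optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
            optimizer.step()
            scheduler.step()  # per-step schedule: warmup then cosine annealing
            if loss.item() < best_loss:
                best_loss = loss.item()
                torch.save(model.state_dict(), "best_model.pt")
    return best_loss
```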
Due to the large vocabulary size and sequence length, the training process is computationally intensive. Even on a high-end RTX 4090 GPU, each epoch takes approximately 10 hours to complete.
Once the training is done, you can load the model and generate text:
Text is generated in the generate_text() function. It takes a partial sentence as the input prompt, and the model generates the next token at each step of the for-loop. The generation algorithm uses probability sampling rather than always choosing the most likely token, which allows the model to produce more creative text. The temperature parameter controls how creative the generated text is.
The output from the model is a vector of logits, and the sampling process generates a vector of token IDs. This vector will be converted back to a string by the tokenizer.
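A sketch of such a generation function, assuming the tokenizer comes from the tokenizers library (so it exposes encode().ids, token_to_id(), and decode()) and the model returns logits of shape (batch, seq_len, vocab):

```python
import torch

def generate_text(model, tokenizer, prompt, max_tokens=200, temperature=0.8, device="cpu"):
    """Autoregressive generation: feed the sequence back in, sample the next token."""
    model.eval()
    ids = tokenizer.encode(prompt).ids
    eos_id = tokenizer.token_to_id("[eos]")
    with torch.no_grad():
        for _ in range(max_tokens):
            x = torch.tensor([ids], dtype=torch.long, device=device)
            logits = model(x)[0, -1]  # logits for the last position only
            # Temperature scaling: lower -> sharper distribution, less creative
            probs = torch.softmax(logits / temperature, dim=-1)
            next_id = torch.multinomial(probs, num_samples=1).item()
            if next_id == eos_id:
                break  # model signaled the end of the sequence
            ids.append(next_id)
    return tokenizer.decode(ids)
```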
If you run this code, you may see the following output:
While the generated text shows some coherence and understanding of language patterns, it’s not perfect. However, considering the relatively small size of our model and limited training data, these results are encouraging.
For completeness, below is the full code of the model and the training:
Extensions
While we’ve successfully implemented a basic decoder-only model, modern large language models (LLMs) are significantly more sophisticated. Here are key areas for improvement:
- Scale and Architecture: Modern LLMs use many more layers and larger hidden dimensions. They also incorporate advanced techniques beyond what we’ve implemented here, such as mixture of experts.
- Dataset Size and Diversity: Our current dataset, consisting of a few megabytes of novel text, is tiny compared to the terabyte-scale datasets used in modern LLMs. Production models are trained on diverse content types across multiple languages.
- Training Pipeline: What we’ve implemented is called “pretraining” in LLM development. Production models typically undergo additional fine-tuning phases for specific tasks, such as question-answering or instruction-following, using specialized datasets and tailored training objectives.
- Training Infrastructure: Training larger models requires sophisticated techniques for distributed training across multiple GPUs, gradient accumulation, and other optimizations that would require significant modifications to our training loop.
Further Reading
Below are some links that you may find useful:
- LLaMA: Open and Efficient Foundation Language Models
- BloombergGPT: A Large Language Model for Finance
- SmolLM2: When Smol Goes Big — Data Centric Training of a Small Language Model
- What is the difference between pre-training, fine-tuning, and instruct-tuning exactly?
Summary
In this post, you’ve walked through the process of building a decoder-only transformer model for text generation. In particular, you’ve learned:
- Understanding how to simplify a full transformer architecture into a decoder-only model
- Implementing self-supervised learning for text generation tasks
- Creating a text generation pipeline using the trained model
This decoder-only architecture serves as the foundation for many modern large language models, making it a crucial concept to understand in the field of natural language processing.
