After pretraining, a language model has learned the structure of human language. You can enhance the model's domain-specific understanding by training it on additional data, and you can also train it to perform specific tasks when given a specific instruction. This additional training after pretraining is called fine-tuning. In this article, you will learn how to fine-tune a language model. Specifically, you will learn:
- Different examples of fine-tuning and what their goals are
- How to convert a pretraining script to perform fine-tuning
Let’s get started!

A Gentle Introduction to Language Model Fine-tuning
Photo by Nick Night. Some rights reserved.
Overview
This article is divided into four parts; they are:
- The Reason for Fine-tuning a Model
- Dataset for Fine-tuning
- Fine-tuning Procedure
- Other Fine-Tuning Techniques
The Reason for Fine-tuning a Model
Once you train your decoder-only transformer model, you have a text generator. You can provide any prompt, and the model will generate some text. What it generates depends on the model you have.
Let’s consider a very simple generation algorithm:
```python
...

def apply_repetition_penalty(logits: Tensor, tokens: list[int], penalty: float) -> Tensor:
    """Apply repetition penalty to the logits."""
    for tok in tokens:
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

@torch.no_grad()
def generate(model, tokenizer, prompt, max_tokens=100, temperature=1.0,
             repetition_penalty=1.0, repetition_penalty_range=10, top_k=50,
             device=None) -> str:
    """Generate text autoregressively from a prompt.

    Args:
        model: The trained LlamaForPretraining model
        tokenizer: The tokenizer
        prompt: Input text prompt
        max_tokens: Maximum number of tokens to generate
        temperature: Sampling temperature (higher = more random)
        repetition_penalty: Penalty for repeating tokens
        repetition_penalty_range: Number of previous tokens to consider for repetition penalty
        top_k: Only sample from top k most likely tokens
        device: Device the model is loaded on

    Returns:
        Generated text
    """
    # Turn model to evaluation mode: Norm layer will work differently
    model.eval()

    # Get special token IDs
    bot_id = tokenizer.token_to_id("[BOT]")
    eot_id = tokenizer.token_to_id("[EOT]")

    # Tokenize the prompt into integer tensor
    prompt_tokens = [bot_id] + tokenizer.encode(" " + prompt).ids
    input_ids = torch.tensor(prompt_tokens, dtype=torch.int64, device=device).unsqueeze(0)

    # Recursively generate tokens
    generated_tokens = []
    for _step in range(max_tokens):
        # Forward pass through model
        logits = model(input_ids)
        # Get logits for the last token
        next_token_logits = logits[0, -1, :] / temperature
        # Apply repetition penalty
        if repetition_penalty != 1.0 and len(generated_tokens) > 0:
            next_token_logits = apply_repetition_penalty(
                next_token_logits,
                generated_tokens[-repetition_penalty_range:],
                repetition_penalty,
            )
        # Apply top-k filtering
        if top_k > 0:
            top_k_logits = torch.topk(next_token_logits, top_k)[0]
            indices_to_remove = next_token_logits < top_k_logits[-1]
            next_token_logits[indices_to_remove] = float("-inf")
        # Sample from the filtered distribution
        probs = F.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        # Early stop if EOT token is generated
        if next_token.item() == eot_id:
            break
        # Append the new token to input_ids for next iteration
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=1)
        generated_tokens.append(next_token.item())

    # Decode all generated tokens
    return tokenizer.decode(generated_tokens)
```
The function generate() above is an inefficient yet simple sampling-based text-generation method. Your model takes a prompt and produces a tensor of logits for the next token. They are called logits because they are the unnormalized log-probabilities of the next token. The model works on tokens. To generate a token, a few steps are applied to the logits:
- Scale the logits by a temperature parameter. This skews the probabilities of the next token to pick.
- Manipulate the logits. In the above, you applied a repetition penalty to discourage tokens that already exist in the generated sequence. You also applied top-k filtering to limit the choice to the k most likely tokens.
- Convert the logits to probabilities, then use a multinomial sampling algorithm to select the next token.
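In isolation, the three steps look like this. This is a minimal sketch with hypothetical logits; `sample_next_token` is an illustrative helper, not part of the script above:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0, top_k: int = 0) -> int:
    """Temperature-scale the logits, keep only the top-k, then sample one token id."""
    logits = logits / temperature                        # step 1: temperature scaling
    if top_k > 0:                                        # step 2: top-k filtering
        kth_best = torch.topk(logits, top_k).values[-1]
        logits = logits.clone()
        logits[logits < kth_best] = float("-inf")
    probs = F.softmax(logits, dim=-1)                    # step 3: logits -> probabilities
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # hypothetical next-token logits
token = sample_next_token(logits, temperature=0.8, top_k=2)
# only token ids 0 and 1 survive the top-2 filter, so token is either 0 or 1
```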
You can make this simpler by always using torch.argmax() to pick the next token. This is called greedy decoding. It is generally not preferred because the output does not sound natural and permits no variation.
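For contrast, greedy decoding collapses the whole sampling pipeline to a single argmax (hypothetical logits for illustration):

```python
import torch

logits = torch.tensor([0.1, 2.5, -0.3, 1.0])  # hypothetical next-token logits
next_token = torch.argmax(logits).item()      # always picks the most likely token
# deterministic: token 1 is selected every time for these logits
```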
You can try the generate() function with your own model trained in the previous article. Below is the complete code to generate text from a simple prompt:
```python
import dataclasses

import tokenizers
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor


# Model architecture same as training script

@dataclasses.dataclass
class LlamaConfig:
    """Define Llama model hyperparameters."""
    vocab_size: int = 50000
    max_position_embeddings: int = 2048
    hidden_size: int = 768
    intermediate_size: int = 4*768
    num_hidden_layers: int = 12
    num_attention_heads: int = 12
    num_key_value_heads: int = 3

class RotaryPositionEncoding(nn.Module):
    """Rotary position encoding."""
    def __init__(self, dim: int, max_position_embeddings: int) -> None:
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        N = 10_000.0
        inv_freq = 1.0 / (N ** (torch.arange(0, dim, 2) / dim))
        inv_freq = torch.cat((inv_freq, inv_freq), dim=-1)
        position = torch.arange(max_position_embeddings)
        sinusoid_inp = torch.outer(position, inv_freq)
        self.register_buffer("cos", sinusoid_inp.cos())
        self.register_buffer("sin", sinusoid_inp.sin())

    def forward(self, x: Tensor) -> Tensor:
        batch_size, seq_len, num_heads, head_dim = x.shape
        device = x.device
        dtype = x.dtype
        cos = self.cos.to(device, dtype)[:seq_len].view(1, seq_len, 1, -1)
        sin = self.sin.to(device, dtype)[:seq_len].view(1, seq_len, 1, -1)
        x1, x2 = x.chunk(2, dim=-1)
        rotated = torch.cat((-x2, x1), dim=-1)
        return (x * cos) + (rotated * sin)

class LlamaAttention(nn.Module):
    """Grouped-query attention with rotary embeddings."""
    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.num_kv_heads = config.num_key_value_heads
        assert (self.head_dim * self.num_heads) == self.hidden_size
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(self.hidden_size, self.num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(self.hidden_size, self.num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)

    def forward(self, hidden_states: Tensor, rope: RotaryPositionEncoding) -> Tensor:
        bs, seq_len, dim = hidden_states.size()
        query_states = self.q_proj(hidden_states).view(bs, seq_len, self.num_heads, self.head_dim)
        key_states = self.k_proj(hidden_states).view(bs, seq_len, self.num_kv_heads, self.head_dim)
        value_states = self.v_proj(hidden_states).view(bs, seq_len, self.num_kv_heads, self.head_dim)
        attn_output = F.scaled_dot_product_attention(
            rope(query_states).transpose(1, 2),
            rope(key_states).transpose(1, 2),
            value_states.transpose(1, 2),
            is_causal=True,
            dropout_p=0.0,
            enable_gqa=True,
        )
        attn_output = attn_output.transpose(1, 2).reshape(bs, seq_len, self.hidden_size)
        return self.o_proj(attn_output)

class LlamaMLP(nn.Module):
    """Feed-forward network with SwiGLU activation."""
    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.act_fn = F.silu
        self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)

    def forward(self, x: Tensor) -> Tensor:
        gate = self.act_fn(self.gate_proj(x))
        up = self.up_proj(x)
        return self.down_proj(gate * up)

class LlamaDecoderLayer(nn.Module):
    """Single transformer layer for a Llama model."""
    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.input_layernorm = nn.RMSNorm(config.hidden_size, eps=1e-5)
        self.self_attn = LlamaAttention(config)
        self.post_attention_layernorm = nn.RMSNorm(config.hidden_size, eps=1e-5)
        self.mlp = LlamaMLP(config)

    def forward(self, hidden_states: Tensor, rope: RotaryPositionEncoding) -> Tensor:
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        attn_outputs = self.self_attn(hidden_states, rope=rope)
        hidden_states = attn_outputs + residual
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        return self.mlp(hidden_states) + residual

class LlamaModel(nn.Module):
    """The full Llama model without any pretraining heads."""
    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.rotary_emb = RotaryPositionEncoding(
            config.hidden_size // config.num_attention_heads,
            config.max_position_embeddings,
        )
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleList([
            LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)
        ])
        self.norm = nn.RMSNorm(config.hidden_size, eps=1e-5)

    def forward(self, input_ids: Tensor) -> Tensor:
        hidden_states = self.embed_tokens(input_ids)
        for layer in self.layers:
            hidden_states = layer(hidden_states, rope=self.rotary_emb)
        return self.norm(hidden_states)

class LlamaForPretraining(nn.Module):
    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.base_model = LlamaModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, input_ids: Tensor) -> Tensor:
        hidden_states = self.base_model(input_ids)
        return self.lm_head(hidden_states)

def apply_repetition_penalty(logits: Tensor, tokens: list[int], penalty: float) -> Tensor:
    """Apply repetition penalty to the logits."""
    for tok in tokens:
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

@torch.no_grad()
def generate(model, tokenizer, prompt, max_tokens=100, temperature=1.0,
             repetition_penalty=1.0, repetition_penalty_range=10, top_k=50,
             device=None) -> str:
    """Generate text autoregressively from a prompt.

    Args:
        model: The trained LlamaForPretraining model
        tokenizer: The tokenizer
        prompt: Input text prompt
        max_tokens: Maximum number of tokens to generate
        temperature: Sampling temperature (higher = more random)
        repetition_penalty: Penalty for repeating tokens
        repetition_penalty_range: Number of previous tokens to consider for repetition penalty
        top_k: Only sample from top k most likely tokens
        device: Device the model is loaded on

    Returns:
        Generated text
    """
    # Turn model to evaluation mode: Norm layer will work differently
    model.eval()

    # Get special token IDs
    bot_id = tokenizer.token_to_id("[BOT]")
    eot_id = tokenizer.token_to_id("[EOT]")

    # Tokenize the prompt into integer tensor
    prompt_tokens = [bot_id] + tokenizer.encode(" " + prompt).ids
    input_ids = torch.tensor([prompt_tokens], dtype=torch.int64, device=device)

    # Recursively generate tokens
    generated_tokens = []
    for _step in range(max_tokens):
        # Forward pass through model
        logits = model(input_ids)
        # Get logits for the last token
        next_token_logits = logits[0, -1, :] / temperature
        # Apply repetition penalty
        if repetition_penalty != 1.0 and len(generated_tokens) > 0:
            next_token_logits = apply_repetition_penalty(
                next_token_logits,
                generated_tokens[-repetition_penalty_range:],
                repetition_penalty,
            )
        # Apply top-k filtering
        if top_k > 0:
            top_k_logits = torch.topk(next_token_logits, top_k)[0]
            indices_to_remove = next_token_logits < top_k_logits[-1]
            next_token_logits[indices_to_remove] = float("-inf")
        # Sample from the filtered distribution
        probs = F.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        # Early stop if EOT token is generated
        if next_token.item() == eot_id:
            break
        # Append the new token to input_ids for next iteration
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=1)
        generated_tokens.append(next_token.item())

    # Decode all generated tokens
    return tokenizer.decode(generated_tokens)


checkpoint = "llama_model_final.pth"  # saved model checkpoint
tokenizer = "bpe_50K.json"            # saved tokenizer
max_tokens = 100
temperature = 0.9
top_k = 50
penalty = 1.1
penalty_range = 10

# Load tokenizer and model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = tokenizers.Tokenizer.from_file(tokenizer)
config = LlamaConfig()
model = LlamaForPretraining(config).to(device)
model.load_state_dict(torch.load(checkpoint, map_location=device))

prompt = "Once upon a time, there was"
response = generate(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    max_tokens=max_tokens,
    temperature=temperature,
    top_k=top_k,
    repetition_penalty=penalty,
    repetition_penalty_range=penalty_range,
    device=device,
)
print(prompt)
print("-" * 20)
print(response)
```
If you use the model you trained exactly as described in the previous article, the results will likely be poor. As a sentence-completion model, it may even produce gibberish. This is because you trained a very small model on a very small dataset (merely 10 billion tokens), while any model of reasonable performance is trained on a trillion or more tokens.
You can modify the code above to use a model from the Hugging Face Hub. For example, you can use TinyLlama v1.1 or the Llama 3.2 1B model, both compact models of roughly 1 billion parameters. The modified script is as follows:
```python
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModelForCausalLM, AutoTokenizer


def apply_repetition_penalty(logits: Tensor, tokens: list[int], penalty: float) -> Tensor:
    """Apply repetition penalty to the logits."""
    for tok in tokens:
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

@torch.no_grad()
def generate(model, tokenizer, prompt, max_tokens=100, temperature=1.0,
             repetition_penalty=1.0, repetition_penalty_range=10, top_k=50,
             device=None) -> str:
    """Generate text autoregressively from a prompt.

    Args:
        model: The Hugging Face model for causal language modeling
        tokenizer: The Hugging Face tokenizer
        prompt: Input text prompt
        max_tokens: Maximum number of tokens to generate
        temperature: Sampling temperature (higher = more random)
        repetition_penalty: Penalty for repeating tokens
        repetition_penalty_range: Number of previous tokens to consider for repetition penalty
        top_k: Only sample from top k most likely tokens
        device: Device the model is loaded on

    Returns:
        Generated text
    """
    # Turn model to evaluation mode
    model.eval()

    # Get EOT token ID
    eot_id = tokenizer.eos_token_id

    # Tokenize the prompt into integer tensor
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    # Recursively generate tokens
    generated_tokens = []
    for _step in range(max_tokens):
        # Forward pass through model from `transformers` library
        logits = model(input_ids).logits
        # Get logits for the last token
        next_token_logits = logits[0, -1, :] / temperature
        # Apply repetition penalty
        if repetition_penalty != 1.0 and len(generated_tokens) > 0:
            next_token_logits = apply_repetition_penalty(
                next_token_logits,
                generated_tokens[-repetition_penalty_range:],
                repetition_penalty,
            )
        # Apply top-k filtering
        if top_k > 0:
            top_k_logits = torch.topk(next_token_logits, top_k)[0]
            indices_to_remove = next_token_logits < top_k_logits[-1]
            next_token_logits[indices_to_remove] = float("-inf")
        # Sample from the filtered distribution
        probs = F.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        # Early stop if EOT token is generated
        if next_token.item() == eot_id:
            break
        # Append the new token to input_ids for next iteration
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=1)
        generated_tokens.append(next_token.item())

    # Decode all generated tokens
    return tokenizer.decode(generated_tokens, skip_special_tokens=True)


model_id = "meta-llama/Llama-3.2-1B"  # or "TinyLlama/TinyLlama_v1.1"
max_tokens = 100
temperature = 0.9
top_k = 50
penalty = 1.1
penalty_range = 10

# Load tokenizer and model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto" if torch.cuda.is_available() else None,
)

prompt = "Once upon a time, there was"
response = generate(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    max_tokens=max_tokens,
    temperature=temperature,
    top_k=top_k,
    repetition_penalty=penalty,
    repetition_penalty_range=penalty_range,
    device=device,
)
print(prompt)
print("-" * 20)
print(response)
```
When you run this script, you may get a result like this:
```
Once upon a time, there was
--------------------
a little boy whose name was Thomas. Thomas's mother had died when he was a little boy, and his father had died even sooner. Thomas found this out when his father was driving a truck. Thomas's grandfather had taught him how to drive by watching TV. His grandfather took the truck out to a remote part of the forest. And then he started to tell Thomas, "Now sit here and watch. Everything about this." And then he stopped talking, put on the radio and said,"Now
```
The result is not bad. Let’s try a more sophisticated prompt:
```python
...

# prompt = "Once upon a time, there was a"  # replaced with the following
prompt = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Find the verb in the following sentence.

### Input:
John and Mary walk across the street to the bookstore.

### Response:
"""
```
You will find that the same model now gives you a result that is not quite right:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Find the verb in the following sentence.

### Input:
John and Mary walk across the street to the bookstore.

### Response:
--------------------
Answer the following questions about 12th grade.
1. When did you start working at 12th grade?
2. On which day did you retire from 12th grade?
3. What is the most exciting thing you've seen recently?
4. On which day did you go fishing with your family?
5. What can you get from 12th grade?
6. On what day are you the most excited right now?
7. Was that time when you were 12th
```
But if you replace the model with an "instruct" model, such as below:
```python
...

# model_id = "meta-llama/Llama-3.2-1B"  # replaced with the following
model_id = "meta-llama/Llama-3.2-1B-Instruct"
```
You will get:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Find the verb in the following sentence.

### Input:
John and Mary walk across the street to the bookstore.

### Response:
--------------------
The verb in the sentence is "walk."
```
The answer is correct: It sounds like the model understands what you want and gives you a sensible answer.
Why is there a difference? A pretrained model is a language model that learns about the structure of a language. However, a model trained on a large corpus of random text is unlikely to know that any instruction you provide means it has to act on it. Instruction fine-tuning provides many examples of such instructions and trains the model again, so that, when you give it an instruction, the model should act on it. Similarly, there is conversation fine-tuning to create a chat model. When you provide a chat history to train the model, it learns how to respond to the user’s message.
Dataset for Fine-tuning
The success of fine-tuning a model largely depends on the dataset you use. Depending on the fine-tuning goal, you need to select and prepare the dataset accordingly. Some common examples are:
- Fine-tuning on text completion and reasoning: This is the simplest use case. It simply means using a different corpus to retrain the model. For example, further training Llama 2 on a code-specific dataset yields the Code Llama model, which can generate code and natural-language descriptions of code.
- Fine-tuning on instruction following: This is the most common use case. Use an instruction-following dataset to fine-tune a pretrained model. This enables the model to handle single-turn directives such as “write”, “list”, and “explain”. The dataset contains samples of instruction-desired output pairs. Models fine-tuned on instruction-following datasets are well-suited to serve as task agents.
- Fine-tuning on chat: This creates a chat model that can generate natural, engaging multi-turn dialogue. The dataset contains samples of conversation history. The model will learn the back-and-forth exchanges while maintaining a persona. The fine-tuned models are chatbots and virtual assistants, such as the one behind ChatGPT.
Numerous datasets are available for fine-tuning. You can search for
them on Hugging Face Hub. One example for instruction-following is tatsu-lab/alpaca, which contains 52K samples of instruction-desired output pairs. One example of a chat dataset is the HuggingFaceH4/ultrachat_200k dataset, which contains 200K samples of chat history.
Below, you will see how to run fine-tuning on these datasets.
Fine-tuning Procedure
From a training-script perspective, fine-tuning a model is no different from pretraining it. The same auto-regressive model architecture is used, and the loss function remains cross-entropy between the output logits and the target tokens. The only difference lies in the dataset used to train the model. It usually fits a specific template, and there are far fewer samples than in the pretraining dataset. Because of the smaller dataset, the fine-tuning process is much shorter, too.
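This shared loss computation can be sketched in a few lines: the targets are the input tokens shifted left by one position, and cross-entropy is taken over the vocabulary dimension. Toy sizes and random logits stand in for a real model here:

```python
import torch
import torch.nn as nn

vocab_size, seq_len = 10, 5
tokens = torch.randint(0, vocab_size, (1, seq_len + 1))  # one tokenized training sample
x, y = tokens[:, :-1], tokens[:, 1:]                     # inputs and next-token targets
logits = torch.randn(1, seq_len, vocab_size)             # stand-in for model(x)
loss = nn.CrossEntropyLoss()(logits.view(-1, vocab_size), y.view(-1))
```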
Let’s fine-tune an instruction-following model. The template to use is as follows:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}
```
This is the one used in the example above. You are expected to provide a dataset with instructions, input, and output. Then, you substitute them into the template above to create a prompt for the model. The model will learn not only to understand your instruction, but also that this template indicates it should act on the instruction and produce the response.
Take the dataset tatsu-lab/alpaca as an example. You can create samples as follows:
```python
from datasets import load_dataset

ds = load_dataset("tatsu-lab/alpaca", split="train")
print(ds)
print("---")
print(ds[0])
print("---")

template = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}
"""

samples = []
for sample in ds:
    samples.append(template.format(
        instruction=sample["instruction"],
        input=sample["input"],
        output=sample["output"])
    )
    break  # only build the first sample for demonstration
print(samples[0])
```
When you run this code, you will get:
```
Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 52002
})
---
{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}
---

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Input:


### Response:
1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.
```
As shown, the tatsu-lab/alpaca dataset contains 52,002 samples in the train split (this dataset has only one split). Each sample has four columns: instruction, input, output, and text. You used instruction, input, and output
from each sample to fill in the template and generate a complete
prompt-response string. This is what you will use to fine-tune your
model.
Knowing how to create samples from a dataset, you can fine-tune your model with the following code:
```python
import datasets
import torch
import torch.nn as nn
import torch.optim.lr_scheduler as lr_scheduler
import tqdm
from datasets import load_dataset
from torch import Tensor
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load training dataset
ds = load_dataset("tatsu-lab/alpaca", split="train")
template = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}
""".strip()

class FineTuningDataset(torch.utils.data.Dataset):
    def __init__(self, dataset: datasets.Dataset, tokenizer: AutoTokenizer,
                 seq_length: int, device: torch.device = None):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.device = device
        self.seq_length = seq_length
        self.bot = torch.tensor([tokenizer.bos_token_id], dtype=torch.int64, device=device)
        self.eot = torch.tensor([tokenizer.eos_token_id], dtype=torch.int64, device=device)
        self.pad = torch.tensor([tokenizer.pad_token_id], dtype=torch.int64, device=device)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index: int):
        """Get a sequence of token ids from the dataset.

        [BOT] and [EOT] tokens are added. Clipped and padded to the sequence length.
        """
        seq = template.format(
            instruction=self.dataset[index]["instruction"],
            input=self.dataset[index]["input"],
            output=self.dataset[index]["output"],
        )
        tokens: Tensor = self.tokenizer(seq, return_tensors="pt", add_special_tokens=False).input_ids[0]
        tokens = torch.cat([self.bot, tokens.to(self.device), self.eot], dim=0)
        # pad to target sequence length
        toklen = len(tokens)
        if toklen < self.seq_length+1:
            pad_length = self.seq_length+1 - toklen
            tokens = torch.cat([tokens, self.pad.repeat(pad_length)], dim=0)
        # return the sequence
        x = tokens[:self.seq_length]
        y = tokens[1:self.seq_length+1]
        return x, y

# Load model and tokenizer
model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map=device,
)
model.train()

# Training parameters
epochs = 2
learning_rate = 1e-4
batch_size = 4
seq_length = 1024
num_warmup_steps = 1000
PAD_TOKEN_ID = tokenizer.pad_token_id

# DataLoader, optimizer, scheduler, and loss function
dataset = FineTuningDataset(ds, tokenizer, seq_length=seq_length, device=device)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=learning_rate,
    betas=(0.9, 0.99),
    eps=1e-8,
    weight_decay=0.1,
)
num_training_steps = len(dataloader) * epochs
scheduler = lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=num_training_steps,
    eta_min=0,
)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_TOKEN_ID)

# start training
for epoch in range(epochs):
    pbar = tqdm.tqdm(dataloader, desc=f"Epoch {epoch+1}/{epochs}")
    for batch in pbar:
        # get batched data
        input_ids, target_ids = batch
        # create attention mask for Llama model from transformers
        attn_mask = input_ids.ne(PAD_TOKEN_ID)
        # extract output from model
        logits = model(input_ids, attention_mask=attn_mask).logits
        # compute loss: cross-entropy between logits and target, ignoring padding tokens
        loss = loss_fn(logits.view(-1, logits.size(-1)), target_ids.view(-1))
        # backward with loss and gradient clipping by L2 norm to 1.0
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        pbar.set_postfix(loss=loss.item())
        pbar.update(1)
    pbar.close()

# Save the model
torch.save(model.state_dict(), "model-instruct.pth")
```
This code fine-tunes the Llama 3.2 1B model from Hugging Face Hub (model name: meta-llama/Llama-3.2-1B) with the tatsu-lab/alpaca
dataset. Because the dataset is smaller, the training process is much
shorter than pretraining. You can revise the code to fine-tune with your
own model, such as the one you trained in the previous article.
In particular, you need to change how the model and tokenizer are
initialized and how the model processes the BOT and EOT tokens. The
Llama 3.2 1B model is used because it is larger and more capable than
the one you created in the previous article. It should be easier to see the effect of fine-tuning.
Compared with pretraining, the dataset is created using a template rather than the text itself. The loss is still cross-entropy. However, you use a lower learning rate in fine-tuning because the model is already trained to understand the language, and you do not want to destroy this capability. For a similar reason, you usually do not need warmup steps in the learning rate scheduler. Simply decrease the learning rate steadily from the start to the end of training.
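A minimal sketch of such a warmup-free schedule, assuming PyTorch's CosineAnnealingLR (the toy parameter and step counts are illustrative):

```python
import torch
import torch.optim.lr_scheduler as lr_scheduler

param = torch.nn.Parameter(torch.zeros(1))      # toy parameter standing in for a model
optimizer = torch.optim.AdamW([param], lr=1e-4)
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=0)

lrs = []
for _ in range(100):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()      # a real training step would compute a loss first
    scheduler.step()
# the learning rate starts at its peak and decays monotonically toward zero,
# with no warmup phase at the beginning
```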
Although the training loop closely resembles self-supervised pretraining, the model now learns from curated input-output examples, so this process is called supervised fine-tuning (SFT).
Let’s see another example: This time, you will fine-tune a chat model. The dataset to use is HuggingFaceH4/ultrachat_200k. Below shows what the dataset looks like:
```python
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/ultrachat_200k")
print(ds)
print("---")
print(ds["train_sft"][0])
print("---")

template = "<|{role}|>\n{content}\n"
samples = []
for sample in ds["train_sft"]:
    prompt = ""
    for message in sample["messages"]:
        prompt += template.format(
            role=message["role"],
            content=message["content"],
        )
    samples.append(prompt)
print(samples[0])
```
When you run this code, you will get:
```
DatasetDict({
    train_sft: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 207865
    })
    test_sft: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 23110
    })
    train_gen: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 256032
    })
    test_gen: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 28304
    })
})
---
{'prompt': "These instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+ ... me developer for assistance with customizing your Quick Shop section to include this feature.", 'role': 'assistant'}]}
---
<|user|>
These instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?
On your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!
Your Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.
Does this feature apply to all sections of the theme or just specific ones as listed in the text material?
<|assistant|>
This feature only applies to Collection pages and Featured Collections sections of the section-based themes listed in the text material.
<|user|>
Can you guide me through the process of enabling the secondary image hover feature on my Collection pages and Featured Collections sections?
<|assistant|>
Sure, here are the steps to enable the secondary image hover feature on your Collection pages and Featured Collections sections:
...
```
Unlike the instruction dataset, which has one instruction and
one response per sample, this dataset contains multiple messages per
sample, with the roles “user” and “assistant” alternating. The template
you use should mark the roles with <|user|> and <|assistant|> respectively. The prompt you create may be lengthy, as each sample includes multiple back-and-forth exchanges.
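To make the template concrete, here is a minimal sketch of how one short, made-up two-turn exchange renders under it (the messages are hypothetical, not from the dataset):

```python
# Render a short, made-up two-turn exchange with the same template
template = "<|{role}|>\n{content}\n"

messages = [
    {"role": "user", "content": "What is fine-tuning?"},
    {"role": "assistant", "content": "Additional training after pretraining."},
]

prompt = "".join(
    template.format(role=m["role"], content=m["content"]) for m in messages
)
print(prompt)
```

Each message becomes a role marker line followed by its content, and a real sample simply repeats this pattern for every turn in the conversation.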
To fine-tune the model with this dataset, you can modify how the samples are constructed in the code above:
import datasets
import torch
import torch.nn as nn
import torch.optim.lr_scheduler as lr_scheduler
import tqdm
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load training dataset
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
template = "<|{role}|>\n{content}\n"

class FineTuningDataset(torch.utils.data.Dataset):
    def __init__(self, dataset: datasets.Dataset, tokenizer: AutoTokenizer,
                 seq_length: int, device: torch.device = None):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.device = device
        self.seq_length = seq_length
        self.bot = torch.tensor([tokenizer.bos_token_id], dtype=torch.int64, device=device)
        self.eot = torch.tensor([tokenizer.eos_token_id], dtype=torch.int64, device=device)
        self.pad = torch.tensor([tokenizer.pad_token_id], dtype=torch.int64, device=device)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index: int):
        """Get a sequence of token ids from the dataset.

        [BOT] and [EOT] tokens are added. Clipped and padded to the sequence length.
        """
        tokens = self.bot
        for message in self.dataset[index]["messages"]:
            seq = template.format(
                role=message["role"],
                content=message["content"],
            )
            seq = self.tokenizer(seq, return_tensors="pt", add_special_tokens=False).input_ids[0]
            tokens = torch.cat([tokens, seq.to(self.device), self.eot], dim=0)
        mask = torch.ones_like(tokens)
        # pad to target sequence length
        toklen = len(tokens)
        if toklen < self.seq_length+1:
            pad_length = self.seq_length+1 - toklen
            tokens = torch.cat([tokens, self.pad.repeat(pad_length)], dim=0)
            mask = torch.cat([mask, torch.zeros(pad_length).to(self.device)], dim=0)
        # return the sequence
        x = tokens[:self.seq_length]
        y = tokens[1:self.seq_length+1]
        mask = mask[:self.seq_length]
        return x, y, mask

# Load model and tokenizer
model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map=device,
)
model.train()

# Training parameters
epochs = 2
learning_rate = 1e-4
batch_size = 4
seq_length = 1024
num_warmup_steps = 1000
PAD_TOKEN_ID = tokenizer.pad_token_id

# DataLoader, optimizer, scheduler, and loss function
dataset = FineTuningDataset(ds, tokenizer, seq_length=seq_length, device=device)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=learning_rate,
    betas=(0.9, 0.99),
    eps=1e-8,
    weight_decay=0.1,
)
num_training_steps = len(dataloader) * epochs
scheduler = lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=num_training_steps,
    eta_min=0,
)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_TOKEN_ID)

# start training
for epoch in range(epochs):
    pbar = tqdm.tqdm(dataloader, desc=f"Epoch {epoch+1}/{epochs}")
    for batch in pbar:
        # get batched data with attention mask
        input_ids, target_ids, attn_mask = batch
        # extract output from model
        logits = model(input_ids, attention_mask=attn_mask).logits
        # compute loss: cross-entropy between logits and target, ignoring padding tokens
        loss = loss_fn(logits.view(-1, logits.size(-1)), target_ids.view(-1))
        # backward with loss and gradient clipping by L2 norm to 1.0
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        pbar.set_postfix(loss=loss.item())
        pbar.update(1)
    pbar.close()

# Save the model
torch.save(model.state_dict(), "model-chat.pth")
There are a few differences from the instruction fine-tuning code above. First, you need a for-loop in the __getitem__() method of the FineTuningDataset class, since each sample contains multiple exchanges of dialog.
Second, the BOT token marks the beginning of the prompt, and EOT tokens are inserted between messages to mark their end. This is important because, when you use the fine-tuned model for chat, you expect it to generate an EOT token to mark the end of the response. Since the EOT token also serves as the padding token here, you need an attention mask to indicate explicitly where the padding occurs.
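As a minimal sketch of how the input, target, and attention mask line up, here is the same pad-and-shift logic with made-up token ids and plain Python lists (no tensors, and the ids are hypothetical):

```python
# Hypothetical token ids: 1 = BOT, 2 = EOT (which also serves as the pad id)
BOT, EOT, PAD = 1, 2, 2
seq_length = 8

tokens = [BOT, 10, 11, EOT, 12, EOT]   # a short dialog, already tokenized
mask = [1] * len(tokens)               # 1 = real token, 0 = padding

# Pad to seq_length + 1 so the targets can be shifted by one position
pad_length = seq_length + 1 - len(tokens)
tokens += [PAD] * pad_length
mask += [0] * pad_length

x = tokens[:seq_length]        # model input
y = tokens[1:seq_length + 1]   # next-token targets
attn = mask[:seq_length]       # marks which positions are real

print(x)     # [1, 10, 11, 2, 12, 2, 2, 2]
print(y)     # [10, 11, 2, 12, 2, 2, 2, 2]
print(attn)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

Because PAD and EOT share the same id, the attention mask is the only way the model can tell a genuine end-of-turn token from trailing padding, which is why the mask is passed to the model alongside the input ids.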
That’s all you need to fine-tune a chat model. When you use the fine-tuned model for chat, you can reuse the same generate() function as above to generate the response. However, you need to construct the prompt with the correct pattern:
...

# Create a chat loop
chat_history = ""
template = "<|{role}|>\n{content}\n"
while True:
    # Read user input
    prompt = input("User: ").strip()
    if prompt.lower() in ["quit", "exit", "q"]:
        print("Goodbye!")
        break
    chat_history += template.format(role="user", content=prompt)
    # Produce chatbot response
    chat_history += "<|assistant|>\n"
    response = generate(
        model=model,
        tokenizer=tokenizer,
        prompt=chat_history,
        max_tokens=max_tokens,
        temperature=temperature,
        top_k=top_k,
        repetition_penalty=penalty,
        repetition_penalty_range=penalty_range,
        device=device,
    )
    # Print the response and add to the chat history
    print("-" * 20)
    print("Assistant:")
    print(response)
    print("-" * 20)
    chat_history += response.strip() + "\n"
Other Fine-Tuning Techniques
This is the simplest way to fine-tune a model. If the fine-tuned model is not performing as expected, you may consider employing advanced techniques to improve the results. Performance here does not mean the speed at which the model processes the prompt and produces the response, since inference throughput can be boosted with engineering techniques. Rather, it means how well the model's responses align with our expectations.
RLHF (reinforcement learning from human feedback) is a technique for fine-tuning a model in which human feedback on the model's responses is used to guide them toward human expectations. Fine-tuning with RLHF is a reinforcement learning process: instead of a simple loss function, you define a reward function that measures how good the model's response is, and the reward is then used to steer the model's output. The most common algorithm for RLHF is Proximal Policy Optimization (PPO). Another algorithm, Direct Preference Optimization (DPO), claims to be more efficient.
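To make the preference idea concrete, here is a rough sketch of the DPO loss for a single preference pair, using made-up log-probabilities rather than a real model (the numbers and the helper function are illustrative only, not a training setup):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    The loss shrinks when the policy raises the likelihood of the preferred
    response relative to the frozen reference model, and lowers it for the
    rejected one.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi(y_w)/pi_ref(y_w)
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi(y_l)/pi_ref(y_l)
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(margin): smaller is better
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up numbers: the policy already slightly prefers the chosen response
loss = dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
                ref_logp_chosen=-13.0, ref_logp_rejected=-14.0)
print(f"{loss:.4f}")  # about 0.5981
```

Unlike PPO, this objective needs no separate reward model or sampling loop: the preference data itself plays the role of the reward signal, which is where DPO's efficiency claim comes from.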
In addition to the fine-tuning described above, you can choose not to update the entire model and instead train a small set of supplementary parameters. This is called parameter-efficient fine-tuning (PEFT). PEFT is based on the observation that updating fewer parameters is faster and requires less memory. The most notable technique in PEFT is LoRA (Low-Rank Adaptation). It keeps the original model's weights frozen. Then, a small number of trainable parameters, called adapters, are added to the output of selected layers of the model. During fine-tuning, the same training loop is used, but only the adapters are updated. It is called low-rank adaptation because, instead of the large weight matrix in the original model, it trains a pair of much smaller matrices whose product approximates the weight update.
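A minimal sketch of the low-rank idea, with tiny hypothetical shapes and plain Python instead of a real PyTorch module: the frozen weight W is d_out × d_in, while the trainable adapter is two small matrices B (d_out × r) and A (r × d_in) with rank r much smaller than either dimension.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

d_out, d_in, r = 3, 4, 1  # rank r is much smaller than d_out and d_in

# Frozen pretrained weight (hypothetical values), never updated
W = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0]]

# Trainable low-rank adapter: only d_out*r + r*d_in = 7 parameters
# instead of the d_out*d_in = 12 in W
B = [[0.5], [0.0], [0.0]]   # d_out x r
A = [[1, 1, 0, 0]]          # r x d_in

x = [1.0, 2.0, 3.0, 4.0]

# LoRA forward pass: y = W x + B (A x)
h = matvec(A, x)      # project into the r-dimensional bottleneck
delta = matvec(B, h)  # expand back: the low-rank update to W x
y = [wx + d for wx, d in zip(matvec(W, x), delta)]
print(y)  # [2.5, 2.0, 3.0]
```

Only B and A receive gradients during fine-tuning, and after training the product BA can even be merged into W so that inference costs nothing extra.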
These are advanced fine-tuning techniques that typically require additional libraries to run the training. These will not be covered in this article.
Further Readings
Below are some resources that you may find useful:
- Jay Alammar & Maarten Grootendorst, Hands-On Large Language Models, O’Reilly 2024
- Sebastian Raschka, Build a Large Language Model (From Scratch), Manning Publications 2024
- Illustrating Reinforcement Learning from Human Feedback (RLHF), blog post by Hugging Face
- LLM Research Insights: Instruction Masking and New LoRA Finetuning Experiments, blog post by Sebastian Raschka
- Ouyang et al. (2022) Training language models to follow instructions with human feedback
- Hu et al. (2021) LoRA: Low-Rank Adaptation of Large Language Models
- Rafailov et al. (2024) Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- Schulman et al. (2017) Proximal Policy Optimization Algorithms
Summary
In this article, you learned how to fine-tune a language model. In particular, you learned that:
- You can fine-tune a model on a new corpus, as in self-supervised pretraining, to enhance its understanding of a specific domain.
- You can fine-tune a model to follow instructions or to chat with a persona using an instruction-following dataset or a chat dataset, respectively, with an appropriate template.
- Fine-tuning a model is usually faster because the pretrained model already understands the language, and you can use a much smaller dataset.
- There are advanced fine-tuning techniques to improve model performance, such as RLHF and PEFT.
