After pretraining, a language model has learned the structure of human language. You can enhance the model's domain-specific understanding by training it on additional data, and you can also train it to perform specific tasks when given a specific instruction. This additional training after pretraining is called fine-tuning. In this article, you will learn how to fine-tune a language model. Specifically, you will learn:
- Different examples of fine-tuning and what their goals are
- How to convert a pretraining script to perform fine-tuning
Let’s get started!

A Gentle Introduction to Language Model Fine-tuning
Photo by Nick Night. Some rights reserved.
Overview
This article is divided into four parts; they are:
- The Reason for Fine-tuning a Model
- Dataset for Fine-tuning
- Fine-tuning Procedure
- Other Fine-Tuning Techniques
The Reason for Fine-tuning a Model
Once you train your decoder-only transformer model, you have a text generator. You can provide any prompt, and the model will generate some text. What it generates depends on the model you have.
Let’s consider a very simple generation algorithm:
```python
...

def apply_repetition_penalty(logits: Tensor, tokens: list[int], penalty: float) -> Tensor:
    """Apply repetition penalty to the logits."""
    for tok in tokens:
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

@torch.no_grad()
def generate(model, tokenizer, prompt, max_tokens=100, temperature=1.0,
             repetition_penalty=1.0, repetition_penalty_range=10, top_k=50,
             device=None) -> str:
    """Generate text autoregressively from a prompt.

    Args:
        model: The trained LlamaForPretraining model
        tokenizer: The tokenizer
        prompt: Input text prompt
        max_tokens: Maximum number of tokens to generate
        temperature: Sampling temperature (higher = more random)
        repetition_penalty: Penalty for repeating tokens
        repetition_penalty_range: Number of previous tokens to consider for repetition penalty
        top_k: Only sample from top k most likely tokens
        device: Device the model is loaded on

    Returns:
        Generated text
    """
    # Turn model to evaluation mode: Norm layer will work differently
    model.eval()

    # Get special token IDs
    bot_id = tokenizer.token_to_id("[BOT]")
    eot_id = tokenizer.token_to_id("[EOT]")

    # Tokenize the prompt into integer tensor
    prompt_tokens = [bot_id] + tokenizer.encode(" " + prompt).ids
    input_ids = torch.tensor(prompt_tokens, dtype=torch.int64, device=device).unsqueeze(0)

    # Recursively generate tokens
    generated_tokens = []
    for _step in range(max_tokens):
        # Forward pass through model
        logits = model(input_ids)
        # Get logits for the last token
        next_token_logits = logits[0, -1, :] / temperature
        # Apply repetition penalty
        if repetition_penalty != 1.0 and len(generated_tokens) > 0:
            next_token_logits = apply_repetition_penalty(
                next_token_logits,
                generated_tokens[-repetition_penalty_range:],
                repetition_penalty,
            )
        # Apply top-k filtering
        if top_k > 0:
            top_k_logits = torch.topk(next_token_logits, top_k)[0]
            indices_to_remove = next_token_logits < top_k_logits[-1]
            next_token_logits[indices_to_remove] = float("-inf")
        # Sample from the filtered distribution
        probs = F.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        # Early stop if EOT token is generated
        if next_token.item() == eot_id:
            break
        # Append the new token to input_ids for next iteration
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=1)
        generated_tokens.append(next_token.item())

    # Decode all generated tokens
    return tokenizer.decode(generated_tokens)
```
The function generate() above is an inefficient yet simple sampling-based text-generation method. Your model takes a prompt and produces a tensor of logits for the next token. They are called logits because they are the unnormalized log-probabilities of the next token. The model works on tokens. To generate a token, a few steps are applied to the logits:
- Scale the logits by a temperature parameter. This skews the probabilities of the next token to pick.
- Manipulate the logits. In the above, you applied a repetition penalty to discourage tokens that already exist in the generated sequence. You also applied top-k filtering to limit the choice to the k most likely tokens.
- Convert the logits to probabilities, then use a multinomial sampling algorithm to select the next token.
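In isolation, the three steps look like this. This is a minimal sketch with hypothetical logits; `sample_next_token` is an illustrative helper, not part of the script above:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0, top_k: int = 0) -> int:
    """Temperature-scale the logits, keep only the top-k, then sample one token id."""
    logits = logits / temperature                        # step 1: temperature scaling
    if top_k > 0:                                        # step 2: top-k filtering
        kth_best = torch.topk(logits, top_k).values[-1]
        logits = logits.clone()
        logits[logits < kth_best] = float("-inf")
    probs = F.softmax(logits, dim=-1)                    # step 3: logits -> probabilities
    return torch.multinomial(probs, num_samples=1).item()

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # hypothetical next-token logits
token = sample_next_token(logits, temperature=0.8, top_k=2)
# only token ids 0 and 1 survive the top-2 filter, so token is either 0 or 1
```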
You can make this simpler by always using torch.argmax() to pick the next token. This is called greedy decoding. It is generally not preferred because the output does not sound natural and permits no variation.
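For contrast, greedy decoding collapses the whole sampling pipeline to a single argmax (hypothetical logits for illustration):

```python
import torch

logits = torch.tensor([0.1, 2.5, -0.3, 1.0])  # hypothetical next-token logits
next_token = torch.argmax(logits).item()      # always picks the most likely token
# deterministic: token 1 is selected every time for these logits
```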
You can try the generate() function with your own model trained in the previous article. Below is the complete code to generate text from a simple prompt:
```python
import dataclasses

import tokenizers
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor


# Model architecture same as training script

@dataclasses.dataclass
class LlamaConfig:
    """Define Llama model hyperparameters."""
    vocab_size: int = 50000
    max_position_embeddings: int = 2048
    hidden_size: int = 768
    intermediate_size: int = 4*768
    num_hidden_layers: int = 12
    num_attention_heads: int = 12
    num_key_value_heads: int = 3

class RotaryPositionEncoding(nn.Module):
    """Rotary position encoding."""
    def __init__(self, dim: int, max_position_embeddings: int) -> None:
        super().__init__()
        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        N = 10_000.0
        inv_freq = 1.0 / (N ** (torch.arange(0, dim, 2) / dim))
        inv_freq = torch.cat((inv_freq, inv_freq), dim=-1)
        position = torch.arange(max_position_embeddings)
        sinusoid_inp = torch.outer(position, inv_freq)
        self.register_buffer("cos", sinusoid_inp.cos())
        self.register_buffer("sin", sinusoid_inp.sin())

    def forward(self, x: Tensor) -> Tensor:
        batch_size, seq_len, num_heads, head_dim = x.shape
        device = x.device
        dtype = x.dtype
        cos = self.cos.to(device, dtype)[:seq_len].view(1, seq_len, 1, -1)
        sin = self.sin.to(device, dtype)[:seq_len].view(1, seq_len, 1, -1)
        x1, x2 = x.chunk(2, dim=-1)
        rotated = torch.cat((-x2, x1), dim=-1)
        return (x * cos) + (rotated * sin)

class LlamaAttention(nn.Module):
    """Grouped-query attention with rotary embeddings."""
    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.num_kv_heads = config.num_key_value_heads
        assert (self.head_dim * self.num_heads) == self.hidden_size
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(self.hidden_size, self.num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(self.hidden_size, self.num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)

    def forward(self, hidden_states: Tensor, rope: RotaryPositionEncoding) -> Tensor:
        bs, seq_len, dim = hidden_states.size()
        query_states = self.q_proj(hidden_states).view(bs, seq_len, self.num_heads, self.head_dim)
        key_states = self.k_proj(hidden_states).view(bs, seq_len, self.num_kv_heads, self.head_dim)
        value_states = self.v_proj(hidden_states).view(bs, seq_len, self.num_kv_heads, self.head_dim)
        attn_output = F.scaled_dot_product_attention(
            rope(query_states).transpose(1, 2),
            rope(key_states).transpose(1, 2),
            value_states.transpose(1, 2),
            is_causal=True,
            dropout_p=0.0,
            enable_gqa=True,
        )
        attn_output = attn_output.transpose(1, 2).reshape(bs, seq_len, self.hidden_size)
        return self.o_proj(attn_output)

class LlamaMLP(nn.Module):
    """Feed-forward network with SwiGLU activation."""
    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.act_fn = F.silu
        self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)

    def forward(self, x: Tensor) -> Tensor:
        gate = self.act_fn(self.gate_proj(x))
        up = self.up_proj(x)
        return self.down_proj(gate * up)

class LlamaDecoderLayer(nn.Module):
    """Single transformer layer for a Llama model."""
    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.input_layernorm = nn.RMSNorm(config.hidden_size, eps=1e-5)
        self.self_attn = LlamaAttention(config)
        self.post_attention_layernorm = nn.RMSNorm(config.hidden_size, eps=1e-5)
        self.mlp = LlamaMLP(config)

    def forward(self, hidden_states: Tensor, rope: RotaryPositionEncoding) -> Tensor:
        residual = hidden_states
        hidden_states = self.input_layernorm(hidden_states)
        attn_outputs = self.self_attn(hidden_states, rope=rope)
        hidden_states = attn_outputs + residual
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        return self.mlp(hidden_states) + residual

class LlamaModel(nn.Module):
    """The full Llama model without any pretraining heads."""
    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.rotary_emb = RotaryPositionEncoding(
            config.hidden_size // config.num_attention_heads,
            config.max_position_embeddings,
        )
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleList([
            LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)
        ])
        self.norm = nn.RMSNorm(config.hidden_size, eps=1e-5)

    def forward(self, input_ids: Tensor) -> Tensor:
        hidden_states = self.embed_tokens(input_ids)
        for layer in self.layers:
            hidden_states = layer(hidden_states, rope=self.rotary_emb)
        return self.norm(hidden_states)

class LlamaForPretraining(nn.Module):
    def __init__(self, config: LlamaConfig) -> None:
        super().__init__()
        self.base_model = LlamaModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, input_ids: Tensor) -> Tensor:
        hidden_states = self.base_model(input_ids)
        return self.lm_head(hidden_states)

def apply_repetition_penalty(logits: Tensor, tokens: list[int], penalty: float) -> Tensor:
    """Apply repetition penalty to the logits."""
    for tok in tokens:
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

@torch.no_grad()
def generate(model, tokenizer, prompt, max_tokens=100, temperature=1.0,
             repetition_penalty=1.0, repetition_penalty_range=10, top_k=50,
             device=None) -> str:
    """Generate text autoregressively from a prompt.

    Args:
        model: The trained LlamaForPretraining model
        tokenizer: The tokenizer
        prompt: Input text prompt
        max_tokens: Maximum number of tokens to generate
        temperature: Sampling temperature (higher = more random)
        repetition_penalty: Penalty for repeating tokens
        repetition_penalty_range: Number of previous tokens to consider for repetition penalty
        top_k: Only sample from top k most likely tokens
        device: Device the model is loaded on

    Returns:
        Generated text
    """
    # Turn model to evaluation mode: Norm layer will work differently
    model.eval()

    # Get special token IDs
    bot_id = tokenizer.token_to_id("[BOT]")
    eot_id = tokenizer.token_to_id("[EOT]")

    # Tokenize the prompt into integer tensor
    prompt_tokens = [bot_id] + tokenizer.encode(" " + prompt).ids
    input_ids = torch.tensor([prompt_tokens], dtype=torch.int64, device=device)

    # Recursively generate tokens
    generated_tokens = []
    for _step in range(max_tokens):
        # Forward pass through model
        logits = model(input_ids)
        # Get logits for the last token
        next_token_logits = logits[0, -1, :] / temperature
        # Apply repetition penalty
        if repetition_penalty != 1.0 and len(generated_tokens) > 0:
            next_token_logits = apply_repetition_penalty(
                next_token_logits,
                generated_tokens[-repetition_penalty_range:],
                repetition_penalty,
            )
        # Apply top-k filtering
        if top_k > 0:
            top_k_logits = torch.topk(next_token_logits, top_k)[0]
            indices_to_remove = next_token_logits < top_k_logits[-1]
            next_token_logits[indices_to_remove] = float("-inf")
        # Sample from the filtered distribution
        probs = F.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        # Early stop if EOT token is generated
        if next_token.item() == eot_id:
            break
        # Append the new token to input_ids for next iteration
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=1)
        generated_tokens.append(next_token.item())

    # Decode all generated tokens
    return tokenizer.decode(generated_tokens)


checkpoint = "llama_model_final.pth"  # saved model checkpoint
tokenizer = "bpe_50K.json"            # saved tokenizer
max_tokens = 100
temperature = 0.9
top_k = 50
penalty = 1.1
penalty_range = 10

# Load tokenizer and model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = tokenizers.Tokenizer.from_file(tokenizer)
config = LlamaConfig()
model = LlamaForPretraining(config).to(device)
model.load_state_dict(torch.load(checkpoint, map_location=device))

prompt = "Once upon a time, there was"
response = generate(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    max_tokens=max_tokens,
    temperature=temperature,
    top_k=top_k,
    repetition_penalty=penalty,
    repetition_penalty_range=penalty_range,
    device=device,
)
print(prompt)
print("-" * 20)
print(response)
```
If you use the model you trained exactly as described in the previous article, the results will likely be poor. As a sentence-completion model, it may even produce gibberish. This is because you trained a very small model on a very small dataset (merely 10 billion tokens), while any model of reasonable performance is trained on a trillion or more tokens.
You can modify the code above to use a model from the Hugging Face Hub. For example, you can use TinyLlama v1.1 or the Llama 3.2 1B model, both compact models of roughly 1 billion parameters. The modified script is as follows:
```python
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModelForCausalLM, AutoTokenizer


def apply_repetition_penalty(logits: Tensor, tokens: list[int], penalty: float) -> Tensor:
    """Apply repetition penalty to the logits."""
    for tok in tokens:
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

@torch.no_grad()
def generate(model, tokenizer, prompt, max_tokens=100, temperature=1.0,
             repetition_penalty=1.0, repetition_penalty_range=10, top_k=50,
             device=None) -> str:
    """Generate text autoregressively from a prompt.

    Args:
        model: The Hugging Face model for causal language modeling
        tokenizer: The Hugging Face tokenizer
        prompt: Input text prompt
        max_tokens: Maximum number of tokens to generate
        temperature: Sampling temperature (higher = more random)
        repetition_penalty: Penalty for repeating tokens
        repetition_penalty_range: Number of previous tokens to consider for repetition penalty
        top_k: Only sample from top k most likely tokens
        device: Device the model is loaded on

    Returns:
        Generated text
    """
    # Turn model to evaluation mode
    model.eval()

    # Get EOT token ID
    eot_id = tokenizer.eos_token_id

    # Tokenize the prompt into integer tensor
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    # Recursively generate tokens
    generated_tokens = []
    for _step in range(max_tokens):
        # Forward pass through model from `transformers` library
        logits = model(input_ids).logits
        # Get logits for the last token
        next_token_logits = logits[0, -1, :] / temperature
        # Apply repetition penalty
        if repetition_penalty != 1.0 and len(generated_tokens) > 0:
            next_token_logits = apply_repetition_penalty(
                next_token_logits,
                generated_tokens[-repetition_penalty_range:],
                repetition_penalty,
            )
        # Apply top-k filtering
        if top_k > 0:
            top_k_logits = torch.topk(next_token_logits, top_k)[0]
            indices_to_remove = next_token_logits < top_k_logits[-1]
            next_token_logits[indices_to_remove] = float("-inf")
        # Sample from the filtered distribution
        probs = F.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        # Early stop if EOT token is generated
        if next_token.item() == eot_id:
            break
        # Append the new token to input_ids for next iteration
        input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=1)
        generated_tokens.append(next_token.item())

    # Decode all generated tokens
    return tokenizer.decode(generated_tokens, skip_special_tokens=True)


model_id = "meta-llama/Llama-3.2-1B"  # or "TinyLlama/TinyLlama_v1.1"
max_tokens = 100
temperature = 0.9
top_k = 50
penalty = 1.1
penalty_range = 10

# Load tokenizer and model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto" if torch.cuda.is_available() else None,
)

prompt = "Once upon a time, there was"
response = generate(
    model=model,
    tokenizer=tokenizer,
    prompt=prompt,
    max_tokens=max_tokens,
    temperature=temperature,
    top_k=top_k,
    repetition_penalty=penalty,
    repetition_penalty_range=penalty_range,
    device=device,
)
print(prompt)
print("-" * 20)
print(response)
```
When you run this script, you may get a result like this:
```
Once upon a time, there was
--------------------
a little boy whose name was Thomas. Thomas's mother had died when he was a little boy, and his father had died even sooner. Thomas found this out when his father was driving a truck. Thomas's grandfather had taught him how to drive by watching TV. His grandfather took the truck out to a remote part of the forest. And then he started to tell Thomas, "Now sit here and watch. Everything about this." And then he stopped talking, put on the radio and said,"Now
```
The result is not bad. Let’s try a more sophisticated prompt:
```python
...

# prompt = "Once upon a time, there was a"  # replaced with the following
prompt = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Find the verb in the following sentence.

### Input:
John and Mary walk across the street to the bookstore.

### Response:
"""
```
You will find that the same model now gives you a result that is not quite right:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Find the verb in the following sentence.

### Input:
John and Mary walk across the street to the bookstore.

### Response:
--------------------
Answer the following questions about 12th grade.
1. When did you start working at 12th grade?
2. On which day did you retire from 12th grade?
3. What is the most exciting thing you've seen recently?
4. On which day did you go fishing with your family?
5. What can you get from 12th grade?
6. On what day are you the most excited right now?
7. Was that time when you were 12th
```
But if you replace the model with an "instruct" model, such as below:
```python
...

# model_id = "meta-llama/Llama-3.2-1B"  # replaced with the following
model_id = "meta-llama/Llama-3.2-1B-Instruct"
```
You will get:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Find the verb in the following sentence.

### Input:
John and Mary walk across the street to the bookstore.

### Response:
--------------------
The verb in the sentence is "walk."
```
The answer is correct: It sounds like the model understands what you want and gives you a sensible answer.
Why is there a difference? A pretrained model is a language model that learns about the structure of a language. However, a model trained on a large corpus of random text is unlikely to know that any instruction you provide means it has to act on it. Instruction fine-tuning provides many examples of such instructions and trains the model again, so that, when you give it an instruction, the model should act on it. Similarly, there is conversation fine-tuning to create a chat model. When you provide a chat history to train the model, it learns how to respond to the user’s message.
Dataset for Fine-tuning
The success of fine-tuning a model largely depends on the dataset you use. Depending on the fine-tuning goal, you need to select and prepare the dataset accordingly. Some common examples are:
- Fine-tuning on text completion and reasoning: This is the simplest use case. It simply means using a different corpus to retrain the model. For example, further training Llama 2 on a code-specific dataset yields the Code Llama model, which can generate code and natural-language descriptions of code.
- Fine-tuning on instruction following: This is the most common use case. Use an instruction-following dataset to fine-tune a pretrained model. This enables the model to handle single-turn directives such as “write”, “list”, and “explain”. The dataset contains samples of instruction-desired output pairs. Models fine-tuned on instruction-following datasets are well-suited to serve as task agents.
- Fine-tuning on chat: This creates a chat model that can generate natural, engaging multi-turn dialogue. The dataset contains samples of conversation history. The model will learn the back-and-forth exchanges while maintaining a persona. The fine-tuned models are chatbots and virtual assistants, such as the one behind ChatGPT.
Numerous datasets are available for fine-tuning. You can search for
them on Hugging Face Hub. One example for instruction-following is tatsu-lab/alpaca, which contains 52K samples of instruction-desired output pairs. One example of a chat dataset is the HuggingFaceH4/ultrachat_200k dataset, which contains 200K samples of chat history.
Below, you will see how to run fine-tuning on these datasets.
Fine-tuning Procedure
From a training-script perspective, fine-tuning a model is no different from pretraining it. The same auto-regressive model architecture is used, and the loss function remains cross-entropy between the output logits and the target tokens. The only difference lies in the dataset used to train the model. It usually fits a specific template, and there are far fewer samples than in the pretraining dataset. Because of the smaller dataset, the fine-tuning process is much shorter, too.
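This shared loss computation can be sketched in a few lines: the targets are the input tokens shifted left by one position, and cross-entropy is taken over the vocabulary dimension. Toy sizes and random logits stand in for a real model here:

```python
import torch
import torch.nn as nn

vocab_size, seq_len = 10, 5
tokens = torch.randint(0, vocab_size, (1, seq_len + 1))  # one tokenized training sample
x, y = tokens[:, :-1], tokens[:, 1:]                     # inputs and next-token targets
logits = torch.randn(1, seq_len, vocab_size)             # stand-in for model(x)
loss = nn.CrossEntropyLoss()(logits.view(-1, vocab_size), y.view(-1))
```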
Let’s fine-tune an instruction-following model. The template to use is as follows:
```
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}
```
This is the one used in the example above. You are expected to provide a dataset with instructions, input, and output. Then, you substitute them into the template above to create a prompt for the model. The model will learn not only to understand your instruction, but also that this template indicates it should act on the instruction and produce the response.
Take the dataset tatsu-lab/alpaca as an example. You can create samples as follows:
```python
from datasets import load_dataset

ds = load_dataset("tatsu-lab/alpaca", split="train")
print(ds)
print("---")
print(ds[0])
print("---")

template = """
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}
"""

samples = []
for sample in ds:
    samples.append(template.format(
        instruction=sample["instruction"],
        input=sample["input"],
        output=sample["output"])
    )
    break  # only build the first sample for demonstration
print(samples[0])
```
When you run this code, you will get:
```
Dataset({
    features: ['instruction', 'input', 'output', 'text'],
    num_rows: 52002
})
---
{'instruction': 'Give three tips for staying healthy.', 'input': '', 'output': '1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.', 'text': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.'}
---

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Input:


### Response:
1.Eat a balanced diet and make sure to include plenty of fruits and vegetables. 
2. Exercise regularly to keep your body active and strong. 
3. Get enough sleep and maintain a consistent sleep schedule.
```
As shown, the tatsu-lab/alpaca dataset contains 52,002 samples in the train split (this dataset has only one split). Each sample has four columns: instruction, input, output, and text. You used instruction, input, and output
from each sample to fill in the template and generate a complete
prompt-response string. This is what you will use to fine-tune your
model.
Knowing how to create samples from a dataset, you can fine-tune your model with the following code:
```python
import datasets
import torch
import torch.nn as nn
import torch.optim.lr_scheduler as lr_scheduler
import tqdm
from datasets import load_dataset
from torch import Tensor
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load training dataset
ds = load_dataset("tatsu-lab/alpaca", split="train")
template = """
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}
""".strip()

class FineTuningDataset(torch.utils.data.Dataset):
    def __init__(self, dataset: datasets.Dataset, tokenizer: AutoTokenizer,
                 seq_length: int, device: torch.device = None):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.device = device
        self.seq_length = seq_length
        self.bot = torch.tensor([tokenizer.bos_token_id], dtype=torch.int64, device=device)
        self.eot = torch.tensor([tokenizer.eos_token_id], dtype=torch.int64, device=device)
        self.pad = torch.tensor([tokenizer.pad_token_id], dtype=torch.int64, device=device)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index: int):
        """Get a sequence of token ids from the dataset.

        [BOT] and [EOT] tokens are added. Clipped and padded to the sequence length.
        """
        seq = template.format(
            instruction=self.dataset[index]["instruction"],
            input=self.dataset[index]["input"],
            output=self.dataset[index]["output"],
        )
        tokens: Tensor = self.tokenizer(seq, return_tensors="pt", add_special_tokens=False).input_ids[0]
        tokens = torch.cat([self.bot, tokens.to(self.device), self.eot], dim=0)
        # pad to target sequence length
        toklen = len(tokens)
        if toklen < self.seq_length+1:
            pad_length = self.seq_length+1 - toklen
            tokens = torch.cat([tokens, self.pad.repeat(pad_length)], dim=0)
        # return the sequence
        x = tokens[:self.seq_length]
        y = tokens[1:self.seq_length+1]
        return x, y

# Load model and tokenizer
model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map=device,
)
model.train()

# Training parameters
epochs = 2
learning_rate = 1e-4
batch_size = 4
seq_length = 1024
num_warmup_steps = 1000
PAD_TOKEN_ID = tokenizer.pad_token_id

# DataLoader, optimizer, scheduler, and loss function
dataset = FineTuningDataset(ds, tokenizer, seq_length=seq_length, device=device)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=learning_rate,
    betas=(0.9, 0.99),
    eps=1e-8,
    weight_decay=0.1,
)
num_training_steps = len(dataloader) * epochs
scheduler = lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=num_training_steps,
    eta_min=0,
)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_TOKEN_ID)

# start training
for epoch in range(epochs):
    pbar = tqdm.tqdm(dataloader, desc=f"Epoch {epoch+1}/{epochs}")
    for batch in pbar:
        # get batched data
        input_ids, target_ids = batch
        # create attention mask for Llama model from transformers
        attn_mask = input_ids.ne(PAD_TOKEN_ID)
        # extract output from model
        logits = model(input_ids, attention_mask=attn_mask).logits
        # compute loss: cross-entropy between logits and target, ignoring padding tokens
        loss = loss_fn(logits.view(-1, logits.size(-1)), target_ids.view(-1))
        # backward with loss and gradient clipping by L2 norm to 1.0
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        pbar.set_postfix(loss=loss.item())
        pbar.update(1)
    pbar.close()

# Save the model
torch.save(model.state_dict(), "model-instruct.pth")
```
This code fine-tunes the Llama 3.2 1B model from Hugging Face Hub (model name: meta-llama/Llama-3.2-1B) with the tatsu-lab/alpaca
dataset. Because the dataset is smaller, the training process is much
shorter than pretraining. You can revise the code to fine-tune with your
own model, such as the one you trained in the previous article.
In particular, you need to change how the model and tokenizer are
initialized and how the model processes the BOT and EOT tokens. The
Llama 3.2 1B model is used because it is larger and more capable than
the one you created in the previous article. It should be easier to see the effect of fine-tuning.
Compared with pretraining, the dataset is created using a template rather than the text itself. The loss is still cross-entropy. However, you use a lower learning rate in fine-tuning because the model is already trained to understand the language, and you do not want to destroy this capability. For a similar reason, you usually do not need warmup steps in the learning rate scheduler. Simply decrease the learning rate steadily from the start to the end of training.
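A minimal sketch of such a warmup-free schedule, assuming PyTorch's CosineAnnealingLR (the toy parameter and step counts are illustrative):

```python
import torch
import torch.optim.lr_scheduler as lr_scheduler

param = torch.nn.Parameter(torch.zeros(1))      # toy parameter standing in for a model
optimizer = torch.optim.AdamW([param], lr=1e-4)
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=0)

lrs = []
for _ in range(100):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()      # a real training step would compute a loss first
    scheduler.step()
# the learning rate starts at its peak and decays monotonically toward zero,
# with no warmup phase at the beginning
```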
Although the training loop closely resembles self-supervised pretraining, the model now learns from curated input-output examples, so this process is called supervised fine-tuning (SFT).
Let’s see another example: This time, you will fine-tune a chat model. The dataset to use is HuggingFaceH4/ultrachat_200k. Below shows what the dataset looks like:
```python
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/ultrachat_200k")
print(ds)
print("---")
print(ds["train_sft"][0])
print("---")

template = "<|{role}|>\n{content}\n"
samples = []
for sample in ds["train_sft"]:
    prompt = ""
    for message in sample["messages"]:
        prompt += template.format(
            role=message["role"],
            content=message["content"],
        )
    samples.append(prompt)
print(samples[0])
```
When you run this code, you will get:
```
DatasetDict({
    train_sft: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 207865
    })
    test_sft: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 23110
    })
    train_gen: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 256032
    })
    test_gen: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 28304
    })
})
---
{'prompt': "These instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+ ... me developer for assistance with customizing your Quick Shop section to include this feature.", 'role': 'assistant'}]}
---
<|user|>
These instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?
On your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!
Your Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.
Does this feature apply to all sections of the theme or just specific ones as listed in the text material?
<|assistant|>
This feature only applies to Collection pages and Featured Collections sections of the section-based themes listed in the text material.
<|user|>
Can you guide me through the process of enabling the secondary image hover feature on my Collection pages and Featured Collections sections?
<|assistant|>
Sure, here are the steps to enable the secondary image hover feature on your Collection pages and Featured Collections sections:
...
```
Unlike the instruction dataset, which has one instruction and
one response per sample, this dataset contains multiple messages per
sample, with the roles “user” and “assistant” alternating. The template
you use should mark the roles with <|user|> and <|assistant|> respectively. The prompt you create may be lengthy, as each sample includes multiple back-and-forth exchanges.
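To make the template concrete, here is a minimal sketch of how one short, made-up two-turn exchange renders under it (the messages are hypothetical, not from the dataset):

```python
# Render a short, made-up two-turn exchange with the same template
template = "<|{role}|>\n{content}\n"

messages = [
    {"role": "user", "content": "What is fine-tuning?"},
    {"role": "assistant", "content": "Additional training after pretraining."},
]

prompt = "".join(
    template.format(role=m["role"], content=m["content"]) for m in messages
)
print(prompt)
```

Each message becomes a role marker line followed by its content, and a real sample simply repeats this pattern for every turn in the conversation.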
To fine-tune the model with this dataset, you can modify how the samples are constructed in the code above:
import datasets
import torch
import torch.nn as nn
import torch.optim.lr_scheduler as lr_scheduler
import tqdm
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load training dataset
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
template = "<|{role}|>\n{content}\n"

class FineTuningDataset(torch.utils.data.Dataset):
    def __init__(self, dataset: datasets.Dataset, tokenizer: AutoTokenizer,
                 seq_length: int, device: torch.device = None):
        self.dataset = dataset
        self.tokenizer = tokenizer
        self.device = device
        self.seq_length = seq_length
        self.bot = torch.tensor([tokenizer.bos_token_id], dtype=torch.int64, device=device)
        self.eot = torch.tensor([tokenizer.eos_token_id], dtype=torch.int64, device=device)
        self.pad = torch.tensor([tokenizer.pad_token_id], dtype=torch.int64, device=device)

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index: int):
        """Get a sequence of token ids from the dataset.

        [BOT] and [EOT] tokens are added. Clipped and padded to the sequence length.
        """
        tokens = self.bot
        for message in self.dataset[index]["messages"]:
            seq = template.format(
                role=message["role"],
                content=message["content"],
            )
            seq = self.tokenizer(seq, return_tensors="pt", add_special_tokens=False).input_ids[0]
            tokens = torch.cat([tokens, seq.to(self.device), self.eot], dim=0)
        mask = torch.ones_like(tokens)
        # pad to target sequence length
        toklen = len(tokens)
        if toklen < self.seq_length+1:
            pad_length = self.seq_length+1 - toklen
            tokens = torch.cat([tokens, self.pad.repeat(pad_length)], dim=0)
            mask = torch.cat([mask, torch.zeros(pad_length).to(self.device)], dim=0)
        # return the sequence
        x = tokens[:self.seq_length]
        y = tokens[1:self.seq_length+1]
        mask = mask[:self.seq_length]
        return x, y, mask

# Load model and tokenizer
model_id = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map=device,
)
model.train()

# Training parameters
epochs = 2
learning_rate = 1e-4
batch_size = 4
seq_length = 1024
num_warmup_steps = 1000
PAD_TOKEN_ID = tokenizer.pad_token_id

# DataLoader, optimizer, scheduler, and loss function
dataset = FineTuningDataset(ds, tokenizer, seq_length=seq_length, device=device)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=learning_rate,
    betas=(0.9, 0.99),
    eps=1e-8,
    weight_decay=0.1,
)
num_training_steps = len(dataloader) * epochs
scheduler = lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=num_training_steps,
    eta_min=0,
)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_TOKEN_ID)

# start training
for epoch in range(epochs):
    pbar = tqdm.tqdm(dataloader, desc=f"Epoch {epoch+1}/{epochs}")
    for batch in pbar:
        # get batched data with attention mask
        input_ids, target_ids, attn_mask = batch
        # extract output from model
        logits = model(input_ids, attention_mask=attn_mask).logits
        # compute loss: cross-entropy between logits and target, ignoring padding tokens
        loss = loss_fn(logits.view(-1, logits.size(-1)), target_ids.view(-1))
        # backward with loss and gradient clipping by L2 norm to 1.0
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        pbar.set_postfix(loss=loss.item())
        pbar.update(1)
    pbar.close()

# Save the model
torch.save(model.state_dict(), "model-chat.pth")
There are a few differences from the instruction fine-tuning code above. First, you need a for-loop in the __getitem__() method of the FineTuningDataset class, since each sample contains multiple exchanges of dialog.
Second, the BOT token marks the beginning of the prompt, and EOT tokens are inserted between messages to mark their end. This is important because, when you use the fine-tuned model for chat, you expect it to generate an EOT token to mark the end of the response. Since the EOT token also serves as the padding token here, you need an attention mask to indicate explicitly where the padding occurs.
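As a minimal sketch of how the input, target, and attention mask line up, here is the same pad-and-shift logic with made-up token ids and plain Python lists (no tensors, and the ids are hypothetical):

```python
# Hypothetical token ids: 1 = BOT, 2 = EOT (which also serves as the pad id)
BOT, EOT, PAD = 1, 2, 2
seq_length = 8

tokens = [BOT, 10, 11, EOT, 12, EOT]   # a short dialog, already tokenized
mask = [1] * len(tokens)               # 1 = real token, 0 = padding

# Pad to seq_length + 1 so the targets can be shifted by one position
pad_length = seq_length + 1 - len(tokens)
tokens += [PAD] * pad_length
mask += [0] * pad_length

x = tokens[:seq_length]        # model input
y = tokens[1:seq_length + 1]   # next-token targets
attn = mask[:seq_length]       # marks which positions are real

print(x)     # [1, 10, 11, 2, 12, 2, 2, 2]
print(y)     # [10, 11, 2, 12, 2, 2, 2, 2]
print(attn)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

Because PAD and EOT share the same id, the attention mask is the only way the model can tell a genuine end-of-turn token from trailing padding, which is why the mask is passed to the model alongside the input ids.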
That’s all you need to fine-tune a chat model. When you use the fine-tuned model for chat, you can reuse the same generate() function as above to generate the response. However, you need to construct the prompt with the correct pattern:
...

# Create a chat loop
chat_history = ""
template = "<|{role}|>\n{content}\n"
while True:
    # Read user input
    prompt = input("User: ").strip()
    if prompt.lower() in ["quit", "exit", "q"]:
        print("Goodbye!")
        break
    chat_history += template.format(role="user", content=prompt)
    # Produce chatbot response
    chat_history += "<|assistant|>\n"
    response = generate(
        model=model,
        tokenizer=tokenizer,
        prompt=chat_history,
        max_tokens=max_tokens,
        temperature=temperature,
        top_k=top_k,
        repetition_penalty=penalty,
        repetition_penalty_range=penalty_range,
        device=device,
    )
    # Print the response and add to the chat history
    print("-" * 20)
    print("Assistant:")
    print(response)
    print("-" * 20)
    chat_history += response.strip() + "\n"
Other Fine-Tuning Techniques
This is the simplest way to fine-tune a model. If the fine-tuned model is not performing as expected, you may consider employing advanced techniques to improve the results. Performance here does not mean the speed at which the model processes the prompt and produces the response, since inference throughput can be boosted with engineering techniques. Rather, it means how well the model's responses align with our expectations.
RLHF (reinforcement learning from human feedback) is a technique for fine-tuning a model in which human feedback on the model's responses is used to guide them toward human expectations. Fine-tuning with RLHF is a reinforcement learning process: instead of a simple loss function, you define a reward function that measures how good the model's response is, and the reward is then used to steer the model's output. The most common algorithm for RLHF is Proximal Policy Optimization (PPO). Another algorithm, Direct Preference Optimization (DPO), claims to be more efficient.
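To make the preference idea concrete, here is a rough sketch of the DPO loss for a single preference pair, using made-up log-probabilities rather than a real model (the numbers and the helper function are illustrative only, not a training setup):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    The loss shrinks when the policy raises the likelihood of the preferred
    response relative to the frozen reference model, and lowers it for the
    rejected one.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen        # log pi(y_w)/pi_ref(y_w)
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi(y_l)/pi_ref(y_l)
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(margin): smaller is better
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up numbers: the policy already slightly prefers the chosen response
loss = dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
                ref_logp_chosen=-13.0, ref_logp_rejected=-14.0)
print(f"{loss:.4f}")  # about 0.5981
```

Unlike PPO, this objective needs no separate reward model or sampling loop: the preference data itself plays the role of the reward signal, which is where DPO's efficiency claim comes from.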
In addition to the fine-tuning described above, you can choose not to update the entire model and instead train a small set of supplementary parameters. This is called parameter-efficient fine-tuning (PEFT). PEFT is based on the observation that updating fewer parameters is faster and requires less memory. The most notable technique in PEFT is LoRA (Low-Rank Adaptation). It keeps the original model's weights frozen. Then, a small number of trainable parameters, called adapters, are added to the output of selected layers of the model. During fine-tuning, the same training loop is used, but only the adapters are updated. It is called low-rank adaptation because, instead of the large weight matrix in the original model, it trains a pair of much smaller matrices whose product approximates the weight update.
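A minimal sketch of the low-rank idea, with tiny hypothetical shapes and plain Python instead of a real PyTorch module: the frozen weight W is d_out × d_in, while the trainable adapter is two small matrices B (d_out × r) and A (r × d_in) with rank r much smaller than either dimension.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

d_out, d_in, r = 3, 4, 1  # rank r is much smaller than d_out and d_in

# Frozen pretrained weight (hypothetical values), never updated
W = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0]]

# Trainable low-rank adapter: only d_out*r + r*d_in = 7 parameters
# instead of the d_out*d_in = 12 in W
B = [[0.5], [0.0], [0.0]]   # d_out x r
A = [[1, 1, 0, 0]]          # r x d_in

x = [1.0, 2.0, 3.0, 4.0]

# LoRA forward pass: y = W x + B (A x)
h = matvec(A, x)      # project into the r-dimensional bottleneck
delta = matvec(B, h)  # expand back: the low-rank update to W x
y = [wx + d for wx, d in zip(matvec(W, x), delta)]
print(y)  # [2.5, 2.0, 3.0]
```

Only B and A receive gradients during fine-tuning, and after training the product BA can even be merged into W so that inference costs nothing extra.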
These are advanced fine-tuning techniques that typically require additional libraries to run the training. These will not be covered in this article.
Further Readings
Below are some resources that you may find useful:
- Jay Alammar & Maarten Grootendorst, Hands-On Large Language Models, O’Reilly 2024
- Sebastian Raschka, Build a Large Language Model (From Scratch), Manning Publications 2024
- Illustrating Reinforcement Learning from Human Feedback (RLHF), blog post by Hugging Face
- LLM Research Insights: Instruction Masking and New LoRA Finetuning Experiments, blog post by Sebastian Raschka
- Ouyang et al. (2022) Training language models to follow instructions with human feedback
- Hu et al. (2021) LoRA: Low-Rank Adaptation of Large Language Models
- Rafailov et al. (2024) Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- Schulman et al. (2017) Proximal Policy Optimization Algorithms
Summary
In this article, you learned how to fine-tune a language model. In particular, you learned that:
- You can fine-tune a model on a new corpus, as in self-supervised pretraining, to enhance its understanding of a specific domain.
- You can fine-tune a model to follow instructions or to chat with a persona using an instruction-following dataset or a chat dataset, respectively, with an appropriate template.
- Fine-tuning a model is usually faster because the pretrained model already understands the language, and you can use a much smaller dataset.
- There are advanced fine-tuning techniques to improve model performance, such as RLHF and PEFT.
