
Friday, 20 March 2026

A Gentle Introduction to Language Model Fine-tuning

 

After pretraining, a language model has learned the structure of human language. You can enhance the model’s domain-specific understanding by training it on additional data. You can also train the model to perform specific tasks when you provide a specific instruction. This additional training after pretraining is called fine-tuning. In this article, you will learn how to fine-tune a language model. Specifically, you will learn:

  • Different examples of fine-tuning and what their goals are
  • How to convert a pretraining script to perform fine-tuning

Let’s get started!

A Gentle Introduction to Language Model Fine-tuning
Photo by Nick Night. Some rights reserved.

Overview

This article is divided into four parts; they are:

  • The Reason for Fine-tuning a Model
  • Dataset for Fine-tuning
  • Fine-tuning Procedure
  • Other Fine-Tuning Techniques

The Reason for Fine-tuning a Model

Once you train your decoder-only transformer model, you have a text generator. You can provide any prompt, and the model will generate some text. What it generates depends on the model you have.

Let’s consider a very simple generation algorithm:
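A minimal sketch of such a generate() function is shown below, assuming a PyTorch model that maps a batch of token ids to per-position logits over the vocabulary; the helper structure is illustrative rather than a definitive implementation:

```python
import torch

def generate(model, input_ids, max_new_tokens=50, temperature=0.8,
             repetition_penalty=1.2, top_k=50):
    """Sampling-based generation with temperature, repetition penalty, and top-k."""
    generated = input_ids.clone()                # shape: (1, seq_len)
    for _ in range(max_new_tokens):
        logits = model(generated)[:, -1, :]      # next-token logits: (1, vocab)
        logits = logits / temperature            # step 1: temperature scaling
        # step 2a: repetition penalty on tokens already generated
        for token_id in set(generated[0].tolist()):
            score = logits[0, token_id]
            logits[0, token_id] = (score / repetition_penalty if score > 0
                                   else score * repetition_penalty)
        # step 2b: top-k filtering -- mask everything below the k-th logit
        kth_logit = torch.topk(logits, top_k).values[:, [-1]]
        logits[logits < kth_logit] = -float("inf")
        # step 3: convert to probabilities, then sample
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated = torch.cat([generated, next_token], dim=-1)
    return generated
```

The repetition penalty here follows the common convention of dividing positive logits and multiplying negative ones, so a penalized token always becomes less likely.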

The function generate() above is an inefficient yet simple sampling-based text-generation method. Your model takes a prompt and produces a tensor of logits for the next token. They are called logits because they are, up to an additive constant, the log of the probabilities of the next token. The model works on tokens, and to generate one, there are a few steps to perform on the logits:

  1. Scale the logits by a temperature parameter. This sharpens or flattens the probability distribution over the next token: a temperature below 1 favors the most likely tokens even more, while a temperature above 1 spreads probability toward less likely ones.
  2. Manipulate the logits. In the above, you applied a repetition penalty to penalize tokens that already appear in the generated sequence. You also applied top-𝑘 filtering to limit the choice to the 𝑘 most likely tokens.
  3. Convert the logits to probabilities, then use multinomial sampling to select the next token.

You can make this simpler by always using torch.argmax() to pick the next token. This is called greedy decoding. It is generally not preferred because the output does not appear natural, and no variation is permitted.

You can try it with your own model trained in the previous article. Below is the complete code to generate text from a simple prompt:

If you use the model you trained exactly as described in the previous article, the results will likely be poor. Even as a sentence-completion model, it may produce gibberish. This is because you trained a very small model on a very small dataset (merely 10 billion tokens). Any model of reasonable performance is trained on a trillion or more tokens.

You can modify the code above to use a model from Hugging Face Hub. For example, you can use TinyLlama v1.1 or the Llama 3.2 1B model, both of which are compact models with roughly 1 billion parameters. The modified script is as follows:
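A sketch of such a script using the transformers library (assuming it is installed) is shown below; the sampling parameters mirror the steps described earlier, and the model id can be swapped for any causal LM on the Hub:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def complete(prompt, model_id="TinyLlama/TinyLlama_v1.1", max_new_tokens=100):
    """Load a pretrained model from the Hugging Face Hub and complete a prompt."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=max_new_tokens,
                                do_sample=True, temperature=0.8, top_k=50,
                                repetition_penalty=1.2)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example usage (downloads the model weights on first run):
# print(complete("The capital of France is"))
```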

When you run this script, you may get a result like this:

The result is not bad. Let’s try a more sophisticated prompt:

You will find that the same model now gives you a result that is not quite right:

But if you replace the model with an “instruct” model, such as the one below:

You will get:

The answer is correct: the model appears to understand what you want and gives you a sensible answer.

Why is there a difference? A pretrained model is a language model that has learned the structure of a language. However, a model trained on a large corpus of random text is unlikely to know that an instruction you provide means it has to act on it. Instruction fine-tuning provides many examples of such instructions and trains the model further, so that when you give it an instruction, the model acts on it. Similarly, conversation fine-tuning creates a chat model: by training the model on chat histories, it learns how to respond to the user’s message.

Dataset for Fine-tuning

The success of fine-tuning a model largely depends on the dataset you use. Depending on the fine-tuning goal, you need to select and prepare the dataset accordingly. Some common examples are:

  • Fine-tuning on text completion and reasoning: This is the simplest use case. It simply means using a different corpus to retrain the model. For example, further training Llama 2 on a code-specific dataset yields the Code Llama model, which can generate code and natural-language descriptions of code.
  • Fine-tuning on instruction following: This is the most common use case. Use an instruction-following dataset to fine-tune a pretrained model. This enables the model to handle single-turn directives such as “write”, “list”, and “explain”. The dataset contains samples of instruction-desired output pairs. Models fine-tuned on instruction-following datasets are well-suited to serve as task agents.
  • Fine-tuning on chat: This creates a chat model that can generate natural, engaging multi-turn dialogue. The dataset contains samples of conversation history. The model will learn the back-and-forth exchanges while maintaining a persona. The fine-tuned models are chatbots and virtual assistants, such as the one behind ChatGPT.

Numerous datasets are available for fine-tuning. You can search for them on Hugging Face Hub. One example for instruction-following is tatsu-lab/alpaca, which contains 52K samples of instruction-desired output pairs. One example of a chat dataset is the HuggingFaceH4/ultrachat_200k dataset, which contains 200K samples of chat history.

Below, you will see how to run fine-tuning on these datasets.

Fine-tuning Procedure

From a training-script perspective, fine-tuning a model is no different from pretraining it. The same auto-regressive model architecture is used, and the loss function remains cross-entropy between the output logits and the target tokens. The only difference lies in the dataset used to train the model. It usually fits a specific template, and there are far fewer samples than in the pretraining dataset. Because of the smaller dataset, the fine-tuning process is much shorter, too.

Let’s fine-tune an instruction-following model. The template to use is as follows:
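This is the template popularized by the Stanford Alpaca project. It has two variants, one for samples with an input field and one without; expressed in Python, with a small helper (the function name is mine) to fill it in:

```python
# Alpaca prompt template, in two variants
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{output}"
)

def format_sample(sample):
    """Fill the matching template variant with one dataset sample."""
    template = PROMPT_WITH_INPUT if sample.get("input") else PROMPT_NO_INPUT
    return template.format(**sample)
```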

This is the one used in the example above. You are expected to provide a dataset with instructions, input, and output. Then, you substitute them into the template above to create a prompt for the model. The model will learn not only to understand your instruction, but also that this template indicates it should act on the instruction and produce the response.

Take the dataset tatsu-lab/alpaca as an example. You can create samples as follows:
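A sketch follows, assuming the datasets library is installed; build_samples() is a hypothetical helper name that fills the Alpaca-style template for each row:

```python
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def build_samples(rows):
    """Turn each row's instruction, input, and output into one training string."""
    return [ALPACA_TEMPLATE.format(instruction=row["instruction"],
                                   input=row["input"],
                                   output=row["output"])
            for row in rows]

# Example usage (downloads the dataset on first run):
# from datasets import load_dataset
# dataset = load_dataset("tatsu-lab/alpaca", split="train")
# print(dataset)                        # number of rows and column names
# print(build_samples([dataset[0]])[0])
```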

When you run this code, you will get:

As shown, the tatsu-lab/alpaca dataset contains 52,002 samples in the train split (this dataset has only one split). Each sample has four columns: instruction, input, output, and text. You used instruction, input, and output from each sample to fill in the template and generate a complete prompt-response string. This is what you will use to fine-tune your model.

Knowing how to create samples from a dataset, you can fine-tune your model with the following code:
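A minimal sketch of the fine-tuning loop is shown below, assuming a Hugging Face causal language model and tokenizer, and a list `prompts` of templated prompt-response strings; batching, gradient accumulation, and checkpointing are omitted for brevity:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def linear_decay(step, total_steps):
    """No warmup: decay the learning rate linearly from 1x down to 0."""
    return max(0.0, 1.0 - step / total_steps)

def fine_tune(model, tokenizer, prompts, epochs=1, lr=2e-5, max_length=512):
    """Supervised fine-tuning: next-token cross-entropy on templated strings."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device).train()
    optimizer = AdamW(model.parameters(), lr=lr)
    total_steps = epochs * len(prompts)
    scheduler = LambdaLR(optimizer, lambda s: linear_decay(s, total_steps))
    for _ in range(epochs):
        for text in prompts:
            batch = tokenizer(text, return_tensors="pt", truncation=True,
                              max_length=max_length).to(device)
            # For a causal LM, labels are the input ids; the model shifts them
            # internally and computes cross-entropy on the next token.
            outputs = model(**batch, labels=batch["input_ids"])
            outputs.loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

# Example usage (downloads the model weights; `prompts` is a list of
# templated prompt-response strings built from the dataset):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
# fine_tune(model, tokenizer, prompts)
```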

This code fine-tunes the Llama 3.2 1B model from Hugging Face Hub (model name: meta-llama/Llama-3.2-1B) with the tatsu-lab/alpaca dataset. Because the dataset is smaller, the training process is much shorter than pretraining. You can revise the code to fine-tune with your own model, such as the one you trained in the previous article. In particular, you need to change how the model and tokenizer are initialized and how the model processes the BOT and EOT tokens. The Llama 3.2 1B model is used because it is larger and more capable than the one you created in the previous article. It should be easier to see the effect of fine-tuning.

Compared with pretraining, the training text is built from a template rather than taken verbatim from a corpus. The loss is still cross-entropy. However, you use a lower learning rate in fine-tuning because the model has already learned the language, and you do not want to destroy this capability. For a similar reason, you usually do not need warmup steps in the learning rate scheduler; simply decrease the learning rate steadily from the start to the end of training.

Because the training signal comes from labeled prompt-response pairs rather than raw text, this process is called supervised fine-tuning (SFT), in contrast to the self-supervised pretraining.

Let’s see another example: This time, you will fine-tune a chat model. The dataset to use is HuggingFaceH4/ultrachat_200k. Below shows what the dataset looks like:
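A small helper to inspect one sample is sketched below; show_conversation() is a hypothetical name, and the commented lines show how you might load the dataset:

```python
def show_conversation(sample):
    """Render one chat sample (a list of role/content messages) as text."""
    lines = []
    for message in sample["messages"]:
        lines.append(f"[{message['role']}]")
        lines.append(message["content"])
        lines.append("")
    return "\n".join(lines)

# Example usage (downloads the dataset on first run):
# from datasets import load_dataset
# dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
# print(dataset)                       # columns and number of rows
# print(show_conversation(dataset[0]))
```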

When you run this code, you will get:

Unlike the instruction dataset, which has one instruction and one response per sample, this dataset contains multiple messages per sample, with the roles “user” and “assistant” alternating. The template you use should mark the roles with <|user|> and <|assistant|> respectively. The prompt you create may be lengthy, as each sample includes multiple back-and-forth exchanges.

To fine-tune the model with this dataset, you can modify the code above on how you create samples:
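A sketch of such a dataset class follows. It assumes a tokenizer exposing encode() together with bot_token_id and eot_token_id attributes, matching the custom setup described in the text; treat the details as illustrative:

```python
import torch
from torch.utils.data import Dataset

class FineTuningDataset(Dataset):
    """Turn multi-turn chat samples into token sequences for fine-tuning."""

    def __init__(self, samples, tokenizer, max_length=1024):
        self.samples = samples
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        tokens = [self.tokenizer.bot_token_id]   # BOT marks the start
        for message in self.samples[index]["messages"]:
            header = f"<|{message['role']}|>\n"
            tokens += self.tokenizer.encode(header + message["content"])
            tokens.append(self.tokenizer.eot_token_id)  # EOT ends each message
        tokens = tokens[: self.max_length]
        # Pad with EOT; the attention mask separates real tokens from padding
        pad_len = self.max_length - len(tokens)
        attention_mask = [1] * len(tokens) + [0] * pad_len
        tokens += [self.tokenizer.eot_token_id] * pad_len
        return torch.tensor(tokens), torch.tensor(attention_mask)
```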

There are a few differences from the instruction fine-tuning code above. First, you need a for-loop in the __getitem__() method of the FineTuningDataset class, since each sample contains multiple exchanges of dialogue. Second, the BOT token marks the beginning of the prompt, and EOT tokens are inserted between messages to mark their ends. This is important because, when you use the fine-tuned model for chat, you expect it to generate an EOT token to mark the end of its response. Since the EOT token is also used as the padding token here, you need to create an attention mask explicitly to indicate where padding occurs.

That’s all you need to fine-tune a chat model. When you use the fine-tuned model for chat, you can reuse the same generate() function as above to generate the response. However, you need to construct the prompt with the correct pattern:
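A sketch of the prompt construction, assuming roles are marked with <|user|> and <|assistant|> and that the EOT token renders as the string </s> (adjust both to match your tokenizer):

```python
def build_chat_prompt(history, eot="</s>"):
    """Format chat history with role markers, ending with an open assistant turn."""
    parts = [f"<|{m['role']}|>\n{m['content']}{eot}\n" for m in history]
    parts.append("<|assistant|>\n")   # the model generates from here, up to EOT
    return "".join(parts)

# prompt = build_chat_prompt([{"role": "user", "content": "What is SFT?"}])
# Feed the prompt string to the same generate() function as before.
```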

Other Fine-Tuning Techniques

This is the simplest way to fine-tune a model. If the fine-tuned model is not performing as expected, you may consider employing advanced techniques to improve the results. This is not about the speed at which the model processes the prompt and produces the response, as engineering techniques can boost inference throughput. Rather, this is about aligning the model’s response with our expectations.

RLHF (reinforcement learning from human feedback) is a technique for fine-tuning a model so that its responses align more closely with human expectations. Instead of a simple loss function, you create a reward function, trained on human preference data, that measures how good the model’s response is. The reward then guides a reinforcement learning process that updates the model. The most common algorithm for RLHF is Proximal Policy Optimization (PPO). Another algorithm that claims to be more efficient is Direct Preference Optimization (DPO).

In addition to the fine-tuning described above, you can choose not to update the entire model and instead train only a small set of additional parameters. This is called parameter-efficient fine-tuning (PEFT). PEFT is based on the observation that training fewer parameters is faster and requires far less memory. The most notable technique in PEFT is LoRA (Low-Rank Adaptation). It keeps the original model’s weights frozen. Then, a small number of trainable parameters, called adapters, are added to the output of selected layers of the model. During fine-tuning, the same training loop is used, but only the adapters are updated. It is called low-rank adaptation because, instead of updating a large weight matrix in the original model, it learns the update as the product of two much smaller low-rank matrices.
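To make the idea concrete, here is a minimal sketch of a LoRA adapter around a single linear layer; real implementations (such as the peft library) handle many layers and weight merging, but the core computation is this:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank adapter.

    The effective weight is W + (alpha / r) * B @ A, where A (r x in) and
    B (out x r) are much smaller than W (out x in).
    """
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the original weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # Original path plus the scaled low-rank update
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)
```

Because B starts at zero, the adapter initially leaves the model's behavior unchanged; fine-tuning then moves only A and B.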

These are advanced fine-tuning techniques that typically require additional libraries to run the training. These will not be covered in this article.

Further Readings

Below are some resources that you may find useful:

Summary

In this article, you learned how to fine-tune a language model. In particular, you learned that:

  • You can fine-tune a model on a new corpus, as in self-supervised pretraining, to enhance its understanding of a specific domain.
  • You can fine-tune a model to follow instructions or to chat with a persona using an instruction-following dataset or a chat dataset, respectively, with an appropriate template.
  • Fine-tuning a model is usually faster because the pretrained model already understands the language, and you can use a much smaller dataset.
  • There are advanced fine-tuning techniques to improve model performance, such as RLHF and PEFT.
