
Friday, 20 March 2026

Pretraining a Llama Model on Your Local GPU


Decoder-only language models like Llama are usually trained using self-supervised learning objectives on large amounts of text. This is called pretraining to distinguish it from later fine-tuning steps on specific tasks. In this article, you will learn how to pretrain a Llama model on a local GPU. Specifically, you will learn how to:

  • Prepare the training data
  • Run the pretraining

Let’s get started.

Photo by Hongbin. Some rights reserved.

Overview

This article is divided into three parts; they are:

  • Training a Tokenizer with Special Tokens
  • Preparing the Training Data
  • Running the Pretraining

Training a Tokenizer with Special Tokens

The model architecture you will use is the same as the one created in the previous post. This is a 12-layer Llama model with a vocabulary size of 50,000. The data you will use for pretraining is the HuggingFaceFW/fineweb dataset.

To prepare the training data, you first need to set up the tokenizer. To recap, the following code trains a BPE tokenizer on the HuggingFaceFW/fineweb dataset and saves it to a file:
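The original listing is not reproduced here, so the following is a sketch of that step using the Hugging Face tokenizers library. The special-token names, vocabulary size, and filename come from the text; the trainer settings and the tiny inline corpus (standing in for text streamed from HuggingFaceFW/fineweb) are assumptions for illustration:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

def train_bpe_tokenizer(texts, vocab_size=50000, path="bpe_50k.json"):
    """Train a byte-level BPE tokenizer with the special tokens used for pretraining."""
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[PAD]", "[BOT]", "[EOT]", "[UNK]"],
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    tokenizer.save(path)
    return tokenizer

# Tiny inline corpus so the snippet runs on its own; the article trains
# on text from the HuggingFaceFW/fineweb dataset here instead
corpus = ["Decoder-only language models are trained on large amounts of text."] * 100
tokenizer = train_bpe_tokenizer(corpus, vocab_size=1000, path="bpe_50k.json")
print(tokenizer.get_vocab_size())
```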

This tokenizer uses the BPE (byte-pair encoding) algorithm at the byte level. Normally, it would not emit any unknown tokens, but you still set a special token for them. Additionally, you set special tokens for the beginning of text ([BOT]), end of text ([EOT]), and padding ([PAD]). These are useful for next-token prediction.

This code automatically uses all CPU cores. Running this code will take a few minutes on a high-end computer. The trained tokenizer will be saved to a file named bpe_50k.json. Once trained, you can load it back with the following code:
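Loading it back is a one-liner with Tokenizer.from_file. The guard below is an addition for this sketch: it writes a minimal placeholder tokenizer so the snippet runs even if the training step has not been executed yet.

```python
import os

from tokenizers import Tokenizer, models

path = "bpe_50k.json"
if not os.path.exists(path):
    # Placeholder so this snippet runs on its own; normally the file
    # comes from the tokenizer-training step above
    Tokenizer(models.BPE(unk_token="[UNK]")).save(path)

tokenizer = Tokenizer.from_file(path)
print(tokenizer.get_vocab_size())
```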

Note that you trained the tokenizer with a vocabulary size of 50,000. This is generally sufficient for a single-language model. However, if you intend to train a model for multiple languages, a larger vocabulary size is preferred.

Preparing the Training Data

Pretraining a language model means predicting the next token in a sequence. With the training data, you need to tokenize the text to create a tensor of integer token IDs and a shift-by-one version as the prediction target.

As you can see from the previous section, you can load the dataset and print out the text as strings by iterating over the dataset object:

This dataset is small compared to those usually used for language model training. However, it is still large enough to contain diverse samples of human language.

For pretraining, you need to create a PyTorch Dataset object so that your model can consume it, as follows:
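A minimal sketch of such a dataset class, assuming the tokenizer exposes token_to_id and encode as in the tokenizers library; the class name and the default sequence length of 512 are assumptions:

```python
import torch
from torch.utils.data import Dataset

class PretrainDataset(Dataset):
    """Wrap a Hugging Face text dataset for next-token prediction."""

    def __init__(self, hf_dataset, tokenizer, seq_len=512):
        self.dataset = hf_dataset
        self.tokenizer = tokenizer
        self.seq_len = seq_len
        self.bot_id = tokenizer.token_to_id("[BOT]")
        self.eot_id = tokenizer.token_to_id("[EOT]")
        self.pad_id = tokenizer.token_to_id("[PAD]")

    def __len__(self):
        # one (x, y) pair per text sample in the wrapped dataset
        return len(self.dataset)

    def __getitem__(self, index):
        text = self.dataset[index]["text"]
        # surround the token IDs with the begin/end-of-text markers
        ids = [self.bot_id] + self.tokenizer.encode(text).ids + [self.eot_id]
        # seq_len + 1 tokens are needed to form the input and the shifted
        # target; clip long sequences and pad short ones
        ids = ids[: self.seq_len + 1]
        ids += [self.pad_id] * (self.seq_len + 1 - len(ids))
        # int64 is the type CrossEntropyLoss expects for class targets
        x = torch.tensor(ids[:-1], dtype=torch.int64)  # input sequence
        y = torch.tensor(ids[1:], dtype=torch.int64)   # shift-by-one target
        return x, y
```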

This is the simplest way to tokenize text data for pretraining. You wrap the Hugging Face dataset object, reporting its number of samples in the __len__ method. In the __getitem__ method, you tokenize a particular text sample into a tensor of integer token IDs. You add the beginning-of-text and end-of-text tokens to help with pretraining: given only the beginning-of-text token, the model learns to predict the first token of a sentence, and given the entire sequence, it learns to predict the end-of-text token.

A transformer model does not limit the length you pass to it, except for a maximum sequence length that the positional encoding can handle. However, when you pass multiple sequences as a batch, you need to ensure all sequences have the same length so you can stack them into a single tensor. You add padding tokens to shorter sequences and clip longer sequences to the target sequence length.

Pretraining is self-supervised learning. The label for the expected output is already in the input sequence. Therefore, you set x as the input sequence and its shift-by-one version as the target sequence y. You want them to be PyTorch tensors instead of Python lists so you can use them with a PyTorch data loader. You must also set the data type to int64 because PyTorch’s CrossEntropyLoss requires target class indices in this type, including the padding token IDs it is told to ignore when computing the training loss.

You can test the dataset by creating a DataLoader object and drawing a batch from it:
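A sketch of that check; a toy dataset with the same (x, y) interface stands in for the real one so the snippet runs on its own:

```python
import torch
from torch.utils.data import DataLoader, Dataset

# Stand-in with the same (x, y) interface as the pretraining dataset
class ToyDataset(Dataset):
    def __init__(self, n=32, seq_len=512):
        self.n, self.seq_len = n, seq_len
    def __len__(self):
        return self.n
    def __getitem__(self, index):
        ids = torch.randint(0, 50000, (self.seq_len + 1,), dtype=torch.int64)
        return ids[:-1], ids[1:]

loader = DataLoader(ToyDataset(), batch_size=8, shuffle=False)
x, y = next(iter(loader))
print(x.shape, y.shape)  # torch.Size([8, 512]) torch.Size([8, 512])
```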

Running the Pretraining

Once you have the input and target data ready from the dataset, running pretraining on a language model is no different from training other deep learning models.

Using the model code from the previous post, let’s first create a model object:
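The model class of the previous post is not reproduced here. As a stand-in with the same outline (12 layers, 50,000-token vocabulary), the sketch below builds an equivalent model with Hugging Face transformers; the hidden size, head count, and feed-forward width are assumptions, so the parameter count will only roughly match the 171 million figure:

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

# Stand-in for the custom model of the previous post: a 12-layer Llama
# built with Hugging Face transformers; the widths below are assumptions
config = LlamaConfig(
    vocab_size=50000,
    hidden_size=768,
    intermediate_size=3072,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=512,
)
model = LlamaForCausalLM(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")
```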

This is a small model for demonstration purposes. It has only 171 million parameters, much smaller than any large language model you can find on the internet.

Next, you should define the training parameters. Depending on your hardware, you may want to adjust the batch size, but keeping the sequence length moderately long helps the model learn context. Here is the strategy to use:

  • This dataset has only a training split. For simplicity, the data is not shuffled, no holdout set is created, and the training loop does not contain any evaluation step.
  • Next-token prediction is a classification problem over the entire vocabulary. Naturally, the loss function is cross-entropy. You should ensure that padding tokens are excluded from the loss computation, as they are placeholders rather than real targets.
  • Set the sequence length to 512. The resources required to train a model scale as O(N²) with sequence length N. Therefore, you prefer to keep it short, but a sequence length that is too short prevents the model from understanding longer contexts.
  • Following best practices for training large language models, use a cosine learning rate scheduler with a warmup period. The warmup period can be set to a fixed number of steps or to a percentage of the total training steps (e.g., 0.1%-2%). Let’s set it to 1,000 steps here.
  • Once the sequence length is determined, adjust the batch size to fit your GPU memory. You can start with 8, which empirically fits into 12GB of VRAM.
  • With 14 million samples and 10 billion tokens in the HuggingFaceFW/fineweb 10B dataset, you probably do not need to train for many epochs. In fact, many large language models are trained for only 1-3 epochs on very large datasets.

Let’s put these parameters together to define the training configuration:
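A sketch of that configuration, using the values quoted in this section (peak learning rate 1e-3, 1,000 warmup steps); the total step count and the padding token ID are placeholders, and a tiny module stands in for the model:

```python
import torch
import torch.nn as nn

peak_lr = 1e-3         # peak learning rate from the text
warmup_steps = 1000    # warmup period from the text
total_steps = 100_000  # placeholder: epochs x batches per epoch
pad_id = 0             # placeholder: the [PAD] token ID from the tokenizer

model = nn.Linear(8, 8)  # stand-in for the Llama model
optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)

# linear warmup to the peak, then cosine decay, switched at step 1,000
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_steps)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps - warmup_steps)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

# padding positions are skipped when computing the loss
loss_fn = nn.CrossEntropyLoss(ignore_index=pad_id)
```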

The AdamW optimizer is configured with a peak learning rate of 1e-3. Other parameters are set to their defaults. The cosine scheduler from PyTorch is combined with a linear scheduler to implement the warmup period. They are combined using the SequentialLR scheduler and configured to switch from a linear to a cosine schedule at the 1,000th step.

Note that you did not set streaming=True when loading the dataset for training, nor did you shuffle the dataset. This makes the DataLoader object deterministic. This way, you can easily determine the total number of training steps, which helps you set up the learning rate scheduler.

The loss function uses nn.CrossEntropyLoss with the padding token ID set as the ignore index. This means whenever the reference target is a padding token, the loss is not computed. This is important to match the behavior you defined when you created the dataset object in the previous section.

This is a small model and a small dataset by large language model standards. However, the training is still very slow. Running the training from scratch on a single GPU will take several hundred hours. It is important that you can checkpoint the model and resume training. Let’s implement this in a training loop:
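The loop itself is not reproduced here; the sketch below shows the same structure (resume from a checkpoint, tqdm progress bar, shift-by-one loss with padding ignored, periodic checkpointing) with tiny stand-in components so it runs on its own. In the article, resuming skips already-seen samples through the datasets library; here a simple continue stands in for that:

```python
import os

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from tqdm import tqdm

# --- tiny stand-ins so this sketch runs on its own; replace them with the
# --- model, dataset, optimizer, scheduler, and loss from the steps above
VOCAB, SEQ_LEN, PAD_ID = 1000, 32, 0

class ToyDataset(Dataset):
    def __len__(self):
        return 64
    def __getitem__(self, index):
        ids = torch.randint(3, VOCAB, (SEQ_LEN + 1,), dtype=torch.int64)
        return ids[:-1], ids[1:]

model = nn.Sequential(nn.Embedding(VOCAB, 64), nn.Linear(64, VOCAB))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_ID)
dataset = ToyDataset()

CKPT_PATH = "pretrain_ckpt.pt"
EPOCHS, BATCH_SIZE = 1, 8

# resume from a checkpoint if one exists
start_epoch = start_batch = 0
if os.path.exists(CKPT_PATH):
    ckpt = torch.load(CKPT_PATH)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    start_epoch, start_batch = ckpt["epoch"], ckpt["batch"]

for epoch in range(start_epoch, EPOCHS):
    loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=False)
    progress = tqdm(loader, desc=f"Epoch {epoch}")
    for batch_idx, (x, y) in enumerate(progress):
        if epoch == start_epoch and batch_idx < start_batch:
            continue  # skip batches already consumed before the checkpoint
        # a real Llama model would also take an attention mask here, e.g.
        # attention_mask = (x != PAD_ID), plus causal masking in the model
        logits = model(x)  # (batch, seq_len, vocab)
        # flatten batch and sequence dimensions for CrossEntropyLoss
        loss = loss_fn(logits.reshape(-1, VOCAB), y.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()
        progress.set_postfix(loss=float(loss))
        if batch_idx % 4 == 3:  # checkpoint periodically
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "scheduler": scheduler.state_dict(),
                        "epoch": epoch,
                        "batch": batch_idx + 1}, CKPT_PATH)
    start_batch = 0  # later epochs start from the first batch
```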

When you checkpoint the training, you need to save the model state, the optimizer state, and the scheduler state. You also need to remember the epoch and batch index so you can resume from the same batch in the dataset.

You visualize the training progress with a progress bar from the tqdm library. During training, you pull a pair of input and target tensors from the DataLoader object. The datasets library allows you to skip an arbitrary number of samples. You use this to create a DataLoader object to resume from the previous checkpoint.

Then you create an attention mask to mask out padding tokens and enable causal masking to control the self-attention mechanism. The model output is a 3D tensor with the same batch size and sequence length as your input. You need to reshape it for the loss function, then update the model with the computed loss. Everything is standard for training a deep learning model.

At the end, you can save the model so you can reuse it for inference:
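A minimal sketch of the save-and-reload step using torch.save on the state dict; the filename is an assumption, and a tiny stand-in module replaces the trained model:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the trained Llama model

# save the weights for later inference; the filename is an assumption
torch.save(model.state_dict(), "llama_pretrained.pt")

# to reuse it, rebuild the same architecture and load the weights back
restored = nn.Linear(8, 8)
restored.load_state_dict(torch.load("llama_pretrained.pt"))
```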

Depending on your use case, you may want to save the base model (the transformer backbone), the full pretraining model (with the language-model head), or both. The base model is useful as a starting point for other tasks, while the full model is useful as a generative model.

For completeness, the full code for the training puts together the tokenizer, dataset, and training loop from the sections above.

Note that this is a simplified training recipe. A professional model training process would use a much larger dataset on a much larger model. For example, Llama 2 models with 7B-70B parameters are trained on 2 trillion tokens. The hyperparameters for training, such as the learning rate, would be tuned before they are finalized for actual training.

Moreover, it would be more efficient to train the model with shorter sequence lengths first, then expand to longer ones later. It is also a known technique to train the model on lower-quality data initially and switch to higher-quality data toward the end, which tends to improve the final model. None of these techniques is implemented in the code above. You can refer to the previous post for techniques to improve the training.

Further Reading

Below are some further reading materials that you may find useful:

Summary

In this article, you learned how to pretrain a Llama model on a single GPU. Specifically, you learned how to:

  • Train a tokenizer with special tokens for next-token prediction
  • Prepare the training data for pretraining
  • Run the pretraining on a single GPU with checkpointing
