Transformer models are trained with a fixed sequence length, but during inference, they may need to process sequences of different lengths. This poses a challenge because positional encodings are computed based on the sequence length. The model might struggle with positional encodings it hasn’t encountered during training.
The ability to handle varying sequence lengths is crucial for a model. This post explores how different positional encoding methods address this challenge.
Kick-start your project with my book Building Transformer Models From Scratch with PyTorch. It provides self-study tutorials with working code.
Let’s get started.

Interpolation in Positional Encodings and Using YaRN for Larger Context Window
Photo by enkuu smile_. Some rights reserved.
Overview
This post is divided into three parts; they are:
- Interpolation and Extrapolation in Sinusoidal Encodings and RoPE
- Interpolation in Learned Encodings
- YaRN for Larger Context Window
Interpolation and Extrapolation in Sinusoidal Encodings and RoPE
Sinusoidal encodings excel at extrapolation due to their use of continuous functions:

$$PE(p, 2i) = \sin\left(\frac{p}{10000^{2i/d}}\right), \qquad PE(p, 2i+1) = \cos\left(\frac{p}{10000^{2i/d}}\right)$$

You can simply substitute any position $p$ into these formulas, including positions beyond the training length, and obtain a well-defined encoding.

Alternatively, you can use interpolation. Instead of using the position $p$ directly, you evaluate the encoding at the scaled position

$$p' = p \cdot \frac{L}{L'}$$

where $L$ is the maximum sequence length seen during training and $L'$ is the longer target length. This squeezes the new positions into the range the model has already seen.
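The extrapolation and interpolation options can be sketched in a few lines of NumPy. The function and variable names here are illustrative, not from any particular library:

```python
import numpy as np

def sinusoidal_encoding(positions, dim, base=10000.0):
    """Sinusoidal positional encoding for the given (possibly fractional) positions."""
    i = np.arange(dim // 2)                # dimension-pair index
    freq = base ** (-2 * i / dim)          # 1 / 10000^(2i/d)
    angles = np.outer(positions, freq)     # (num_positions, dim/2)
    pe = np.empty((len(positions), dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

train_len, new_len, dim = 512, 1024, 64

# Extrapolation: evaluate the formula directly at the new, larger positions
pe_extrapolated = sinusoidal_encoding(np.arange(new_len), dim)

# Interpolation: squeeze positions 0..new_len-1 into the trained range 0..train_len
scaled_positions = np.arange(new_len) * (train_len / new_len)
pe_interpolated = sinusoidal_encoding(scaled_positions, dim)
```

Because the encoding is a continuous function of position, fractional positions like 1.5 are just as valid as integer ones, which is what makes interpolation possible without any change to the formula.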
These techniques also apply to RoPE.
The function generating sinusoidal positional encodings or RoPE requires no modification and can handle sequences of any length. However, you may need to fine-tune the model to ensure it can process new encodings it hasn't seen during training. For example, the Llama 2 model uses RoPE and was trained with a maximum sequence length of 4K tokens. Code Llama, a programming-focused model fine-tuned from Llama 2, extended the context to 16K tokens with just 1000 fine-tuning steps and can extrapolate to sequences of up to 100K tokens.
Interpolation in Learned Encodings
Learned positional encodings retrieve position encoding vectors from a lookup table. This means the sequence length is fixed by the table size, making extrapolation impossible. However, interpolation can still handle sequences longer than the training length. For a sequence of length $L' > L$, you map each position $p$ to the fractional position $p' = p \cdot L / L'$ and linearly interpolate between the two nearest table entries:

$$PE(p) = (1 - w)\,E[\lfloor p' \rfloor] + w\,E[\lceil p' \rceil]$$

where $w = p' - \lfloor p' \rfloor$ and $E$ is the learned lookup table.
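One way to implement this stretching in PyTorch is to treat the table as a 1D signal and resample it with `torch.nn.functional.interpolate`. This is a minimal sketch; the function name is mine, not a standard API:

```python
import torch
import torch.nn.functional as F

def interpolate_learned_encodings(table, new_len):
    """Stretch a learned positional-encoding table of shape (train_len, dim)
    to new_len rows by linear interpolation between neighboring entries."""
    # F.interpolate expects (batch, channels, length), so move dim to channels
    stretched = F.interpolate(
        table.t().unsqueeze(0),      # (1, dim, train_len)
        size=new_len,
        mode="linear",
        align_corners=True,          # keep the first and last entries fixed
    )
    return stretched.squeeze(0).t()  # (new_len, dim)

table = torch.randn(512, 64)         # stands in for a table learned in training
longer = interpolate_learned_encodings(table, 1024)
```

With `align_corners=True`, the first and last rows of the stretched table match the original endpoints exactly; every other row is a weighted blend of two adjacent learned vectors.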
This basic interpolation is easy to implement. However, there's no guarantee the model can handle the longer sequences without performance degradation unless it is fine-tuned on them.
YaRN for Larger Context Window
RoPE is currently the most widely used positional encoding in large language models. Recent research has focused on improving RoPE’s extrapolation capabilities.
YaRN is a method that extends RoPE to handle longer sequences, and it has proven more effective than the plain interpolation described above. Recall that RoPE rotates each pair of embedding dimensions by an angle proportional to the position, using the frequencies:

$$\theta_i = 10000^{-2i/d}$$

for position $m$ and dimension pair $i = 0, \dots, d/2 - 1$, the rotation angle applied is $m\theta_i$.
In code, this is typically implemented as an inv_freq vector holding the $\theta_i$ values, whose outer product with the position indices gives the rotation angles.
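A minimal sketch of that computation (the helper name `rope_angles` is mine; real implementations typically cache the resulting cosines and sines):

```python
import torch

def rope_angles(seq_len, dim, base=10000.0):
    """Rotation angles m * theta_i for RoPE, shape (seq_len, dim/2)."""
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)  # theta_i
    positions = torch.arange(seq_len).float()                    # m
    return torch.outer(positions, inv_freq)

angles = rope_angles(seq_len=8, dim=64)
cos, sin = angles.cos(), angles.sin()  # used to rotate each query/key pair
```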
YaRN's key innovation is scaling the RoPE sinusoid frequencies unevenly when expanding the sequence length from the training length $L$ to a longer length $L'$, with scale factor $s = L'/L$.
Consider inv_freq_interpolation = inv_freq / s: applying a factor of $1/s$ to every frequency is exactly the position interpolation described earlier, but it stretches every wavelength uniformly, including the high-frequency components that encode fine-grained local order. In NTK-by-parts, instead of scaling all frequencies uniformly, each component is blended between its original and interpolated value according to its wavelength $\lambda_i = 2\pi/\theta_i$: components with wavelengths much shorter than the training length are kept unchanged, components with wavelengths longer than the training length are fully interpolated, and those in between follow a linear ramp. The blended vector then replaces inv_freq in the code.
YaRN improves upon NTK-by-parts by adding a scaling factor to the attention computation: the logits are scaled by $1/t$, where $\sqrt{1/t} = 0.1\ln(s) + 1$ (equivalently, the queries and keys are each multiplied by $0.1\ln(s) + 1$). This temperature adjustment was found empirically to reduce perplexity when extending the context window.
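Putting the pieces together, here is a sketch of the frequency blending plus the attention scaling factor. The function name is mine; the ramp thresholds `alpha=1` and `beta=32` follow the values the YaRN paper suggests for Llama-style models, and other settings (base, training length) are placeholder assumptions:

```python
import math
import torch

def yarn_inv_freq(dim, base=10000.0, train_len=4096, scale=4.0,
                  alpha=1.0, beta=32.0):
    """Sketch of YaRN's NTK-by-parts frequency scaling."""
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)
    inv_freq_interp = inv_freq / scale          # plain position interpolation

    # Ratio of training length to each component's wavelength 2*pi/theta_i
    wavelength = 2 * math.pi / inv_freq
    r = train_len / wavelength

    # Ramp: 0 -> fully interpolate (r < alpha), 1 -> keep original (r > beta)
    ramp = ((r - alpha) / (beta - alpha)).clamp(0.0, 1.0)
    blended = inv_freq_interp * (1 - ramp) + inv_freq * ramp

    # YaRN's attention scaling factor sqrt(1/t) = 0.1*ln(s) + 1
    mscale = 0.1 * math.log(scale) + 1.0
    return blended, mscale

blended, mscale = yarn_inv_freq(dim=64)
```

The highest-frequency components (short wavelengths, large $r$) come out unchanged, while the lowest-frequency components are divided by the full scale factor, matching the by-parts behavior described above.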
Further Readings
Below are some papers that are related to the topic:
- Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding
- Code Llama: Open Foundation Models for Code
- Mesa-Extrapolation: A Weave Position Encoding Method for Enhanced Extrapolation in LLMs
- YaRN: Efficient Context Window Extension of Large Language Models
Summary
In this post, you learned how models trained with shorter context lengths can process longer input sequences. Specifically:
- Sinusoidal encodings and RoPE can be easily extrapolated
- Learned encodings only support interpolation
- YaRN provides an advanced method for scaling RoPE to longer sequence lengths
The goal of scaling positional encodings is to enable models to handle longer input sequences without retraining. This is not an exhaustive list, as more advanced methods continue to build upon these foundational ideas.
