Transformer models are trained with a fixed sequence length, but during inference, they may need to process sequences of different lengths. This poses a challenge because positional encodings are computed based on the sequence length. The model might struggle with positional encodings it hasn’t encountered during training.
The ability to handle varying sequence lengths is crucial for a model. This post explores how different positional encoding methods address this challenge.
Kick-start your project with my book Building Transformer Models From Scratch with PyTorch. It provides self-study tutorials with working code.
Let’s get started.

Interpolation in Positional Encodings and Using YaRN for Larger Context Window
Photo by enkuu smile_. Some rights reserved.
Overview
This post is divided into three parts; they are:
- Interpolation and Extrapolation in Sinusoidal Encodings and RoPE
- Interpolation in Learned Encodings
- YaRN for Larger Context Window
Interpolation and Extrapolation in Sinusoidal Encodings and RoPE
Sinusoidal encodings excel at extrapolation due to their use of continuous functions:

$$PE(p, 2i) = \sin\left(\frac{p}{10000^{2i/d}}\right), \qquad PE(p, 2i+1) = \cos\left(\frac{p}{10000^{2i/d}}\right)$$

You can simply substitute any position $p$ into these formulas, including positions beyond the training length, and obtain a well-defined encoding.

Alternatively, you can use interpolation. Instead of using the position $p$ directly, you evaluate the encoding at the scaled position

$$p' = p \cdot \frac{L}{L'}$$

where $L$ is the maximum sequence length seen during training and $L'$ is the longer target length. This squeezes the new positions into the range the model has already seen.
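The extrapolation and interpolation options can be sketched in a few lines of NumPy. The function and variable names here are illustrative, not from any particular library:

```python
import numpy as np

def sinusoidal_encoding(positions, dim, base=10000.0):
    """Sinusoidal positional encoding for the given (possibly fractional) positions."""
    i = np.arange(dim // 2)                # dimension-pair index
    freq = base ** (-2 * i / dim)          # 1 / 10000^(2i/d)
    angles = np.outer(positions, freq)     # (num_positions, dim/2)
    pe = np.empty((len(positions), dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

train_len, new_len, dim = 512, 1024, 64

# Extrapolation: evaluate the formula directly at the new, larger positions
pe_extrapolated = sinusoidal_encoding(np.arange(new_len), dim)

# Interpolation: squeeze positions 0..new_len-1 into the trained range 0..train_len
scaled_positions = np.arange(new_len) * (train_len / new_len)
pe_interpolated = sinusoidal_encoding(scaled_positions, dim)
```

Because the encoding is a continuous function of position, fractional positions like 1.5 are just as valid as integer ones, which is what makes interpolation possible without any change to the formula.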
These techniques also apply to RoPE.
The function generating sinusoidal positional encodings or RoPE requires no modification and can handle sequences of any length. However, you may need to fine-tune the model to ensure it can process new encodings it hasn't seen during training. For example, the Llama 2 model uses RoPE and was trained with a maximum sequence length of 4K tokens. Code Llama, a programming-focused model fine-tuned from Llama 2, extended the context to 16K tokens with just 1000 fine-tuning steps and can extrapolate to sequences of up to 100K tokens.
Interpolation in Learned Encodings
Learned positional encodings retrieve position encoding vectors from a lookup table. This means the sequence length is fixed by the table size, making extrapolation impossible. However, interpolation can still handle sequences longer than the training length. For a sequence of length $L' > L$, you map each position $p$ to the fractional position $p' = p \cdot L / L'$ and linearly interpolate between the two nearest table entries:

$$PE(p) = (1 - w)\,E[\lfloor p' \rfloor] + w\,E[\lceil p' \rceil]$$

where $w = p' - \lfloor p' \rfloor$ and $E$ is the learned lookup table.
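One way to implement this stretching in PyTorch is to treat the table as a 1D signal and resample it with `torch.nn.functional.interpolate`. This is a minimal sketch; the function name is mine, not a standard API:

```python
import torch
import torch.nn.functional as F

def interpolate_learned_encodings(table, new_len):
    """Stretch a learned positional-encoding table of shape (train_len, dim)
    to new_len rows by linear interpolation between neighboring entries."""
    # F.interpolate expects (batch, channels, length), so move dim to channels
    stretched = F.interpolate(
        table.t().unsqueeze(0),      # (1, dim, train_len)
        size=new_len,
        mode="linear",
        align_corners=True,          # keep the first and last entries fixed
    )
    return stretched.squeeze(0).t()  # (new_len, dim)

table = torch.randn(512, 64)         # stands in for a table learned in training
longer = interpolate_learned_encodings(table, 1024)
```

With `align_corners=True`, the first and last rows of the stretched table match the original endpoints exactly; every other row is a weighted blend of two adjacent learned vectors.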
This basic interpolation is easy to implement. However, there's no guarantee the model can handle the longer sequences without performance degradation unless it is fine-tuned on them.
YaRN for Larger Context Window
RoPE is currently the most widely used positional encoding in large language models. Recent research has focused on improving RoPE’s extrapolation capabilities.
YaRN is a method that extends RoPE to handle longer sequences, and it has proven more effective than the plain interpolation described above. Recall that RoPE rotates each pair of embedding dimensions by an angle proportional to the position, using the frequencies:

$$\theta_i = 10000^{-2i/d}$$

for position $m$ and dimension pair $i = 0, \dots, d/2 - 1$, the rotation angle applied is $m\theta_i$.
In code, this is typically implemented as an inv_freq vector holding the $\theta_i$ values, whose outer product with the position indices gives the rotation angles.
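A minimal sketch of that computation (the helper name `rope_angles` is mine; real implementations typically cache the resulting cosines and sines):

```python
import torch

def rope_angles(seq_len, dim, base=10000.0):
    """Rotation angles m * theta_i for RoPE, shape (seq_len, dim/2)."""
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)  # theta_i
    positions = torch.arange(seq_len).float()                    # m
    return torch.outer(positions, inv_freq)

angles = rope_angles(seq_len=8, dim=64)
cos, sin = angles.cos(), angles.sin()  # used to rotate each query/key pair
```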
YaRN's key innovation is scaling the RoPE sinusoid frequencies unevenly when expanding the sequence length from the training length $L$ to a longer length $L'$, with scale factor $s = L'/L$.
Consider inv_freq_interpolation = inv_freq / s: applying a factor of $1/s$ to every frequency is exactly the position interpolation described earlier, but it stretches every wavelength uniformly, including the high-frequency components that encode fine-grained local order. In NTK-by-parts, instead of scaling all frequencies uniformly, each component is blended between its original and interpolated value according to its wavelength $\lambda_i = 2\pi/\theta_i$: components with wavelengths much shorter than the training length are kept unchanged, components with wavelengths longer than the training length are fully interpolated, and those in between follow a linear ramp. The blended vector then replaces inv_freq in the code.
YaRN improves upon NTK-by-parts by adding a scaling factor to the attention computation: the logits are scaled by $1/t$, where $\sqrt{1/t} = 0.1\ln(s) + 1$ (equivalently, the queries and keys are each multiplied by $0.1\ln(s) + 1$). This temperature adjustment was found empirically to reduce perplexity when extending the context window.
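Putting the pieces together, here is a sketch of the frequency blending plus the attention scaling factor. The function name is mine; the ramp thresholds `alpha=1` and `beta=32` follow the values the YaRN paper suggests for Llama-style models, and other settings (base, training length) are placeholder assumptions:

```python
import math
import torch

def yarn_inv_freq(dim, base=10000.0, train_len=4096, scale=4.0,
                  alpha=1.0, beta=32.0):
    """Sketch of YaRN's NTK-by-parts frequency scaling."""
    inv_freq = base ** (-torch.arange(0, dim, 2).float() / dim)
    inv_freq_interp = inv_freq / scale          # plain position interpolation

    # Ratio of training length to each component's wavelength 2*pi/theta_i
    wavelength = 2 * math.pi / inv_freq
    r = train_len / wavelength

    # Ramp: 0 -> fully interpolate (r < alpha), 1 -> keep original (r > beta)
    ramp = ((r - alpha) / (beta - alpha)).clamp(0.0, 1.0)
    blended = inv_freq_interp * (1 - ramp) + inv_freq * ramp

    # YaRN's attention scaling factor sqrt(1/t) = 0.1*ln(s) + 1
    mscale = 0.1 * math.log(scale) + 1.0
    return blended, mscale

blended, mscale = yarn_inv_freq(dim=64)
```

The highest-frequency components (short wavelengths, large $r$) come out unchanged, while the lowest-frequency components are divided by the full scale factor, matching the by-parts behavior described above.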
Further Readings
Below are some papers that are related to the topic:
- Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding
- Code Llama: Open Foundation Models for Code
- Mesa-Extrapolation: A Weave Position Encoding Method for Enhanced Extrapolation in LLMs
- YaRN: Efficient Context Window Extension of Large Language Models
Summary
In this post, you learned how models trained with shorter context lengths can process longer input sequences. Specifically:
- Sinusoidal encodings and RoPE can be easily extrapolated
- Learned encodings only support interpolation
- YaRN provides an advanced method for scaling RoPE to longer sequence lengths
The goal of scaling positional encodings is to enable models to handle longer input sequences without retraining. This is not an exhaustive list, as more advanced methods continue to build upon these foundational ideas.
