Before the introduction of the Transformer model, attention for neural machine translation was implemented with RNN-based encoder-decoder architectures. The Transformer model revolutionized the implementation of attention by dispensing with recurrence and convolutions and relying instead solely on a self-attention mechanism.
We will first focus on the Transformer attention mechanism in this tutorial and subsequently review the Transformer model in a separate one.
In this tutorial, you will discover the Transformer attention mechanism for neural machine translation.
After completing this tutorial, you will know:
- How the Transformer attention differed from its predecessors
- How the Transformer computes a scaled dot-product attention
- How the Transformer computes multi-head attention
Let’s get started.

Tutorial Overview
This tutorial is divided into two parts; they are:
- Introduction to the Transformer Attention
- The Transformer Attention
  - Scaled Dot-Product Attention
  - Multi-Head Attention
Prerequisites
For this tutorial, we assume that you are already familiar with:
- The concept of attention
- The attention mechanism
- The Bahdanau attention mechanism
- The Luong attention mechanism
Introduction to the Transformer Attention
Thus far, you have familiarized yourself with using an attention mechanism in conjunction with an RNN-based encoder-decoder architecture. Two of the most popular models that implement attention in this manner have been those proposed by Bahdanau et al. (2014) and Luong et al. (2015).
The Transformer architecture revolutionized the use of attention by dispensing with recurrence and convolutions, on which these earlier models had extensively relied.
… the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.
– Attention Is All You Need, 2017.
In their paper, “Attention Is All You Need,” Vaswani et al. (2017) explain that the Transformer model instead relies solely on self-attention, where the representation of a sequence (or sentence) is computed by relating different words in the same sequence.
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
– Attention Is All You Need, 2017.
The Transformer Attention
The main components used by the Transformer attention are the following:
- $\mathbf{q}$ and $\mathbf{k}$, denoting vectors of dimension $d_k$, containing the queries and keys, respectively
- $\mathbf{v}$, denoting a vector of dimension $d_v$, containing the values
- $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, denoting matrices packing together sets of queries, keys, and values, respectively
- $\mathbf{W}^Q$, $\mathbf{W}^K$, and $\mathbf{W}^V$, denoting projection matrices that are used in generating different subspace representations of the query, key, and value matrices
- $\mathbf{W}^O$, denoting a projection matrix for the multi-head output
In essence, the attention function can be considered a mapping between a query and a set of key-value pairs to an output.
The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
– Attention Is All You Need, 2017.
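Concretely, for a single query $\mathbf{q}$ and a set of key-value pairs $(\mathbf{k}_i, \mathbf{v}_i)$, this mapping takes the form (writing $\alpha_i$ for the weights, a notation introduced here for illustration):

$$\text{output} = \sum_i \alpha_i \mathbf{v}_i$$

where each weight $\alpha_i$ is computed by a compatibility function of the query $\mathbf{q}$ with the corresponding key $\mathbf{k}_i$. In the Transformer, as described next, this compatibility function is a scaled dot product followed by a softmax.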
Vaswani et al. propose a scaled dot-product attention and then build on it to propose multi-head attention. Within the context of neural machine translation, the query, keys, and values that are used as inputs to these attention mechanisms are different projections of the same input sentence.
Intuitively, therefore, the proposed attention mechanisms implement self-attention by capturing the relationships between the different elements (in this case, the words) of the same sentence.
Scaled Dot-Product Attention
The Transformer implements a scaled dot-product attention, which follows the procedure of the general attention mechanism that you had previously seen.
As the name suggests, the scaled dot-product attention first computes a dot product for each query, $\mathbf{q}$, with all of the keys, $\mathbf{k}$. It then divides each result by $\sqrt{d_k}$ and applies a softmax function, thereby obtaining the weights that are used to scale the values, $\mathbf{v}$.

Figure: Scaled dot-product attention. Taken from “Attention Is All You Need.”
In practice, the computations performed by the scaled dot-product attention can be efficiently applied to the entire set of queries simultaneously. In order to do so, the matrices $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are supplied as inputs to the attention function:

$$\text{attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$
Vaswani et al. explain that their scaled dot-product attention is identical to the multiplicative attention of Luong et al. (2015), except for the added scaling factor of $\frac{1}{\sqrt{d_k}}$.
This scaling factor was introduced to counteract the effect of having the dot products grow large in magnitude for large values of $d_k$, where the application of the softmax function would then return extremely small gradients.
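A quick way to see this effect is a small NumPy experiment (the sizes here are illustrative choices, not from the paper): as $d_k$ grows, the unscaled dot products push the softmax toward a near one-hot distribution, which is exactly the regime where its gradients become tiny.

```python
import numpy as np

rng = np.random.default_rng(42)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Compare the largest softmax weight with and without the 1/sqrt(d_k) scaling
for d_k in (4, 64, 1024):
    q = rng.standard_normal(d_k)        # one query
    K = rng.standard_normal((10, d_k))  # ten keys
    scores = K @ q                      # raw dot products
    print(d_k, softmax(scores).max().round(3),
          softmax(scores / np.sqrt(d_k)).max().round(3))
```

Running this shows the unscaled softmax saturating toward a maximum weight of 1.0 as $d_k$ increases, while the scaled version stays comparatively spread out.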
Vaswani et al. further explain that their choice of opting for multiplicative attention instead of the additive attention of Bahdanau et al. (2014) was based on the computational efficiency associated with the former.
… dot-product attention is much faster and more space-efficient in practice since it can be implemented using highly optimized matrix multiplication code.
– Attention Is All You Need, 2017.
Therefore, the step-by-step procedure for computing the scaled dot-product attention is the following:
- Compute the alignment scores by multiplying the set of queries packed in the matrix, $\mathbf{Q}$, with the keys in the matrix, $\mathbf{K}$. If the matrix $\mathbf{Q}$ is of size $m \times d_k$ and the matrix $\mathbf{K}$ is of size $n \times d_k$, then the resulting matrix will be of size $m \times n$:

$$\mathbf{Q}\mathbf{K}^T$$

- Scale each of the alignment scores by $\frac{1}{\sqrt{d_k}}$:

$$\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}$$

- Follow the scaling process by applying a softmax operation in order to obtain a set of weights:

$$\text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)$$

- Finally, apply the resulting weights to the values in the matrix, $\mathbf{V}$, of size $n \times d_v$:

$$\text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$
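The four steps above map directly onto a few lines of NumPy. This is a minimal sketch for illustration; the function name and the sizes $m$, $n$, $d_k$, $d_v$ below are my own choices, not taken from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (m, d_k), K: (n, d_k), V: (n, d_v) -> output of shape (m, d_v)."""
    d_k = Q.shape[-1]
    # Steps 1-2: alignment scores QK^T, scaled by 1/sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)                  # shape (m, n)
    # Step 3: row-wise softmax turns the scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Step 4: weighted sum of the values
    return weights @ V

# Illustrative sizes: 5 queries, 7 key-value pairs
rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 64))
K = rng.standard_normal((7, 64))
V = rng.standard_normal((7, 64))
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 64)
```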
Multi-Head Attention
Building on their single attention function that takes matrices, $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, as input, as you have just seen, Vaswani et al. also propose a multi-head attention mechanism.
Their multi-head attention mechanism linearly projects the queries, keys, and values $h$ times, using a different learned projection each time. The single attention mechanism is then applied to each of these $h$ projections in parallel to produce $h$ outputs, which, in turn, are concatenated and projected again to produce a final result.

Figure: Multi-head attention. Taken from “Attention Is All You Need.”
The idea behind multi-head attention is to allow the attention function to extract information from different representation subspaces, which would otherwise be impossible with a single attention head.
The multi-head attention function can be represented as follows:

$$\text{multihead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{concat}(\text{head}_1, \dots, \text{head}_h) \, \mathbf{W}^O$$

Here, each $\text{head}_i$, $i = 1, \dots, h$, implements a single attention function characterized by its own learned projection matrices:

$$\text{head}_i = \text{attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$$
The step-by-step procedure for computing multi-head attention is, therefore, the following:
- Compute the linearly projected versions of the queries, keys, and values through multiplication with the respective weight matrices, $\mathbf{W}_i^Q$, $\mathbf{W}_i^K$, and $\mathbf{W}_i^V$, one for each $\text{head}_i$.
- Apply the single attention function for each head by (1) multiplying the queries and keys matrices, (2) applying the scaling and softmax operations, and (3) weighting the values matrix to generate an output for each head.
- Concatenate the outputs of the heads,
, .
- Apply a linear projection to the concatenated output through multiplication with the weight matrix, $\mathbf{W}^O$, to generate the final result.
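Here is a minimal NumPy sketch of those steps. The function names are my own, and the sizes follow the paper's base configuration ($h = 8$, $d_{model} = 512$, and $d_k = d_v = d_{model}/h = 64$):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # The scaled dot-product attention from the previous section
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: lists of h projection matrices; W_o: (h*d_v, d_model)."""
    # Steps 1-2: project Q, K, V for each head and apply the single attention
    heads = [attention(Q @ Wq, K @ Wk, V @ Wv)
             for Wq, Wk, Wv in zip(W_q, W_k, W_v)]
    # Steps 3-4: concatenate the head outputs and project with W_o
    return np.concatenate(heads, axis=-1) @ W_o

h, d_model, d_k, d_v = 8, 512, 64, 64
rng = np.random.default_rng(1)
W_q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_k = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_v = [rng.standard_normal((d_model, d_v)) for _ in range(h)]
W_o = rng.standard_normal((h * d_v, d_model))

X = rng.standard_normal((5, d_model))   # a toy sequence of 5 token embeddings
print(multi_head_attention(X, X, X, W_q, W_k, W_v, W_o).shape)  # (5, 512)
```

Note that the same input, X, is passed as the queries, keys, and values here, which is precisely what makes this self-attention: every position attends over every other position of the same sequence.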
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Papers
- Attention Is All You Need, 2017.
- Neural Machine Translation by Jointly Learning to Align and Translate, 2014.
- Effective Approaches to Attention-based Neural Machine Translation, 2015.
Summary
In this tutorial, you discovered the Transformer attention mechanism for neural machine translation.
Specifically, you learned:
- How the Transformer attention differed from its predecessors.
- How the Transformer computes a scaled dot-product attention.
- How the Transformer computes multi-head attention.
Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.
