Having familiarized ourselves with the theory behind the Transformer model and its attention mechanism, we’ll start our journey of implementing a complete Transformer model by first seeing how to implement the scaled dot-product attention. The scaled dot-product attention is an integral part of the multi-head attention, which, in turn, is an important component of both the Transformer encoder and decoder. Our end goal will be to apply the complete Transformer model to Natural Language Processing (NLP).
In this tutorial, you will discover how to implement scaled dot-product attention from scratch in TensorFlow and Keras.
After completing this tutorial, you will know:
- The operations that form part of the scaled dot-product attention mechanism
- How to implement the scaled dot-product attention mechanism from scratch
Tutorial Overview
This tutorial is divided into three parts; they are:
- Recap of the Transformer Architecture
- The Transformer Scaled Dot-Product Attention
- Implementing the Scaled Dot-Product Attention From Scratch
- Testing Out the Code
Prerequisites
For this tutorial, we assume that you are already familiar with:
- The concept of attention
- The attention mechanism
- The Transformer attention mechanism
- The Transformer model
Recap of the Transformer Architecture
Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture
Taken from “Attention Is All You Need“
In generating an output sequence, the Transformer does not rely on recurrence and convolutions.
You have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. One of the core components that both the encoder and decoder share within their multi-head attention blocks is the scaled dot-product attention.
The Transformer Scaled Dot-Product Attention
First, recall the queries, keys, and values as the important components you will work with.
In the encoder stage, they each carry the same input sequence after this has been embedded and augmented by positional information. Similarly, on the decoder side, the queries, keys, and values fed into the first attention block represent the same target sequence after this would have also been embedded and augmented by positional information. The second attention block of the decoder receives the encoder output in the form of keys and values and the normalized output of the first attention block as the queries. The dimensionality of the queries and keys is denoted by d_k, while the dimensionality of the values is denoted by d_v.
The scaled dot-product attention receives these queries, keys, and values as inputs and first computes the dot product of the queries with the keys. The result is subsequently scaled by the square root of d_k, producing the attention scores. These are then passed through a softmax function to obtain a set of attention weights, which finally weight the values through another dot-product operation. In compact form:

attention(Q, K, V) = softmax(QK^T / √d_k) V
Each multi-head attention block in the Transformer model implements a scaled dot-product attention operation as shown below:

Scaled dot-product attention and multi-head attention
Taken from “Attention Is All You Need“
You may note that the scaled dot-product attention can also apply a mask to the attention scores before feeding them into the softmax function.
Since the word embeddings are zero-padded to a specific sequence length, a padding mask needs to be introduced in order to prevent the zero tokens from being processed along with the input in both the encoder and decoder stages. Furthermore, a look-ahead mask is also required to prevent the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it.
These look-ahead and padding masks are applied inside the scaled dot-product attention, where all values to be masked out are set to -∞ (in practice, a very large negative number) so that they map to values close to zero once passed through the softmax function.
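As an illustration only (building these masks is not part of the attention layer itself, and you will leave the mask at its default value later in this tutorial), a padding mask and a look-ahead mask could be sketched along these lines:

```python
from tensorflow import math, cast, float32, linalg, ones

def padding_mask(input):
    # Mark the zero-padded positions in the input with 1, all other positions with 0
    return cast(math.equal(input, 0), float32)

def lookahead_mask(shape):
    # Upper-triangular matrix of 1s masks out the "future" positions in the sequence
    return 1 - linalg.band_part(ones((shape, shape)), -1, 0)
```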
For the time being, let’s see how to implement the scaled dot-product attention from scratch in TensorFlow and Keras.
Implementing the Scaled Dot-Product Attention from Scratch
For this purpose, you will create a class called DotProductAttention that inherits from the Layer base class in Keras.
In it, you will create the class method, call(), that takes as input arguments the queries, keys, and values, as well as the dimensionality, d_k, and a mask that defaults to None:
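A minimal skeleton of this class might look as follows (a sketch assuming TensorFlow 2.x; the body of call() will be filled in over the next few steps):

```python
from tensorflow.keras.layers import Layer

class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        ...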
The first step is to perform a dot-product operation between the queries and the keys, transposing the latter. The result will be scaled through a division by the square root of d_k. You will add the following line of code to the call() class method:
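A single line along these lines could perform the scoring and scaling (assuming matmul, math, cast, and float32 have been imported from tensorflow):

```python
scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))
```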
Next, you will check whether the mask argument has been set to a value that is not the default None.
The mask will contain either 0 values, to indicate that the corresponding token in the input sequence should be considered in the computations, or 1 values, to indicate otherwise. The mask is multiplied by -1e9 to turn the 1 values into large negative numbers (remember having mentioned this in the previous section), and the result is then added to the attention scores:
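This check and masking step can be sketched as follows:

```python
if mask is not None:
    scores += -1e9 * mask
```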
The attention scores will then be passed through a softmax function to generate the attention weights:
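Assuming softmax has been imported from tensorflow.keras.backend, this step might read:

```python
weights = softmax(scores)
```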
The final step weights the values with the computed attention weights through another dot-product operation:
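In code:

```python
return matmul(weights, values)
```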
The complete code listing is as follows:
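One possible version of the complete layer, under the assumptions above, is the following sketch:

```python
from tensorflow import matmul, math, cast, float32
from tensorflow.keras.layers import Layer
from tensorflow.keras.backend import softmax

# Scaled dot-product attention implemented as a Keras layer
class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Score the queries against the keys after transposing the latter, and scale
        scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))

        # Apply a mask to the attention scores, if one is provided
        if mask is not None:
            scores += -1e9 * mask

        # Compute the attention weights with a softmax operation
        weights = softmax(scores)

        # Compute the attention output as a weighted sum of the value vectors
        return matmul(weights, values)
```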
Testing Out the Code
You will be working with the parameter values specified in the paper, Attention Is All You Need, by Vaswani et al. (2017):
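The values of d_k and d_v below follow the paper; the batch size is simply a convenient choice for this test rather than a value prescribed by the paper:

```python
d_k = 64  # dimensionality of the linearly projected queries and keys
d_v = 64  # dimensionality of the linearly projected values
batch_size = 64  # batch size chosen for this test
```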
As for the sequence length and the queries, keys, and values, you will be working with dummy data for the time being until you arrive at the stage of training the complete Transformer model in a separate tutorial, at which point you will use actual sentences. Similarly, for the mask, leave it set to its default value for the time being:
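For instance, random dummy data could be generated as follows (cast to float32 to match the layer's default compute dtype; input_seq_length = 5 is an arbitrary choice):

```python
from numpy import random

input_seq_length = 5  # arbitrary maximum length of the input sequence

queries = random.random((batch_size, input_seq_length, d_k)).astype('float32')
keys = random.random((batch_size, input_seq_length, d_k)).astype('float32')
values = random.random((batch_size, input_seq_length, d_v)).astype('float32')
```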
In the complete Transformer model, values for the sequence length and the queries, keys, and values will be obtained through a process of word tokenization and embedding. You will be covering this in a separate tutorial.
Returning to the testing procedure, the next step is to create a new instance of the DotProductAttention class, assigning its output to the attention variable:
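For instance:

```python
attention = DotProductAttention()
```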
Since the DotProductAttention class inherits from the Layer base class, the call() method of the former will be automatically invoked by the magic __call__() method of the latter. The final step is to feed in the input arguments and print the result:
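For example:

```python
print(attention(queries, keys, values, d_k))
```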
Tying everything together produces the following code listing:
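A complete sketch of the test script, combining the class definition and the dummy data above, might then read:

```python
from numpy import random
from tensorflow import matmul, math, cast, float32
from tensorflow.keras.layers import Layer
from tensorflow.keras.backend import softmax

# Scaled dot-product attention implemented as a Keras layer
class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Score the queries against the keys after transposing the latter, and scale
        scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))

        # Apply a mask to the attention scores, if one is provided
        if mask is not None:
            scores += -1e9 * mask

        # Compute the attention weights with a softmax operation
        weights = softmax(scores)

        # Compute the attention output as a weighted sum of the value vectors
        return matmul(weights, values)

d_k = 64  # dimensionality of the linearly projected queries and keys
d_v = 64  # dimensionality of the linearly projected values
batch_size = 64  # batch size chosen for this test
input_seq_length = 5  # arbitrary maximum length of the input sequence

# Dummy queries, keys, and values (float32 to match the layer's compute dtype)
queries = random.random((batch_size, input_seq_length, d_k)).astype('float32')
keys = random.random((batch_size, input_seq_length, d_k)).astype('float32')
values = random.random((batch_size, input_seq_length, d_v)).astype('float32')

attention = DotProductAttention()
print(attention(queries, keys, values, d_k))
```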
Running this code produces an output of shape (batch size, sequence length, values dimensionality). Note that you will likely see a different output due to the random initialization of the queries, keys, and values.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Papers
- Attention Is All You Need, 2017
Summary
In this tutorial, you discovered how to implement scaled dot-product attention from scratch in TensorFlow and Keras.
Specifically, you learned:
- The operations that form part of the scaled dot-product attention mechanism
- How to implement the scaled dot-product attention mechanism from scratch
Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.
