Having familiarized ourselves with the theory behind the Transformer model and its attention mechanism, we’ll start our journey of implementing a complete Transformer model by first seeing how to implement the scaled dot-product attention. The scaled dot-product attention is an integral part of the multi-head attention, which, in turn, is an important component of both the Transformer encoder and decoder. Our end goal will be to apply the complete Transformer model to Natural Language Processing (NLP).
In this tutorial, you will discover how to implement scaled dot-product attention from scratch in TensorFlow and Keras.
After completing this tutorial, you will know:
- The operations that form part of the scaled dot-product attention mechanism
- How to implement the scaled dot-product attention mechanism from scratch
Tutorial Overview
This tutorial is divided into three parts; they are:
- Recap of the Transformer Architecture
- The Transformer Scaled Dot-Product Attention
- Implementing the Scaled Dot-Product Attention From Scratch
- Testing Out the Code
Prerequisites
For this tutorial, we assume that you are already familiar with:
- The concept of attention
- The attention mechanism
- The Transformer attention mechanism
- The Transformer model
Recap of the Transformer Architecture
Recall having seen that the Transformer architecture follows an encoder-decoder structure. The encoder, on the left-hand side, is tasked with mapping an input sequence to a sequence of continuous representations; the decoder, on the right-hand side, receives the output of the encoder together with the decoder output at the previous time step to generate an output sequence.

The encoder-decoder structure of the Transformer architecture
Taken from “Attention Is All You Need“
In generating an output sequence, the Transformer does not rely on recurrence and convolutions.
You have seen that the decoder part of the Transformer shares many similarities in its architecture with the encoder. One of the core components that both the encoder and decoder share within their multi-head attention blocks is the scaled dot-product attention.
The Transformer Scaled Dot-Product Attention
First, recall the queries, keys, and values as the important components you will work with.
In the encoder stage, they each carry the same input sequence after this has been embedded and augmented by positional information. Similarly, on the decoder side, the queries, keys, and values fed into the first attention block represent the same target sequence after this would have also been embedded and augmented by positional information. The second attention block of the decoder receives the encoder output in the form of keys and values and the normalized output of the first attention block as the queries. The dimensionality of the queries and keys is denoted by d_k, while the dimensionality of the values is denoted by d_v.
The scaled dot-product attention receives these queries, keys, and values as inputs and first computes the dot product of the queries with the keys. The result is subsequently scaled by the square root of d_k, producing the attention scores. These are then passed through a softmax function to obtain a set of attention weights, which finally weight the values through another dot-product operation. In compact form:

attention(Q, K, V) = softmax(QK^T / √d_k) V
Each multi-head attention block in the Transformer model implements a scaled dot-product attention operation as shown below:

Scaled dot-product attention and multi-head attention
Taken from “Attention Is All You Need“
You may note that the scaled dot-product attention can also apply a mask to the attention scores before feeding them into the softmax function.
Since the word embeddings are zero-padded to a specific sequence length, a padding mask needs to be introduced in order to prevent the zero tokens from being processed along with the input in both the encoder and decoder stages. Furthermore, a look-ahead mask is also required to prevent the decoder from attending to succeeding words, such that the prediction for a particular word can only depend on known outputs for the words that come before it.
These look-ahead and padding masks are applied inside the scaled dot-product attention, where all values to be masked out are set to -∞ (in practice, a very large negative number) so that they map to values close to zero once passed through the softmax function.
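As an illustration only (building these masks is not part of the attention layer itself, and you will leave the mask at its default value later in this tutorial), a padding mask and a look-ahead mask could be sketched along these lines:

```python
from tensorflow import math, cast, float32, linalg, ones

def padding_mask(input):
    # Mark the zero-padded positions in the input with 1, all other positions with 0
    return cast(math.equal(input, 0), float32)

def lookahead_mask(shape):
    # Upper-triangular matrix of 1s masks out the "future" positions in the sequence
    return 1 - linalg.band_part(ones((shape, shape)), -1, 0)
```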
For the time being, let’s see how to implement the scaled dot-product attention from scratch in TensorFlow and Keras.
Implementing the Scaled Dot-Product Attention from Scratch
For this purpose, you will create a class called DotProductAttention that inherits from the Layer base class in Keras.
In it, you will create the class method, call(), that takes as input arguments the queries, keys, and values, as well as the dimensionality, d_k, and a mask that defaults to None:
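A minimal skeleton of this class might look as follows (a sketch assuming TensorFlow 2.x; the body of call() will be filled in over the next few steps):

```python
from tensorflow.keras.layers import Layer

class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        ...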
The first step is to perform a dot-product operation between the queries and the keys, transposing the latter. The result will be scaled through a division by the square root of d_k. You will add the following line of code to the call() class method:
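A single line along these lines could perform the scoring and scaling (assuming matmul, math, cast, and float32 have been imported from tensorflow):

```python
scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))
```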
Next, you will check whether the mask argument has been set to a value that is not the default None.
The mask will contain either 0 values, to indicate that the corresponding token in the input sequence should be considered in the computations, or 1 values, to indicate otherwise. The mask is multiplied by -1e9 to turn the 1 values into large negative numbers (remember having mentioned this in the previous section), and the result is then added to the attention scores:
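This check and masking step can be sketched as follows:

```python
if mask is not None:
    scores += -1e9 * mask
```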
The attention scores will then be passed through a softmax function to generate the attention weights:
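Assuming softmax has been imported from tensorflow.keras.backend, this step might read:

```python
weights = softmax(scores)
```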
The final step weights the values with the computed attention weights through another dot-product operation:
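In code:

```python
return matmul(weights, values)
```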
The complete code listing is as follows:
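One possible version of the complete layer, under the assumptions above, is the following sketch:

```python
from tensorflow import matmul, math, cast, float32
from tensorflow.keras.layers import Layer
from tensorflow.keras.backend import softmax

# Scaled dot-product attention implemented as a Keras layer
class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Score the queries against the keys after transposing the latter, and scale
        scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))

        # Apply a mask to the attention scores, if one is provided
        if mask is not None:
            scores += -1e9 * mask

        # Compute the attention weights with a softmax operation
        weights = softmax(scores)

        # Compute the attention output as a weighted sum of the value vectors
        return matmul(weights, values)
```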
Testing Out the Code
You will be working with the parameter values specified in the paper, Attention Is All You Need, by Vaswani et al. (2017):
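The values of d_k and d_v below follow the paper; the batch size is simply a convenient choice for this test rather than a value prescribed by the paper:

```python
d_k = 64  # dimensionality of the linearly projected queries and keys
d_v = 64  # dimensionality of the linearly projected values
batch_size = 64  # batch size chosen for this test
```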
As for the sequence length and the queries, keys, and values, you will be working with dummy data for the time being until you arrive at the stage of training the complete Transformer model in a separate tutorial, at which point you will use actual sentences. Similarly, for the mask, leave it set to its default value for the time being:
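For instance, random dummy data could be generated as follows (cast to float32 to match the layer's default compute dtype; input_seq_length = 5 is an arbitrary choice):

```python
from numpy import random

input_seq_length = 5  # arbitrary maximum length of the input sequence

queries = random.random((batch_size, input_seq_length, d_k)).astype('float32')
keys = random.random((batch_size, input_seq_length, d_k)).astype('float32')
values = random.random((batch_size, input_seq_length, d_v)).astype('float32')
```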
In the complete Transformer model, values for the sequence length and the queries, keys, and values will be obtained through a process of word tokenization and embedding. You will be covering this in a separate tutorial.
Returning to the testing procedure, the next step is to create a new instance of the DotProductAttention class, assigning its output to the attention variable:
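For instance:

```python
attention = DotProductAttention()
```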
Since the DotProductAttention class inherits from the Layer base class, the call() method of the former will be automatically invoked by the magic __call__() method of the latter. The final step is to feed in the input arguments and print the result:
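For example:

```python
print(attention(queries, keys, values, d_k))
```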
Tying everything together produces the following code listing:
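A complete sketch of the test script, combining the class definition and the dummy data above, might then read:

```python
from numpy import random
from tensorflow import matmul, math, cast, float32
from tensorflow.keras.layers import Layer
from tensorflow.keras.backend import softmax

# Scaled dot-product attention implemented as a Keras layer
class DotProductAttention(Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def call(self, queries, keys, values, d_k, mask=None):
        # Score the queries against the keys after transposing the latter, and scale
        scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))

        # Apply a mask to the attention scores, if one is provided
        if mask is not None:
            scores += -1e9 * mask

        # Compute the attention weights with a softmax operation
        weights = softmax(scores)

        # Compute the attention output as a weighted sum of the value vectors
        return matmul(weights, values)

d_k = 64  # dimensionality of the linearly projected queries and keys
d_v = 64  # dimensionality of the linearly projected values
batch_size = 64  # batch size chosen for this test
input_seq_length = 5  # arbitrary maximum length of the input sequence

# Dummy queries, keys, and values (float32 to match the layer's compute dtype)
queries = random.random((batch_size, input_seq_length, d_k)).astype('float32')
keys = random.random((batch_size, input_seq_length, d_k)).astype('float32')
values = random.random((batch_size, input_seq_length, d_v)).astype('float32')

attention = DotProductAttention()
print(attention(queries, keys, values, d_k))
```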
Running this code produces an output of shape (batch size, sequence length, values dimensionality). Note that you will likely see a different output due to the random initialization of the queries, keys, and values.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Papers
- Attention Is All You Need, 2017
Summary
In this tutorial, you discovered how to implement scaled dot-product attention from scratch in TensorFlow and Keras.
Specifically, you learned:
- The operations that form part of the scaled dot-product attention mechanism
- How to implement the scaled dot-product attention mechanism from scratch
Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.
