Artificial Intelligence , Machine Learning and Data Science Hubspot

Unlock the Power of Artificial Intelligence, Machine Learning, and Data Science with our Blog Discover the latest insights, trends, and innovations in Artificial Intelligence (AI), Machine Learning (ML), and Data Science through our informative and engaging Hubspot blog. Gain a deep understanding of how these transformative technologies are shaping industries and revolutionizing the way we work. Stay updated with cutting-edge advancements, practical applications, and real-world use.

Tuesday, 5 November 2024

How to Develop an Encoder-Decoder Model with Attention in Keras

The encoder-decoder architecture for recurrent neural networks is proving to be powerful on a host of sequence-to-sequence prediction problems in the field of natural language processing such as machine translation and caption generation.

Attention is a mechanism that addresses a limitation of the encoder-decoder architecture on long sequences, and that in general speeds up the learning and lifts the skill of the model no sequence to sequence prediction problems.

In this tutorial, you will discover how to develop an encoder-decoder recurrent neural network with attention in Python with Keras.

After completing this tutorial, you will know:

How to design a small and configurable problem to evaluate encoder-decoder recurrent neural networks with and without attention.
How to design and evaluate an encoder-decoder network with and without attention for the sequence prediction problem.
How to robustly compare the performance of encoder-decoder networks with and without attention.

Kick-start your project with my new book Long Short-Term Memory Networks With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Note May/2020: The underlying APIs have changed and this tutorial may no longer be current. You may require older versions of Keras and TensorFlow, e.g. Keras 2 and TF 1.

How to Develop an Encoder-Decoder Model with Attention for Sequence-to-Sequence Prediction in Keras
Photo by Angela and Andrew, some rights reserved.

Tutorial Overview

This tutorial is divided into 6 parts; they are:

Encoder-Decoder with Attention
Test Problem for Attention
Encoder-Decoder without Attention
Custom Keras Attention Layer
Encoder-Decoder with Attention
Comparison of Models

Python Environment

This tutorial assumes you have a Python 3 SciPy environment installed.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:

How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda

Encoder-Decoder with Attention

The encoder-decoder model for recurrent neural networks is an architecture for sequence-to-sequence prediction problems.

It is comprised of two sub-models, as its name suggests:

Encoder: The encoder is responsible for stepping through the input time steps and encoding the entire sequence into a fixed length vector called a context vector.
Decoder: The decoder is responsible for stepping through the output time steps while reading from the context vector.

A problem with the architecture is that performance is poor on long input or output sequences. The reason is believed to be because of the fixed-sized internal representation used by the encoder.

Attention is an extension to the architecture that addresses this limitation. It works by first providing a richer context from the encoder to the decoder and a learning mechanism where the decoder can learn where to pay attention in the richer encoding when predicting each time step in the output sequence.

For more on attention in the encoder-decoder architecture, see the posts:

You Need To Learn And Grow With Your Colleagues.

You don't need to be your own boss, you need to learn and grow with us at Thales.

Test Problem for Attention

Before we develop models with attention, we will first define a contrived scalable test problem that we can use to determine whether attention is providing any benefit.

In this problem, we will generate sequences of random integers as input and matching output sequences comprised of a subset of the integers in the input sequence.

For example, an input sequence might be [1, 6, 2, 7, 3] and the expected output sequence might be the first two random integers in the sequence [1, 6].

We will define the problem such that the input and output sequences are the same length and pad the output sequences with “0” values as needed.

First, we need a function to generate sequences of random integers. We will use the Python randint() function to generate random integers between 0 and a maximum value and use this range as the cardinality for the problem (e.g. the number of features or an axis of difficulty).

The function generate_sequence() below will generate a random sequence of integers to a fixed length and with the specified cardinality.

1
2
3
4
5
6
7
8
9
from random import randint
 
# generate a sequence of random integers
def generate_sequence(length, n_unique):
 return [randint(0, n_unique-1) for _ in range(length)]
 
# generate random sequence
sequence = generate_sequence(5, 50)
print(sequence)

Running this example generates a sequence of 5 time steps where each value in the sequence is a random integer between 0 and 49.

1
[43, 3, 28, 34, 33]

Next, we need a function to one hot encode the discrete integer values into binary vectors.

If a cardinality of 50 is used, then each integer will be represented by a 50-element vector of 0 values and 1 in the index of the specified integer value.

The one_hot_encode() function below will one hot encode a given sequence of integers.

1
2
3
4
5
6
7
8
# one hot encode sequence
def one_hot_encode(sequence, n_unique):
 encoding = list()
 for value in sequence:
 vector = [0 for _ in range(n_unique)]
 vector[value] = 1
 encoding.append(vector)
 return array(encoding)

We also need to be able to decode an encoded sequence. This will be needed to turn a prediction from the model or an encoded expected sequence back into a sequence of integers we can read and evaluate.

The one_hot_decode() function below will decode a one hot encoded sequence back into a sequence of integers.

1
2
3
# decode a one hot encoded string
def one_hot_decode(encoded_seq):
 return [argmax(vector) for vector in encoded_seq]

We can test out these operations in the example below.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
from random import randint
from numpy import array
from numpy import argmax
 
# generate a sequence of random integers
def generate_sequence(length, n_unique):
 return [randint(0, n_unique-1) for _ in range(length)]
 
# one hot encode sequence
def one_hot_encode(sequence, n_unique):
 encoding = list()
 for value in sequence:
 vector = [0 for _ in range(n_unique)]
 vector[value] = 1
 encoding.append(vector)
 return array(encoding)
 
# decode a one hot encoded string
def one_hot_decode(encoded_seq):
 return [argmax(vector) for vector in encoded_seq]
 
# generate random sequence
sequence = generate_sequence(5, 50)
print(sequence)
# one hot encode
encoded = one_hot_encode(sequence, 50)
print(encoded)
# decode
decoded = one_hot_decode(encoded)
print(decoded)

Running the example first prints a randomly generated sequence, then the one hot encoded version, then finally the decoded sequence again.

1
2
3
4
5
6
7
[3, 18, 32, 11, 36]
[[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]]
[3, 18, 32, 11, 36]

Finally, we need a function that can create input and output pairs of sequences to train and evaluate a model.

The function below named get_pair() will return one input and output sequence pair given a specified input length, output length, and cardinality. Both input and output sequences are the same length, the length of the input sequence, but the output sequence will be taken as the first n characters of the input sequence and padded with zero values to the required length.

The sequences of integers are then encoded then reshaped into a 3D format required for the recurrent neural network, with the dimensions: samples, time steps, and features. In this case, samples is always 1 as we are only generating one input-output pair, the time steps is the input sequence length and features is the cardinality of each time step.

1
2
3
4
5
6
7
8
9
10
11
12
# prepare data for the LSTM
def get_pair(n_in, n_out, n_unique):
 # generate random sequence
 sequence_in = generate_sequence(n_in, n_unique)
 sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
 # one hot encode
 X = one_hot_encode(sequence_in, n_unique)
 y = one_hot_encode(sequence_out, n_unique)
 # reshape as 3D
 X = X.reshape((1, X.shape[0], X.shape[1]))
 y = y.reshape((1, y.shape[0], y.shape[1]))
 return X,y

We can put this all together and demonstrate the data preparation code.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
from random import randint
from numpy import array
from numpy import argmax
 
# generate a sequence of random integers
def generate_sequence(length, n_unique):
 return [randint(0, n_unique-1) for _ in range(length)]
 
# one hot encode sequence
def one_hot_encode(sequence, n_unique):
 encoding = list()
 for value in sequence:
 vector = [0 for _ in range(n_unique)]
 vector[value] = 1
 encoding.append(vector)
 return array(encoding)
 
# decode a one hot encoded string
def one_hot_decode(encoded_seq):
 return [argmax(vector) for vector in encoded_seq]
 
# prepare data for the LSTM
def get_pair(n_in, n_out, n_unique):
 # generate random sequence
 sequence_in = generate_sequence(n_in, n_unique)
 sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
 # one hot encode
 X = one_hot_encode(sequence_in, n_unique)
 y = one_hot_encode(sequence_out, n_unique)
 # reshape as 3D
 X = X.reshape((1, X.shape[0], X.shape[1]))
 y = y.reshape((1, y.shape[0], y.shape[1]))
 return X,y
 
# generate random sequence
X, y = get_pair(5, 2, 50)
print(X.shape, y.shape)
print('X=%s, y=%s' % (one_hot_decode(X[0]), one_hot_decode(y[0])))

Running the example generates a single input-output pair and prints the shape of both arrays.

The generated pair is then printed in a decoded form where we can see that the first two integers of the sequence are reproduced in the output sequence followed by a padding of zero values.

1
2
(1, 5, 50) (1, 5, 50)
X=[12, 20, 36, 40, 12], y=[12, 20, 0, 0, 0]

Encoder-Decoder Without Attention

In this section, we will develop a baseline in performance on the problem with an encoder-decoder model without attention.

We will fix the problem definition at input and output sequences of 5 time steps, the first 2 elements of the input sequence in the output sequence and a cardinality of 50.

1
2
3
4
# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2

We can develop a simple encoder-decoder model in Keras by taking the output from an encoder LSTM model, repeating it n times for the number of timesteps in the output sequence, then using a decoder to predict the output sequence.

For more detail on how to define an encoder-decoder architecture in Keras, see the post:

Encoder-Decoder Long Short-Term Memory Networks

We will configure the encoder and decoder with the same number of units, in this case 150. We will use the efficient Adam implementation of gradient descent and optimize the categorical cross entropy loss function, given that the problem is technically a multi-class classification problem.

The configuration for the model was found after a little trial and error and is by no means optimized.

The code for an encoder-decoder architecture in Keras is listed below.

1
2
3
4
5
6
7
# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
model.add(LSTM(150, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

We will train the model on 5,000 random input-output pairs of integer sequences.

1
2
3
4
5
6
# train LSTM
for epoch in range(5000):
 # generate new random sequence
 X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
 # fit model for one epoch on this sequence
 model.fit(X, y, epochs=1, verbose=2)

Once trained, we will evaluate the model on 100 new randomly generated integer sequences and only mark a prediction correct when the entire output sequence matches the expected value.

1
2
3
4
5
6
7
8
# evaluate LSTM
total, correct = 100, 0
for _ in range(total):
 X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
 yhat = model.predict(X, verbose=0)
 if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
 correct += 1
print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))

Finally, we will print 10 examples of expected output sequences and sequences predicted by the model.

Putting all of this together, the complete example is listed below.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import TimeDistributed
from keras.layers import RepeatVector
 
# generate a sequence of random integers
def generate_sequence(length, n_unique):
 return [randint(0, n_unique-1) for _ in range(length)]
 
# one hot encode sequence
def one_hot_encode(sequence, n_unique):
 encoding = list()
 for value in sequence:
 vector = [0 for _ in range(n_unique)]
 vector[value] = 1
 encoding.append(vector)
 return array(encoding)
 
# decode a one hot encoded string
def one_hot_decode(encoded_seq):
 return [argmax(vector) for vector in encoded_seq]
 
# prepare data for the LSTM
def get_pair(n_in, n_out, cardinality):
 # generate random sequence
 sequence_in = generate_sequence(n_in, cardinality)
 sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
 # one hot encode
 X = one_hot_encode(sequence_in, cardinality)
 y = one_hot_encode(sequence_out, cardinality)
 # reshape as 3D
 X = X.reshape((1, X.shape[0], X.shape[1]))
 y = y.reshape((1, y.shape[0], y.shape[1]))
 return X,y
 
# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2
# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
model.add(LSTM(150, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# train LSTM
for epoch in range(5000):
 # generate new random sequence
 X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
 # fit model for one epoch on this sequence
 model.fit(X, y, epochs=1, verbose=2)
# evaluate LSTM
total, correct = 100, 0
for _ in range(total):
 X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
 yhat = model.predict(X, verbose=0)
 if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
 correct += 1
print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))
# spot check some examples
for _ in range(10):
 X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
 yhat = model.predict(X, verbose=0)
 print('Expected:', one_hot_decode(y[0]), 'Predicted', one_hot_decode(yhat[0]))

Running this example will not take long, perhaps a few minutes on the CPU, no GPU is required.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The accuracy of the model was reported at just under 20%.

1
Accuracy: 19.00%

We can see from the sample outputs that the model does get one number in the output sequence correct for most or all cases, and only struggles with the second number. All zero padding values are predicted correctly.

1
2
3
4
5
6
7
8
9
10
Expected: [47, 0, 0, 0, 0] Predicted [47, 47, 0, 0, 0]
Expected: [43, 31, 0, 0, 0] Predicted [43, 31, 0, 0, 0]
Expected: [14, 22, 0, 0, 0] Predicted [14, 14, 0, 0, 0]
Expected: [39, 31, 0, 0, 0] Predicted [39, 39, 0, 0, 0]
Expected: [6, 4, 0, 0, 0] Predicted [6, 4, 0, 0, 0]
Expected: [47, 0, 0, 0, 0] Predicted [47, 47, 0, 0, 0]
Expected: [39, 33, 0, 0, 0] Predicted [39, 39, 0, 0, 0]
Expected: [23, 2, 0, 0, 0] Predicted [23, 23, 0, 0, 0]
Expected: [19, 28, 0, 0, 0] Predicted [19, 3, 0, 0, 0]
Expected: [32, 33, 0, 0, 0] Predicted [32, 32, 0, 0, 0]

Custom Keras Attention Layer

Now we need to add attention to the encoder-decoder model.

At the time of writing, Keras does not have the capability of attention built into the library, but it is coming soon.

Until attention is officially available in Keras, we can either develop our own implementation or use an existing third-party implementation.

To speed things up, let’s use an existing third-party implementation.

Zafarali Ahmed an intern at Datalogue developed a custom layer for Keras that provides support for attention, presented in a post titled “How to Visualize Your Recurrent Neural Network with Attention in Keras” in 2017 and GitHub project called “keras-attention“.

The custom attention layer is called AttentionDecoder and is available in the custom_recurrents.py file in the GitHub project. We can reuse this code under the GNU Affero General Public License v3.0 license of the project.

A copy of the custom layer is listed below for completeness. Copy it and paste it into a new and separate file in your current working directory called ‘attention_decoder.py‘.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
import tensorflow as tf
from keras import backend as K
from keras import regularizers, constraints, initializers, activations
from keras.layers.recurrent import Recurrent, _time_distributed_dense
from keras.engine import InputSpec
 
tfPrint = lambda d, T: tf.Print(input_=T, data=[T, tf.shape(T)], message=d)
 
class AttentionDecoder(Recurrent):
 
    def __init__(self, units, output_dim,
                 activation='tanh',
                 return_probabilities=False,
                 name='AttentionDecoder',
                 kernel_initializer='glorot_uniform',
                 recurrent_initializer='orthogonal',
                 bias_initializer='zeros',
                 kernel_regularizer=None,
                 bias_regularizer=None,
                 activity_regularizer=None,
                 kernel_constraint=None,
                 bias_constraint=None,
                 **kwargs):
        """
        Implements an AttentionDecoder that takes in a sequence encoded by an
        encoder and outputs the decoded states
        :param units: dimension of the hidden state and the attention matrices
        :param output_dim: the number of labels in the output space
 
        references:
            Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio.
            "Neural machine translation by jointly learning to align and translate."
            arXiv preprint arXiv:1409.0473 (2014).
        """
        self.units = units
        self.output_dim = output_dim
        self.return_probabilities = return_probabilities
        self.activation = activations.get(activation)
        self.kernel_initializer = initializers.get(kernel_initializer)
        self.recurrent_initializer = initializers.get(recurrent_initializer)
        self.bias_initializer = initializers.get(bias_initializer)
 
        self.kernel_regularizer = regularizers.get(kernel_regularizer)
        self.recurrent_regularizer = regularizers.get(kernel_regularizer)
        self.bias_regularizer = regularizers.get(bias_regularizer)
        self.activity_regularizer = regularizers.get(activity_regularizer)
 
        self.kernel_constraint = constraints.get(kernel_constraint)
        self.recurrent_constraint = constraints.get(kernel_constraint)
        self.bias_constraint = constraints.get(bias_constraint)
 
        super(AttentionDecoder, self).__init__(**kwargs)
        self.name = name
        self.return_sequences = True  # must return sequences
 
    def build(self, input_shape):
        """
          See Appendix 2 of Bahdanau 2014, arXiv:1409.0473
          for model details that correspond to the matrices here.
        """
 
        self.batch_size, self.timesteps, self.input_dim = input_shape
 
        if self.stateful:
            super(AttentionDecoder, self).reset_states()
 
        self.states = [None, None]  # y, s
 
        """
            Matrices for creating the context vector
        """
 
        self.V_a = self.add_weight(shape=(self.units,),
                                   name='V_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.W_a = self.add_weight(shape=(self.units, self.units),
                                   name='W_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.U_a = self.add_weight(shape=(self.input_dim, self.units),
                                   name='U_a',
                                   initializer=self.kernel_initializer,
                                   regularizer=self.kernel_regularizer,
                                   constraint=self.kernel_constraint)
        self.b_a = self.add_weight(shape=(self.units,),
                                   name='b_a',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for the r (reset) gate
        """
        self.C_r = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_r',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_r = self.add_weight(shape=(self.units, self.units),
                                   name='U_r',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_r = self.add_weight(shape=(self.output_dim, self.units),
                                   name='W_r',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_r = self.add_weight(shape=(self.units, ),
                                   name='b_r',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
 
        """
            Matrices for the z (update) gate
        """
        self.C_z = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_z',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_z = self.add_weight(shape=(self.units, self.units),
                                   name='U_z',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_z = self.add_weight(shape=(self.output_dim, self.units),
                                   name='W_z',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_z = self.add_weight(shape=(self.units, ),
                                   name='b_z',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for the proposal
        """
        self.C_p = self.add_weight(shape=(self.input_dim, self.units),
                                   name='C_p',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_p = self.add_weight(shape=(self.units, self.units),
                                   name='U_p',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_p = self.add_weight(shape=(self.output_dim, self.units),
                                   name='W_p',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_p = self.add_weight(shape=(self.units, ),
                                   name='b_p',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
        """
            Matrices for making the final prediction vector
        """
        self.C_o = self.add_weight(shape=(self.input_dim, self.output_dim),
                                   name='C_o',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.U_o = self.add_weight(shape=(self.units, self.output_dim),
                                   name='U_o',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.W_o = self.add_weight(shape=(self.output_dim, self.output_dim),
                                   name='W_o',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
        self.b_o = self.add_weight(shape=(self.output_dim, ),
                                   name='b_o',
                                   initializer=self.bias_initializer,
                                   regularizer=self.bias_regularizer,
                                   constraint=self.bias_constraint)
 
        # For creating the initial state:
        self.W_s = self.add_weight(shape=(self.input_dim, self.units),
                                   name='W_s',
                                   initializer=self.recurrent_initializer,
                                   regularizer=self.recurrent_regularizer,
                                   constraint=self.recurrent_constraint)
 
        self.input_spec = [
            InputSpec(shape=(self.batch_size, self.timesteps, self.input_dim))]
        self.built = True
 
    def call(self, x):
        # store the whole sequence so we can "attend" to it at each timestep
        self.x_seq = x
 
        # apply the a dense layer over the time dimension of the sequence
        # do it here because it doesn't depend on any previous steps
        # thefore we can save computation time:
        self._uxpb = _time_distributed_dense(self.x_seq, self.U_a, b=self.b_a,
                                             input_dim=self.input_dim,
                                             timesteps=self.timesteps,
                                             output_dim=self.units)
 
        return super(AttentionDecoder, self).call(x)
 
    def get_initial_state(self, inputs):
        # apply the matrix on the first time step to get the initial s0.
        s0 = activations.tanh(K.dot(inputs[:, 0], self.W_s))
 
        # from keras.layers.recurrent to initialize a vector of (batchsize,
        # output_dim)
        y0 = K.zeros_like(inputs)  # (samples, timesteps, input_dims)
        y0 = K.sum(y0, axis=(1, 2))  # (samples, )
        y0 = K.expand_dims(y0)  # (samples, 1)
        y0 = K.tile(y0, [1, self.output_dim])
 
        return [y0, s0]
 
    def step(self, x, states):
 
        ytm, stm = states
 
        # repeat the hidden state to the length of the sequence
        _stm = K.repeat(stm, self.timesteps)
 
        # now multiplty the weight matrix with the repeated hidden state
        _Wxstm = K.dot(_stm, self.W_a)
 
        # calculate the attention probabilities
        # this relates how much other timesteps contributed to this one.
        et = K.dot(activations.tanh(_Wxstm + self._uxpb),
                   K.expand_dims(self.V_a))
        at = K.exp(et)
        at_sum = K.sum(at, axis=1)
        at_sum_repeated = K.repeat(at_sum, self.timesteps)
        at /= at_sum_repeated  # vector of size (batchsize, timesteps, 1)
 
        # calculate the context vector
        context = K.squeeze(K.batch_dot(at, self.x_seq, axes=1), axis=1)
        # ~~~> calculate new hidden state
        # first calculate the "r" gate:
 
        rt = activations.sigmoid(
            K.dot(ytm, self.W_r)
            + K.dot(stm, self.U_r)
            + K.dot(context, self.C_r)
            + self.b_r)
 
        # now calculate the "z" gate
        zt = activations.sigmoid(
            K.dot(ytm, self.W_z)
            + K.dot(stm, self.U_z)
            + K.dot(context, self.C_z)
            + self.b_z)
 
        # calculate the proposal hidden state:
        s_tp = activations.tanh(
            K.dot(ytm, self.W_p)
            + K.dot((rt * stm), self.U_p)
            + K.dot(context, self.C_p)
            + self.b_p)
 
        # new hidden state:
        st = (1-zt)*stm + zt * s_tp
 
        yt = activations.softmax(
            K.dot(ytm, self.W_o)
            + K.dot(stm, self.U_o)
            + K.dot(context, self.C_o)
            + self.b_o)
 
        if self.return_probabilities:
            return at, [yt, st]
        else:
            return yt, [yt, st]
 
    def compute_output_shape(self, input_shape):
        """
            For Keras internal compatability checking
        """
        if self.return_probabilities:
            return (None, self.timesteps, self.timesteps)
        else:
            return (None, self.timesteps, self.output_dim)
 
    def get_config(self):
        """
            For rebuilding models on load time.
        """
        config = {
            'output_dim': self.output_dim,
            'units': self.units,
            'return_probabilities': self.return_probabilities
        }
        base_config = super(AttentionDecoder, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

We can make use of this custom layer in our projects by importing it as follows:

1
from attention_decoder import AttentionDecoder

The layer implements attention as described by Bahdanau, et al. in their paper “Neural Machine Translation by Jointly Learning to Align and Translate.”

The code is explained well in the original post and linked to both the LSTM and attention equations.

A limitation of this implementation is that it must output sequences that are the same length as the input sequences, the specific limitation that the encoder-decoder architecture was designed to overcome.

Importantly, the new layer manages both the repeating of the decoding as performed by the second LSTM, as well as the softmax output for the model as was performed by the Dense output layer in the encoder-decoder model without attention. This greatly simplifies the code for the model.

It is important to note that the custom layer is built upon the Recurrent layer in Keras, which, at the time of writing, is marked as legacy code, and presumably will be removed from the project at some point.

Encoder-Decoder With Attention

Now that we have an implementation of attention that we can use, we can develop an encoder-decoder model with attention for our contrived sequence prediction problem.

The model with the attention layer is defined below. We can see that the layer handles some of the machinery of the encoder-decoder model itself, making defining the model simpler.

1
2
3
4
5
# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
model.add(AttentionDecoder(150, n_features))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

That’s it. The rest of the example is the same.

The complete example is listed below.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.models import Sequential
from keras.layers import LSTM
from attention_decoder import AttentionDecoder
 
# generate a sequence of random integers
def generate_sequence(length, n_unique):
 return [randint(0, n_unique-1) for _ in range(length)]
 
# one hot encode sequence
def one_hot_encode(sequence, n_unique):
 encoding = list()
 for value in sequence:
 vector = [0 for _ in range(n_unique)]
 vector[value] = 1
 encoding.append(vector)
 return array(encoding)
 
# decode a one hot encoded string
def one_hot_decode(encoded_seq):
 return [argmax(vector) for vector in encoded_seq]
 
# prepare data for the LSTM
def get_pair(n_in, n_out, cardinality):
 # generate random sequence
 sequence_in = generate_sequence(n_in, cardinality)
 sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
 # one hot encode
 X = one_hot_encode(sequence_in, cardinality)
 y = one_hot_encode(sequence_out, cardinality)
 # reshape as 3D
 X = X.reshape((1, X.shape[0], X.shape[1]))
 y = y.reshape((1, y.shape[0], y.shape[1]))
 return X,y
 
# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2
 
# define model
model = Sequential()
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
model.add(AttentionDecoder(150, n_features))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# train LSTM
for epoch in range(5000):
 # generate new random sequence
 X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
 # fit model for one epoch on this sequence
 model.fit(X, y, epochs=1, verbose=2)
# evaluate LSTM
total, correct = 100, 0
for _ in range(total):
 X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
 yhat = model.predict(X, verbose=0)
 if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
 correct += 1
print('Accuracy: %.2f%%' % (float(correct)/float(total)*100.0))
# spot check some examples
for _ in range(10):
 X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
 yhat = model.predict(X, verbose=0)
 print('Expected:', one_hot_decode(y[0]), 'Predicted', one_hot_decode(yhat[0]))

Running the example prints the skill of the model on 100 randomly generated input-output pairs.

With the same resources and same amount of training, the model with attention performs much better.

1
Accuracy: 95.00%

Spot-checking some sample outputs and predicted sequences, we can see very few errors, even in cases when there is a zero value in the first two elements.

1
2
3
4
5
6
7
8
9
10
Expected: [48, 47, 0, 0, 0] Predicted [48, 47, 0, 0, 0]
Expected: [7, 46, 0, 0, 0] Predicted [7, 46, 0, 0, 0]
Expected: [32, 30, 0, 0, 0] Predicted [32, 2, 0, 0, 0]
Expected: [3, 25, 0, 0, 0] Predicted [3, 25, 0, 0, 0]
Expected: [45, 4, 0, 0, 0] Predicted [45, 4, 0, 0, 0]
Expected: [49, 9, 0, 0, 0] Predicted [49, 9, 0, 0, 0]
Expected: [22, 23, 0, 0, 0] Predicted [22, 23, 0, 0, 0]
Expected: [29, 36, 0, 0, 0] Predicted [29, 36, 0, 0, 0]
Expected: [0, 29, 0, 0, 0] Predicted [0, 29, 0, 0, 0]
Expected: [11, 26, 0, 0, 0] Predicted [11, 26, 0, 0, 0]

Comparison of Models

Although we are getting better results from the model with attention, the results were reported from a single run of each model.

In this case, we seek a more robust finding by repeating the evaluation of each model multiple times and reporting the average performance over those runs. For more information on this robust approach to evaluating neural network models, see the post:

How to Evaluate the Skill of Deep Learning Models

We can define a function to create each type of model, as follows.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# define the encoder-decoder model
def baseline_model(n_timesteps_in, n_features):
 model = Sequential()
 model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
 model.add(RepeatVector(n_timesteps_in))
 model.add(LSTM(150, return_sequences=True))
 model.add(TimeDistributed(Dense(n_features, activation='softmax')))
 model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
 return model
 
# define the encoder-decoder with attention model
def attention_model(n_timesteps_in, n_features):
 model = Sequential()
 model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
 model.add(AttentionDecoder(150, n_features))
 model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
 return model

We can then define a function to fit and evaluate the accuracy of a fit model and return the accuracy score.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# train and evaluate a model, return accuracy
def train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features):
 # train LSTM
 for epoch in range(5000):
 # generate new random sequence
 X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
 # fit model for one epoch on this sequence
 model.fit(X, y, epochs=1, verbose=0)
 # evaluate LSTM
 total, correct = 100, 0
 for _ in range(total):
 X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
 yhat = model.predict(X, verbose=0)
 if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
 correct += 1
 return float(correct)/float(total)*100.0

Putting this together, we can repeat the process of creating, training, and evaluating each type of model multiple times and reporting the mean accuracy over the repeats. To keep running times down, we will repeat each model evaluation 10 times, although if you have the resources, you could increase this to 30 or 100 times.

The complete example is listed below.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
from random import randint
from numpy import array
from numpy import argmax
from numpy import array_equal
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import TimeDistributed
from keras.layers import RepeatVector
from attention_decoder import AttentionDecoder
 
# generate a sequence of random integers
def generate_sequence(length, n_unique):
 return [randint(0, n_unique-1) for _ in range(length)]
 
# one hot encode sequence
def one_hot_encode(sequence, n_unique):
 encoding = list()
 for value in sequence:
 vector = [0 for _ in range(n_unique)]
 vector[value] = 1
 encoding.append(vector)
 return array(encoding)
 
# decode a one hot encoded string
def one_hot_decode(encoded_seq):
 return [argmax(vector) for vector in encoded_seq]
 
# prepare data for the LSTM
def get_pair(n_in, n_out, cardinality):
 # generate random sequence
 sequence_in = generate_sequence(n_in, cardinality)
 sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in-n_out)]
 # one hot encode
 X = one_hot_encode(sequence_in, cardinality)
 y = one_hot_encode(sequence_out, cardinality)
 # reshape as 3D
 X = X.reshape((1, X.shape[0], X.shape[1]))
 y = y.reshape((1, y.shape[0], y.shape[1]))
 return X,y
 
# define the encoder-decoder model
def baseline_model(n_timesteps_in, n_features):
 model = Sequential()
 model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
 model.add(RepeatVector(n_timesteps_in))
 model.add(LSTM(150, return_sequences=True))
 model.add(TimeDistributed(Dense(n_features, activation='softmax')))
 model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
 return model
 
# define the encoder-decoder with attention model
def attention_model(n_timesteps_in, n_features):
 model = Sequential()
 model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
 model.add(AttentionDecoder(150, n_features))
 model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
 return model
 
# train and evaluate a model, return accuracy
def train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features):
 # train LSTM
 for epoch in range(5000):
 # generate new random sequence
 X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
 # fit model for one epoch on this sequence
 model.fit(X, y, epochs=1, verbose=0)
 # evaluate LSTM
 total, correct = 100, 0
 for _ in range(total):
 X,y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
 yhat = model.predict(X, verbose=0)
 if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
 correct += 1
 return float(correct)/float(total)*100.0
 
# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2
n_repeats = 10
# evaluate encoder-decoder model
print('Encoder-Decoder Model')
results = list()
for _ in range(n_repeats):
 model = baseline_model(n_timesteps_in, n_features)
 accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
 results.append(accuracy)
 print(accuracy)
print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats)))
# evaluate encoder-decoder with attention model
print('Encoder-Decoder With Attention Model')
results = list()
for _ in range(n_repeats):
 model = attention_model(n_timesteps_in, n_features)
 accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
 results.append(accuracy)
 print(accuracy)
print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats)))

Running this example prints the accuracy for each model repeat to give you an idea of the progress of the run.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
Encoder-Decoder Model
20.0
23.0
23.0
18.0
28.000000000000004
28.999999999999996
23.0
26.0
21.0
20.0
Mean Accuracy: 23.10%
 
Encoder-Decoder With Attention Model
98.0
91.0
94.0
93.0
96.0
99.0
97.0
94.0
99.0
96.0
Mean Accuracy: 95.70%

We can see that even averaged over 10 runs, the attention model still shows better performance than the encoder-decoder model without attention, 23.10% vs 95.70%.

A good extension to this evaluation would be to capture the model loss each epoch for each model, take the average, and compare how the loss changes over time for the architecture with and without attention.

I expect that this trace would show attention achieving better skill much faster and sooner than the non-attentional model, further highlighting the benefit of the approach.

Summary

In this tutorial, you discovered how to develop an encoder-decoder recurrent neural network with attention in Python with Keras.

Specifically, you learned:

How to design a small and configurable problem to evaluate encoder-decoder recurrent neural networks with and without attention.
How to design and evaluate an encoder-decoder network with and without attention for the sequence prediction problem.
How to robustly compare the performance of encoder-decoder networks with and without attention.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer

Artificial Intelligence , Machine Learning and Data Science Hubspot

Tuesday, 5 November 2024

How to Develop an Encoder-Decoder Model with Attention in Keras

How to Develop an Encoder-Decoder Model with Attention in Keras

Tutorial Overview

Python Environment

Encoder-Decoder with Attention

Test Problem for Attention

Encoder-Decoder Without Attention

Custom Keras Attention Layer

Encoder-Decoder With Attention

Comparison of Models

Further Reading

Summary

No comments:

Post a Comment

The Chain Rule of Calculus for Univariate and Multivariate Functions

Report Abuse

Labels

"Donate for a Noble Cause