Thursday, 22 February 2024
Text Generation with LSTM in PyTorch
Recurrent neural networks can be used for time series prediction, in which case a regression neural network is created. They can also be used as generative models, which are usually classification neural network models. A generative model learns certain patterns from data so that, when it is presented with some prompt, it can create a complete output in the same style as the learned pattern.
In this post, you will discover how to build a generative model for text using LSTM recurrent neural networks in PyTorch. After finishing this post, you will know:
Where to download a free corpus of text that you can use to train text generative models
How to frame the problem of text sequences to a recurrent neural network generative model
How to develop an LSTM to generate plausible text sequences for a given problem
Overview
This post is divided into six parts; they are:
What is a Generative Model
Getting Text Data
A Small LSTM Network to Predict the Next Character
Generating Text with an LSTM Model
Using a Larger LSTM Network
Faster Training with GPU
What is a Generative Model
A generative model is just another machine learning model, one that happens to be able to create new things. The Generative Adversarial Network (GAN) is a class of its own. Transformer models that use the attention mechanism have also been found useful for generating text passages.
It is just a machine learning model because it has been trained on existing data and has learned something from it. Depending on how they are trained, such models can work very differently. In this post, a character-based generative model is created. That means training a model that takes a sequence of characters (letters and punctuation) as input and the immediate next character as the target. As long as it can predict the next character given the preceding ones, you can run the model in a loop to generate a long piece of text.
This model is probably the simplest one. However, human language is complex, so you shouldn’t expect it to produce very high quality output. Even for this, you need a lot of data and must train the model for a long time before you see sensible results.
Getting Text Data
Obtaining high quality data is important for a successful generative model. Fortunately, many of the classical texts are no longer protected under copyright. This means you can download all the text for these books for free and use them in experiments, like creating generative models. Perhaps the best place to get access to free books that are no longer protected by copyright is Project Gutenberg.
In this post, you will use a favorite book from childhood as the dataset, Alice’s Adventures in Wonderland by Lewis Carroll.
Your model will learn the dependencies between characters and the conditional probabilities of characters in sequences so that you can, in turn, generate wholly new and original sequences of characters. This post is a lot of fun, and repeating these experiments with other books from Project Gutenberg is recommended. These experiments are not limited to text; you can also experiment with other ASCII data, such as computer source code, marked-up documents in LaTeX, HTML or Markdown, and more.
You can download the complete text in ASCII format (Plaintext UTF-8) for this book for free and place it in your working directory with the filename wonderland.txt. Now, you need to prepare the dataset for modeling. Project Gutenberg adds a standard header and footer to each book, which are not part of the original text. Open the file in a text editor and delete the header and footer. The header is obvious and ends with the text:
*** START OF THIS PROJECT GUTENBERG EBOOK ALICE'S ADVENTURES IN WONDERLAND ***
The footer is all the text after the line of text that says:
THE END
You should be left with a text file that has about 3,400 lines of text.
A Small LSTM Network to Predict the Next Character
First, you need to do some preprocessing on the data before you can build a model. Neural network models can only work with numbers, not text. Therefore you need to transform the characters into numbers. To make the problem simpler, you also want to transform all uppercase letters into lowercase.
To do this, you open the text file, transform all letters into lowercase, and create a Python dict char_to_int that maps each character to a distinct integer. The mapping is built from the sorted list of unique lowercase characters in the book.
The book has just under 150,000 characters, and when converted to lowercase, there are only 50 distinct characters in the vocabulary for the network to learn, which is still much more than the 26 in the alphabet.
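The preprocessing described above can be sketched as follows. This is a minimal sketch rather than the original listing: the helper name build_vocab is hypothetical, and a short sample sentence stands in for the contents of wonderland.txt so that the snippet runs on its own.

```python
def build_vocab(text):
    """Lowercase the text and map each distinct character to an integer."""
    text = text.lower()
    chars = sorted(set(text))
    char_to_int = {c: i for i, c in enumerate(chars)}
    return text, chars, char_to_int

# Assumption: a sample sentence is used here instead of the book.
# With the real file you would read it first, e.g.
#   sample = open("wonderland.txt", "r", encoding="utf-8").read()
sample = "Alice was beginning to get very tired"

raw_text, chars, char_to_int = build_vocab(sample)
n_chars = len(raw_text)   # total number of characters
n_vocab = len(chars)      # number of distinct characters

print("Total Characters: ", n_chars)
print("Total Vocab: ", n_vocab)
```

Sorting the characters before numbering them keeps the mapping the same from run to run, which matters because the same mapping has to be reused later when generating text.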
Next, you need to separate the text into inputs and targets. A window of 100 characters is used here. That is, with characters 1 to 100 as input, your model is going to predict character 101. If a window of 5 were used, the word “chapter” would become two data samples:
chapt -> e
hapte -> r
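The sliding window can be checked with a few lines of plain Python. The helper name make_windows is hypothetical and used only for illustration; it pairs each window of characters with the character that immediately follows it.

```python
def make_windows(text, window):
    """Return (input window, next character) pairs for every position."""
    pairs = []
    for i in range(len(text) - window):
        seq_in = text[i:i + window]   # the window of input characters
        seq_out = text[i + window]    # the character right after the window
        pairs.append((seq_in, seq_out))
    return pairs

for seq_in, seq_out in make_windows("chapter", 5):
    print(seq_in, "->", seq_out)
```

A text of length n therefore yields n minus the window length samples, which is why a whole book produces such a large dataset.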
In a long text such as this one, a myriad of windows can be created, which produces a dataset with a lot of samples:
# prepare the dataset of input to output pairs encoded as integers
# (raw_text, n_chars, and char_to_int come from the preprocessing step above)
seq_length = 100
dataX = []
dataY = []
for i in range(0, n_chars - seq_length, 1):
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print("Total Patterns: ", n_patterns)
Running the above, you can see that a total of 144,474 samples are created. Each sample is now in the form of integers, transformed using the mapping char_to_int. However, a PyTorch model prefers to see the data as floating point tensors, so you should convert these into PyTorch tensors. Since an LSTM layer is going to be used in the model, the input tensor should have the dimensions (sample, time steps, features). To help training, it is also a good idea to normalize the input to the range 0 to 1.
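The conversion might look like the sketch below. The tiny dataX and dataY lists are made-up stand-ins so the snippet runs on its own, and dividing by the vocabulary size is one assumed way to scale the integers into the range 0 to 1.

```python
import torch

# Toy stand-ins (assumptions) for the integer-encoded windows built earlier
seq_length = 4
n_vocab = 50
dataX = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]]
dataY = [5, 6, 7]
n_patterns = len(dataX)

# reshape X to be [samples, time steps, features]
X = torch.tensor(dataX, dtype=torch.float32).reshape(n_patterns, seq_length, 1)
# normalize: every integer is below n_vocab, so the result lies in [0, 1)
X = X / float(n_vocab)
# targets stay as integer class indices for the classification loss
y = torch.tensor(dataY)

print(X.shape, y.shape)
```

The targets are left as integers because a cross entropy loss in PyTorch expects class indices rather than floating point values.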
You can now define your LSTM model. Here, you define a single hidden LSTM layer with 256 hidden units. The input has a single feature (i.e., one integer for one character). A dropout layer with probability 0.2 is added after the LSTM layer. The output of the LSTM layer is a tuple, whose first element is the hidden states from the LSTM cell at each time step. It is a history of how the hidden state evolved as the LSTM cell accepted each time step of input. Presumably, the last hidden state contains the most information, hence only the last hidden state is passed on to the output layer. The output layer is a fully-connected layer that produces logits for the 50 characters in the vocabulary. The logits can be converted into probability-like predictions using a softmax function.
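A model matching this description could be written as below. Treat it as a sketch under assumptions: the class name CharModel is chosen for illustration, and batch_first=True is assumed so that the input is laid out as (sample, time steps, features).

```python
import torch
import torch.nn as nn

class CharModel(nn.Module):
    def __init__(self, n_vocab=50):
        super().__init__()
        # one LSTM layer, one input feature per time step, 256 hidden units
        self.lstm = nn.LSTM(input_size=1, hidden_size=256, num_layers=1, batch_first=True)
        self.dropout = nn.Dropout(0.2)
        # fully-connected output layer producing one logit per character
        self.linear = nn.Linear(256, n_vocab)

    def forward(self, x):
        # first element of the tuple: hidden states for every time step
        x, _ = self.lstm(x)
        # keep only the hidden state of the last time step
        x = x[:, -1, :]
        return self.linear(self.dropout(x))

model = CharModel()
logits = model(torch.rand(2, 100, 1))   # batch of 2 windows, 100 characters each
probs = torch.softmax(logits, dim=1)    # probability-like predictions
print(logits.shape)
```

The softmax is applied outside the model here because the cross entropy loss in PyTorch works on raw logits.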
This is a model for single-character classification over 50 classes. Therefore, cross entropy loss should be used. It is optimized using the Adam optimizer. For simplicity, no test set has been created, but the model is evaluated on the training set once again at the end of each epoch to keep track of the progress.
This program can run for a long time, especially on CPU! To preserve the fruits of your work, the best model found so far is saved for future reuse.
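A training loop along these lines is sketched below. Several details are assumptions rather than the original listing: the function name train, the batch size, summing the loss over samples (reduction="sum"), and a tiny linear stand-in model with random data so that the snippet finishes in seconds. In practice you would pass in the LSTM model and the real tensors.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data

def train(model, X, y, n_epochs, batch_size, path=None, char_to_int=None):
    """Train on (X, y); return the best weights and the lowest loss seen."""
    optimizer = optim.Adam(model.parameters())
    # assumption: sum the loss so each epoch reports one total over all samples
    loss_fn = nn.CrossEntropyLoss(reduction="sum")
    loader = data.DataLoader(data.TensorDataset(X, y), shuffle=True, batch_size=batch_size)
    best_model, best_loss = None, float("inf")
    for epoch in range(n_epochs):
        model.train()
        for X_batch, y_batch in loader:
            loss = loss_fn(model(X_batch), y_batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # evaluate on the training set again to track progress
        model.eval()
        total = 0.0
        with torch.no_grad():
            for X_batch, y_batch in loader:
                total += loss_fn(model(X_batch), y_batch).item()
        if total < best_loss:
            best_loss = total
            best_model = copy.deepcopy(model.state_dict())
        print("Epoch %d: Cross-entropy: %.4f" % (epoch, total))
    if path is not None:
        # keep the weights together with the character mapping
        torch.save([best_model, char_to_int], path)
    return best_model, best_loss

# Toy stand-ins (assumptions): random data and a tiny linear model
torch.manual_seed(0)
seq_length, n_vocab = 10, 5
X = torch.rand(64, seq_length, 1)
y = torch.randint(0, n_vocab, (64,))
stand_in = nn.Sequential(nn.Flatten(), nn.Linear(seq_length, n_vocab))
best_model, best_loss = train(stand_in, X, y, n_epochs=3, batch_size=16)
```

Passing path="single-char.pth" together with the char_to_int dict would write the best weights and the mapping to disk once training ends.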
The cross entropy almost always decreases from one epoch to the next. This probably means the model has not fully converged and you can train it for more epochs. Once the training loop has completed, you should have the file single-char.pth, which contains the best model weights found as well as the character-to-integer mapping used by this model.
For completeness, everything above can be tied together into one script.
Generating Text with an LSTM Model
Given that the model is well trained, generating text using the trained LSTM network is relatively straightforward. First, you need to recreate the network and load the trained model weights from the saved checkpoint. Then you need to create some prompt for the model to start on. The prompt can be anything that the model can understand. It is a seed sequence given to the model to obtain one generated character. Then, the generated character is added to the end of this sequence, and the first character is trimmed off to maintain a consistent length. This process is repeated for as long as you want to predict new characters (e.g., a sequence of 1,000 characters in length). You can pick a random input pattern as your seed sequence, then print generated characters as you generate them.
A simple way to generate a prompt is to pick a random sample from the original dataset, e.g., a random window of the raw_text obtained in the previous section. Inside the generation loop, each predicted character is then appended to the prompt and the oldest character is dropped:
# append the new character into the prompt for the next iteration
pattern.append(index)
pattern = pattern[1:]
print()
print("Done.")
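Putting the prompt selection and the sliding update together, the generation step might look like the sketch below. It rests on several assumptions: the function name generate is hypothetical, the next character is chosen greedily with argmax, and an untrained linear stand-in model with a short sample text is used so the snippet runs on its own. With a trained LSTM you would pass that model and the saved char_to_int mapping instead.

```python
import random
import torch
import torch.nn as nn

def generate(model, prompt, char_to_int, n_out):
    """Generate n_out characters, one at a time, starting from prompt."""
    int_to_char = {i: c for c, i in char_to_int.items()}
    n_vocab = len(char_to_int)
    pattern = [char_to_int[c] for c in prompt]
    generated = []
    model.eval()
    with torch.no_grad():
        for _ in range(n_out):
            # format the window as (1, time steps, 1), scaled like the training data
            x = torch.tensor(pattern, dtype=torch.float32).reshape(1, len(pattern), 1)
            x = x / float(n_vocab)
            # assumption: pick the most likely next character
            index = int(model(x).argmax())
            generated.append(int_to_char[index])
            # append the new character and drop the oldest one
            pattern.append(index)
            pattern = pattern[1:]
    return "".join(generated)

# Toy stand-ins (assumptions): a short text and an untrained linear model
raw_text = "alice was beginning to get very tired of sitting by her sister on the bank"
chars = sorted(set(raw_text))
char_to_int = {c: i for i, c in enumerate(chars)}
seq_length = 10
start = random.randint(0, len(raw_text) - seq_length)
prompt = raw_text[start:start + seq_length]
model = nn.Sequential(nn.Flatten(), nn.Linear(seq_length, len(chars)))

text = generate(model, prompt, char_to_int, n_out=50)
print('Prompt: "%s"' % prompt)
print(text)
print("Done.")
```

Because the stand-in model is untrained, its output is meaningless; the point is only the mechanics of feeding each prediction back in as input.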
Running this example first outputs the prompt used, then each character as it is generated. For example, below are the results from one run of this text generator. The prompt was:
Prompt: "nother rush at the stick, and tumbled head
over heels in its hurry to get hold of it; then alice, th"
The generated text was:
e was qot a litule soteet of thet was sh the thiee harden an the courd, and was tuitk a little toaee th thite ththe and said to the suher, and the whrtght the pacbit sese tha woode of the soeee, and the white rabbit ses ani thr gort to the thite rabbit, and then she was aoiinnene th the three baaed of the sueen and saed “ota turpe ”hun mot,”
“i don’t know the ter ano _enend to mere,” said the maccht ar a sore of great roaee. “ie you don’t teink if thet soued to soeed to the boeie the mooer, io you bane thing it wo
tou het bn the crur,
“h whsh you cen not,” said the manch hare.
“wes, it aadi,” said the manch hare.
“weat you tail to merer ae in an a gens if gre” ”he were thing,” said the maccht ar a sore of geeaghen asd tothe to the thieg harden an the could.
“h dan tor toe taie thing,” said the manch hare.
“wes, it aadi,” said the manch hare.
“weat you tail to merer ae in an a gens if gre” ”he were thing,” said the maccht ar a sore of geeaghen asd tothe to the thieg harden an t
Let’s note some observations about the generated text.
It can emit line breaks. The original text limited the line width to 80 characters, and the generative model attempted to replicate this pattern.
The characters are separated into word-like groups, and some groups are actual English words (e.g., “the,” “said,” and “rabbit”), but many are not (e.g., “thite,” “soteet,” and “tha”).
Some of the words in sequence make sense (e.g., “i don’t know the”), but many do not (e.g., “he were thing”).
The fact that this character-based model of the book produces output like this is very impressive. It gives you a sense of the learning capabilities of LSTM networks. However, the results are not perfect. In the next section, you will look at improving the quality of results by developing a much larger LSTM network.
Using a Larger LSTM Network
Recall that LSTM is a recurrent neural network. It takes a sequence as input, and at each step of the sequence, the input is mixed with its internal states to produce an output. Hence the output from an LSTM is also a sequence. In the above, the output from the last time step is taken for further processing in the neural network, while those from earlier steps are discarded. However, this does not have to be the case. You can treat the sequence output from one LSTM layer as input to another LSTM layer. Then, you are building a larger network.
Similar to convolutional neural networks, a stacked LSTM network is supposed to have the earlier LSTM layers learn low-level features while the later LSTM layers learn high-level features. It may not always be useful, but you can try it out to see whether the model produces a better result.
In PyTorch, making a stacked LSTM layer is easy, and only a small modification to the above model is needed.
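A stacked version of the model could look like the following sketch. As before, the class name CharModel and batch_first=True are assumptions made for illustration; the layer sizes follow the single-layer model described earlier.

```python
import torch
import torch.nn as nn

class CharModel(nn.Module):
    def __init__(self, n_vocab=50):
        super().__init__()
        # two stacked LSTM layers with dropout applied between them
        self.lstm = nn.LSTM(input_size=1, hidden_size=256, num_layers=2,
                            batch_first=True, dropout=0.2)
        self.dropout = nn.Dropout(0.2)
        self.linear = nn.Linear(256, n_vocab)

    def forward(self, x):
        # hidden states of the top LSTM layer for every time step
        x, _ = self.lstm(x)
        # keep only the last time step
        x = x[:, -1, :]
        return self.linear(self.dropout(x))

model = CharModel()
logits = model(torch.rand(4, 100, 1))
print(logits.shape)
```

The input and output shapes are the same as for the single-layer model, so the training and generation code does not need to change.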
The only change is in the parameters to nn.LSTM(): you set num_layers=2 instead of 1 to add another LSTM layer. Between the two LSTM layers, you also add a dropout layer through the parameter dropout=0.2. Replacing the previous model with this one is the only change you need to make. Rerun the training and you should see the following:
...
Epoch 34: Cross-entropy: 203763.0312
Epoch 35: Cross-entropy: 204002.5938
Epoch 36: Cross-entropy: 210636.5625
Epoch 37: Cross-entropy: 199619.6875
Epoch 38: Cross-entropy: 199240.2969
Epoch 39: Cross-entropy: 196966.1250
You should see that the cross entropy here is lower than that in the previous section. This means this model is performing better. In fact, with this model, you can see the generated text looks more sensible:
Prompt: "ll
say that ‘i see what i eat’ is the same thing as ‘i eat what i see’!”
“you might just as well sa"
y it to sea,” she katter said to the jury. and the thoee hardeners vhine she was seady to alice the was a long tay of the sooe of the court, and she was seady to and taid to the coor and the court.
“well you see what you see, the mookee of the soog of the season of the shase of the court!”
“i don’t know the rame thing is it?” said the caterpillar.
“the cormous was it makes he it was it taie the reason of the shall bbout it, you know.”
“i don’t know the rame thing i can’t gelp the sea,” the hatter went on, “i don’t know the peally was in the shall sereat it would be a teally.
the mookee of the court ”
“i don’t know the rame thing is it?” said the caterpillar.
“the cormous was it makes he it was it taie the reason of the shall bbout it, you know.”
“i don’t know the rame thing i can’t gelp the sea,” the hatter went on, “i don’t know the peally was in the shall sereat it would be a teally.
the mookee of the court ”
“i don’t know the rame thing is it?” said the caterpillar.
“the
Done.
Not only are more of the words spelled correctly, but the text is also more English-like. Since the cross-entropy loss was still decreasing as you trained the model, you can assume the model has not converged yet. You can expect the model to get better if you increase the number of training epochs.
For completeness, the complete code for this new model covers both training and text generation. Its generation loop ends the same way as before:
# append the new character into the prompt for the next iteration
pattern.append(index)
pattern = pattern[1:]
print()
print("Done.")
Faster Training with GPU
Running the programs from this post can be painfully slow. Even if you have a GPU, you will not see an immediate improvement. This is because, by design, PyTorch may not use your GPU automatically. However, if you have a CUDA-capable GPU, you can improve the performance a lot by carefully moving the heavy computation away from your CPU.
A PyTorch model is a program of tensor calculations. The tensors can be stored on the GPU or on the CPU. An operation can be carried out as long as all of its operands are on the same device. In this particular example, the model weights (i.e., those of the LSTM layers and the fully connected layer) can be moved to the GPU. Once you do so, the input must also be moved to the GPU before execution, and the output will be stored on the GPU unless you move it back.
In PyTorch, you can check if you have a CUDA-capable GPU using the following function:
torch.cuda.is_available()
It returns a boolean to indicate if you can use GPU, which in turn, depends on the hardware model you have, whether your OS has the appropriate library installed, and whether your PyTorch is compiled with corresponding GPU support. If everything works in concert, you can create a device and assign your model to it:
device = torch.device("cuda:0")
model.to(device)
If your model is running on a CUDA device but your input tensor is not, PyTorch will complain about that and fail to proceed. To move your tensor to the CUDA device, you should run something like the following:
y_pred = model(X_batch.to(device))
The .to(device) part does the magic. But remember that the y_pred produced above will also be on the CUDA device, hence you should do the same when you run the loss function. Modifying the above program to make it capable of running on a GPU starts with choosing a device, which will be your GPU or fall back to CPU if no CUDA device is found. Afterward, .to(device) is added at several strategic locations to move the computation to the GPU.
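The device handling can be sketched as follows. The one-line fallback to CPU is a common idiom and an assumption about how the device is chosen here, and the small linear model with random tensors is a stand-in so that the snippet runs with or without a GPU.

```python
import torch
import torch.nn as nn

# use the first CUDA device if one is available, otherwise fall back to CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Toy stand-ins (assumptions): a tiny model and one random batch
model = nn.Sequential(nn.Flatten(), nn.Linear(100, 50))
model.to(device)                            # move the model weights
loss_fn = nn.CrossEntropyLoss(reduction="sum")
X_batch = torch.rand(8, 100, 1)
y_batch = torch.randint(0, 50, (8,))

# inputs and targets must be on the same device as the model
y_pred = model(X_batch.to(device))
loss = loss_fn(y_pred, y_batch.to(device))

print(device, loss.item())
```

Calling loss.item() copies the scalar back to the CPU as a Python float, so the printed value can be used without any further device handling.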
Further Readings
This character text model is a popular way of generating text using recurrent neural networks. Below are some more resources and tutorials on the topic if you are interested in going deeper.
Ilya Sutskever, James Martens, and Geoffrey Hinton. “Generating Text with Recurrent Neural Networks”. In: Proceedings of the 28th International Conference on Machine Learning. Bellevue, WA, USA, 2011.
Summary
In this post, you discovered how you can develop an LSTM recurrent neural network for text generation in PyTorch. After completing this post, you know:
How to find the text of classical books for free as a dataset for your machine learning model
How to train an LSTM network on text sequences
How to use an LSTM network to generate text sequences
How to optimize deep learning training in PyTorch using CUDA devices