Language modeling involves predicting the next word in a sequence given the sequence of words already present.
A language model is a key element in many natural language processing models such as machine translation and speech recognition. The choice of how the language model is framed must match how the language model is intended to be used.
In this tutorial, you will discover how the framing of a language model affects the skill of the model when generating short sequences from a nursery rhyme.
After completing this tutorial, you will know:
- The challenge of developing a good framing of a word-based language model for a given application.
- How to develop one-word, two-word, and line-based framings for word-based language models.
- How to generate sequences using a fit language model.
Let’s get started.
Tutorial Overview
This tutorial is divided into 5 parts; they are:
- Framing Language Modeling
- Jack and Jill Nursery Rhyme
- Model 1: One-Word-In, One-Word-Out Sequences
- Model 2: Line-by-Line Sequence
- Model 3: Two-Words-In, One-Word-Out Sequence
Framing Language Modeling
A statistical language model is learned from raw text and predicts the probability of the next word in the sequence given the words already present in the sequence.
Language models are a key component in larger models for challenging natural language processing problems, like machine translation and speech recognition. They can also be developed as standalone models and used for generating new sequences that have the same statistical properties as the source text.
Language models both learn and predict one word at a time. Training the network involves providing sequences of words as input, processed one at a time, with a prediction made and learned for each input sequence.
Similarly, when making predictions, the process can be seeded with one or a few words, then predicted words can be gathered and presented as input on subsequent predictions in order to build up a generated output sequence.
Therefore, each model will involve splitting the source text into input and output sequences, such that the model can learn to predict words.
There are many ways to frame the sequences from a source text for language modeling.
In this tutorial, we will explore 3 different ways of developing word-based language models in the Keras deep learning library.
There is no single best approach, just different framings that may suit different applications.
Jack and Jill Nursery Rhyme
Jack and Jill is a simple nursery rhyme.
It comprises 4 lines, as follows:
Jack and Jill went up the hill
To fetch a pail of water
Jack fell down and broke his crown
And Jill came tumbling after
We will use this as our source text for exploring different framings of a word-based language model.
We can define this text in Python as follows:
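A minimal sketch of that definition, assuming the rhyme is stored in a single newline-separated string. The variable names data, lines, and words are illustrative choices:

```python
# The four lines of the rhyme as one newline-separated string.
data = (
    "Jack and Jill went up the hill\n"
    "To fetch a pail of water\n"
    "Jack fell down and broke his crown\n"
    "And Jill came tumbling after\n"
)

lines = data.strip().split("\n")   # one entry per line of the rhyme
words = data.lower().split()       # lowercase word tokens across all lines

print(len(lines), "lines,", len(words), "words")
```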
Model 1: One-Word-In, One-Word-Out Sequences
We can start with a very simple model.
Given one word as input, the model will learn to predict the next word in the sequence.
For example:
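Using the first line of the rhyme, the framing can be sketched in plain Python. Each printed pair is one input word and the word that follows it; the variable names are illustrative:

```python
# One-word-in, one-word-out pairs from the first line of the rhyme.
words = "Jack and Jill went up the hill".split()

# Pair every word with the word that comes directly after it.
pairs = list(zip(words[:-1], words[1:]))

for x, y in pairs:
    print(f"X: {x:5s}  y: {y}")
```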
The first step is to encode the text as integers.
Each lowercase word in the source text is assigned a unique integer and we can convert the sequences of words to sequences of integers.
Keras provides the Tokenizer class that can be used to perform this encoding. First, the Tokenizer is fit on the source text to develop the mapping from words to unique integers. Then sequences of text can be converted to sequences of integers by calling the texts_to_sequences() function.
We will need to know the size of the vocabulary later for both defining the word embedding layer in the model, and for encoding output words using a one hot encoding.
The size of the vocabulary can be retrieved from the trained Tokenizer by accessing the word_index attribute.
Running this example, we can see that the size of the vocabulary is 21 words.
We add one because the words are encoded with the integers 1 to 21, and the largest of these must be usable as an array index: array indices 0 to 21 give 22 positions.
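The encoding step can be sketched without Keras to show what is going on. This is a plain-Python stand-in for the Tokenizer, not the Keras API itself: fit_word_index is a hypothetical helper, and the particular integer given to each word (most frequent word first) is an assumption of the sketch.

```python
from collections import Counter

data = (
    "Jack and Jill went up the hill\n"
    "To fetch a pail of water\n"
    "Jack fell down and broke his crown\n"
    "And Jill came tumbling after\n"
)

def fit_word_index(text):
    """Assign each lowercase word a unique integer, starting at 1."""
    counts = Counter(text.lower().split())
    # most_common() lists words from most to least frequent.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

word_index = fit_word_index(data)

# Convert the whole text to a sequence of integers.
encoded = [word_index[w] for w in data.lower().split()]

# Integer 0 is never assigned to a word, so one extra position is needed.
vocab_size = len(word_index) + 1
print("Vocabulary Size:", vocab_size)
```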
Next, we need to create sequences of words to fit the model with one word as input and one word as output.
Running this piece shows that we have a total of 24 input-output pairs to train the network.
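A sketch of building those pairs, assuming the same illustrative plain-Python encoding (a stand-in for the Keras Tokenizer, so the integer values themselves are an assumption):

```python
from collections import Counter

data = (
    "Jack and Jill went up the hill\n"
    "To fetch a pail of water\n"
    "Jack fell down and broke his crown\n"
    "And Jill came tumbling after\n"
)

# Illustrative encoding: most frequent word gets 1, and so on.
words = data.lower().split()
word_index = {w: i for i, (w, _) in enumerate(Counter(words).most_common(), start=1)}
encoded = [word_index[w] for w in words]

# Each row holds two integers: [input word, output word].
sequences = [encoded[i - 1:i + 1] for i in range(1, len(encoded))]
print("Total Sequences:", len(sequences))
```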
We can then split the sequences into input (X) and output elements (y). This is straightforward as we only have two columns in the data.
We will fit our model to predict a probability distribution across all words in the vocabulary. That means that we need to turn the output element from a single integer into a one hot encoding with a 0 for every word in the vocabulary and a 1 for the actual word that the value represents. This gives the network a ground truth to aim for from which we can calculate error and update the model.
Keras provides the to_categorical() function that we can use to convert the integer to a one hot encoding while specifying the number of classes as the vocabulary size.
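The split and the one hot encoding can be illustrated in plain Python. Here to_one_hot is a hypothetical helper standing in for the Keras function, and the three integer pairs are made-up sample rows:

```python
def to_one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    row = [0] * num_classes
    row[index] = 1
    return row

vocab_size = 22                          # 21 words plus the unused index 0
sequences = [[2, 1], [1, 3], [3, 4]]     # illustrative [input, output] pairs

# First column is the input, second column is the output.
X = [seq[0] for seq in sequences]
y = [to_one_hot(seq[1], vocab_size) for seq in sequences]

print(X)
print(y[0])
```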
We are now ready to define the neural network model.
The model uses a learned word embedding in the input layer. This has one real-valued vector for each word in the vocabulary, where each word vector has a specified length. In this case we will use a 10-dimensional projection. The input sequence contains a single word, therefore the input_length=1.
The model has a single hidden LSTM layer with 50 units. This is far more than is needed. The output layer has one neuron for each word in the vocabulary and uses a softmax activation function to ensure the output is normalized to look like a probability distribution.
The structure of the network can be summarized as follows:
We will use this same general network structure for each example in this tutorial, with minor changes to the learned embedding layer.
Next, we can compile and fit the network on the encoded text data. Technically, we are modeling a multi-class classification problem (predict the word in the vocabulary), therefore using the categorical cross entropy loss function. We use the efficient Adam implementation of gradient descent and track accuracy at the end of each epoch. The model is fit for 500 training epochs, again, perhaps more than is needed.
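To make the loss concrete, the following sketch computes a softmax output and the categorical cross entropy for a single training example in plain Python. The three-word vocabulary and the raw scores are made up for illustration, and the helper functions are not part of Keras:

```python
import math

def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(target, probs):
    """Loss for one example: minus the log-probability of the true class."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t)

scores = [2.0, 0.5, -1.0]   # made-up raw outputs for a 3-word vocabulary
target = [1, 0, 0]          # one hot encoding: the first word is correct

probs = softmax(scores)
loss = categorical_cross_entropy(target, probs)
print([round(p, 3) for p in probs], round(loss, 3))
```

The loss shrinks toward zero as the probability assigned to the correct word approaches 1, which is what training pushes the network toward.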
The network configuration was not tuned for this and later experiments; an over-prescribed configuration was chosen to ensure that we could focus on the framing of the language model.
After the model is fit, we test it by passing it a given word from the vocabulary and having the model predict the next word. Here we pass in ‘Jack’ by encoding it and calling model.predict_classes() to get the integer output for the predicted word (in recent TensorFlow releases, where predict_classes() has been removed, taking the argmax of the output of model.predict() gives the same integer). This is then looked up in the vocabulary mapping to give the associated word.
This process could then be repeated a few times to build up a generated sequence of words.
To make this easier, we wrap up the behavior in a function that we can call by passing in our model and the seed word.
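The generation loop itself can be sketched independently of Keras. In the sketch below, a simple lookup table (the first word seen after each word in the rhyme) stands in for the trained network, and the generate_seq signature is an illustrative assumption:

```python
data = (
    "Jack and Jill went up the hill\n"
    "To fetch a pail of water\n"
    "Jack fell down and broke his crown\n"
    "And Jill came tumbling after\n"
)
words = data.lower().split()

# Stand-in for the trained model: remember the first word seen after each word.
first_follower = {}
for current, following in zip(words[:-1], words[1:]):
    first_follower.setdefault(current, following)

def generate_seq(predict_next, seed_word, n_words):
    """Generate n_words after seed_word, feeding each prediction back in."""
    in_word, result = seed_word, [seed_word]
    for _ in range(n_words):
        out_word = predict_next(in_word)
        result.append(out_word)
        in_word = out_word  # the prediction becomes the next input
    return " ".join(result)

generated = generate_seq(first_follower.get, "jack", 6)
print(generated)
```

With a real model, predict_next would encode the word, call the network, and map the predicted integer back to a word.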
We can tie all of this together. The complete code listing is provided below.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Running the example prints the loss and accuracy each training epoch.
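We can see that the model does not memorize the source sequences, likely because there is some ambiguity in the input sequences. For example, ‘jack’ is followed by ‘and’ in the first line but by ‘fell’ in the third, and ‘jill’ is followed by both ‘went’ and ‘came’.
And so on.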
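The ambiguous inputs can be listed directly from the source text. This short sketch in plain Python collects, for every word, the set of distinct words that follow it and keeps those with more than one follower; the variable names are illustrative:

```python
from collections import defaultdict

data = (
    "Jack and Jill went up the hill\n"
    "To fetch a pail of water\n"
    "Jack fell down and broke his crown\n"
    "And Jill came tumbling after\n"
)
words = data.lower().split()

# Collect the distinct words that follow each word in the text.
followers = defaultdict(set)
for current, following in zip(words[:-1], words[1:]):
    followers[current].add(following)

# An input is ambiguous if more than one output word has been seen for it.
ambiguous = {w: sorted(nxt) for w, nxt in followers.items() if len(nxt) > 1}

for word, options in ambiguous.items():
    print(word, "=>", options)
```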
At the end of the run, ‘Jack’ is passed in and a prediction or new sequence is generated.
We get a reasonable sequence as output that has some elements of the source.
This is a good first cut language model, but does not take full advantage of the LSTM’s ability to handle sequences of input and disambiguate some of the ambiguous pairwise sequences by using a broader context.
Model 2: Line-by-Line Sequence
Another approach is to split up the source text line-by-line, then break each line down into a series of words that build up.
For example:
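Taking the first line of the rhyme, the build-up of inputs can be sketched in plain Python. The underscore is an illustrative placeholder for padding, and each row pairs a growing input with the next word:

```python
line = "Jack and Jill went up the hill".split()
width = len(line) - 1  # longest possible input for this line

rows = []
for i in range(1, len(line)):
    context = ["_"] * (width - i) + line[:i]   # left-pad with placeholders
    rows.append((context, line[i]))

for context, target in rows:
    print(", ".join(context), "=>", target)
```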
This approach may allow the model to use the context of each line to help the model in those cases where a simple one-word-in-and-out model creates ambiguity.
In this case, this comes at the cost of predicting words across lines, which might be fine for now if we are only interested in modeling and generating lines of text.
Note that in this representation, we will require a padding of sequences to ensure they meet a fixed length input. This is a requirement when using Keras.
First, we can create the sequences of integers, line-by-line by using the Tokenizer already fit on the source text.
Next, we can pad the prepared sequences. We can do this using the pad_sequences() function provided in Keras. This first involves finding the longest sequence, then using that as the length by which to pad-out all other sequences.
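A sketch of both steps in plain Python, standing in for the Keras Tokenizer and pad_sequences() calls. The integer encoding is illustrative, and the sketch assumes zeros are added on the left of shorter sequences:

```python
from collections import Counter

data = (
    "Jack and Jill went up the hill\n"
    "To fetch a pail of water\n"
    "Jack fell down and broke his crown\n"
    "And Jill came tumbling after\n"
)

# Illustrative encoding: most frequent word gets 1, and so on.
words = data.lower().split()
word_index = {w: i for i, (w, _) in enumerate(Counter(words).most_common(), start=1)}

# For each line, build every prefix of two or more words as integers.
sequences = []
for line in data.strip().split("\n"):
    encoded = [word_index[w] for w in line.lower().split()]
    for i in range(1, len(encoded)):
        sequences.append(encoded[:i + 1])
print("Total Sequences:", len(sequences))

# Left-pad with zeros to the length of the longest sequence.
max_length = max(len(seq) for seq in sequences)
padded = [[0] * (max_length - len(seq)) + seq for seq in sequences]
print("Max Sequence Length:", max_length)
```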
Next, we can split the sequences into input and output elements, much like before.
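As a small illustration of the split, assume a few made-up padded rows in which the last column is the output word and everything before it is the input:

```python
# A few illustrative padded rows; the last column is the output word.
padded = [
    [0, 0, 0, 0, 0, 2, 1],
    [0, 0, 0, 0, 2, 1, 3],
    [0, 0, 0, 2, 1, 3, 4],
]
max_length = len(padded[0])

X = [row[:-1] for row in padded]   # all columns except the last
y = [row[-1] for row in padded]    # the last column only

print(len(X[0]), "input columns;", "outputs:", y)
```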
The model can then be defined as before, except the input sequences are now longer than a single word. Specifically, they are max_length-1 in length, -1 because when we calculated the maximum length of sequences, they included the input and output elements.
We can use the model to generate new sequences as before. The generate_seq() function can be updated to build up an input sequence by adding predictions to the list of input words each iteration.
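The updated loop can be sketched without Keras. Here a lookup table from line prefixes to the next word stands in for the trained network, words are used instead of integers for readability, and pad_context, predict_next, and the generate_seq signature are all hypothetical names for this sketch:

```python
data = (
    "Jack and Jill went up the hill\n"
    "To fetch a pail of water\n"
    "Jack fell down and broke his crown\n"
    "And Jill came tumbling after\n"
)
lines = [line.lower().split() for line in data.strip().split("\n")]
max_length = max(len(line) for line in lines)

# Stand-in for the trained model: map each line prefix to the word after it.
next_word = {}
for line in lines:
    for i in range(1, len(line)):
        next_word.setdefault(tuple(line[:i]), line[i])

def pad_context(context, length, pad="_"):
    """Keep the last `length` words and left-pad shorter contexts."""
    context = context[-length:]
    return [pad] * (length - len(context)) + context

def predict_next(padded_context, pad="_"):
    """Look up the next word for a padded context, ignoring the padding."""
    return next_word[tuple(w for w in padded_context if w != pad)]

def generate_seq(predict, seed_text, n_words, length):
    """Grow the input text by appending each predicted word to it."""
    context = seed_text.lower().split()
    for _ in range(n_words):
        context.append(predict(pad_context(context, length)))
    return " ".join(context)

generated = generate_seq(predict_next, "jack", 6, max_length - 1)
print(generated)
```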
Tying all of this together, the complete code example is provided below.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Running the example achieves a better fit on the source data. The added context has allowed the model to disambiguate some of the examples.
There are still two lines of text that start with ‘Jack’ that may still be a problem for the network.
At the end of the run, we generate two sequences with different seed words: ‘Jack’ and ‘Jill’.
The first generated line looks good, directly matching the source text. The second is a bit strange. This makes sense, because the network only ever saw ‘Jill’ within an input sequence, not at the beginning of one, so it has forced an output that uses the word ‘Jill’, i.e. the last line of the rhyme.
This was a good example of how the framing may result in better new lines, but not good partial lines of input.
Model 3: Two-Words-In, One-Word-Out Sequence
We can use an intermediate between the one-word-in and the whole-sentence-in approaches and pass in a sub-sequence of words as input.
This will provide a trade-off between the two framings allowing new lines to be generated and for generation to be picked up mid line.
We will use 2 words as input to predict one word as output. The preparation of the sequences is much like the first example, except with different offsets in the source sequence arrays, as follows:
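A sketch of the two-words-in, one-word-out framing, assuming the same illustrative plain-Python encoding used as a stand-in for the Keras Tokenizer:

```python
from collections import Counter

data = (
    "Jack and Jill went up the hill\n"
    "To fetch a pail of water\n"
    "Jack fell down and broke his crown\n"
    "And Jill came tumbling after\n"
)

# Illustrative encoding: most frequent word gets 1, and so on.
words = data.lower().split()
word_index = {w: i for i, (w, _) in enumerate(Counter(words).most_common(), start=1)}
encoded = [word_index[w] for w in words]

# Each row holds three integers: two input words, then the output word.
sequences = [encoded[i - 2:i + 1] for i in range(2, len(encoded))]
print("Total Sequences:", len(sequences))

X = [seq[:-1] for seq in sequences]   # two input words
y = [seq[-1] for seq in sequences]    # one output word
```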
The complete example is listed below.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Running the example again gets a good fit on the source text at around 95% accuracy.
We look at 4 generation examples, two start of line cases and two starting mid line.
The first start of line case generated correctly, but the second did not. The second case was an example from the 4th line, which is ambiguous with content from the first line. Perhaps a further expansion to 3 input words would be better.
The two mid-line generation examples were generated correctly, matching the source text.
We can see that the choice of how the language model is framed must be compatible with the requirements on how the model will be used. Careful design is required when using language models in general, perhaps followed up by spot testing with sequence generation to confirm that the requirements have been met.
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
- Whole Rhyme as Sequence. Consider updating one of the above examples to build up the entire rhyme as an input sequence. The model should be able to generate the entire thing given the seed of the first word; demonstrate this.
- Pre-Trained Embeddings. Explore using pre-trained word vectors in the embedding instead of learning the embedding as part of the model. This would not be required on such a small source text, but could be good practice.
- Character Models. Explore the use of a character-based language model for the source text instead of the word-based approach demonstrated in this tutorial.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
- Jack and Jill on Wikipedia
- Language Model on Wikipedia
- Keras Embedding Layer API
- Keras Text Processing API
- Keras Sequence Processing API
- Keras Utils API
Summary
In this tutorial, you discovered how to develop different word-based language models for a simple nursery rhyme.
Specifically, you learned:
- The challenge of developing a good framing of a word-based language model for a given application.
- How to develop one-word, two-word, and line-based framings for word-based language models.
- How to generate sequences using a fit language model.