Word embeddings provide a dense representation of words and their relative meanings.
They are an improvement over the sparse representations used in simpler bag-of-words models.
Word embeddings can be learned from text data and reused across projects. They can also be learned as part of fitting a neural network on text data.
In this tutorial, you will discover how to use word embeddings for deep learning in Python with Keras.
After completing this tutorial, you will know:
- About word embeddings and that Keras supports word embeddings via the Embedding layer.
- How to learn a word embedding while fitting a neural network.
- How to use a pre-trained word embedding in a neural network.
Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
- Updated Feb/2018: Fixed a bug due to a change in the underlying APIs.
- Updated Oct/2019: Updated for Keras 2.3 and TensorFlow 2.0.
Tutorial Overview
This tutorial is divided into 4 parts; they are:
- Word Embedding
- Keras Embedding Layer
- Example of Learning an Embedding
- Example of Using Pre-Trained GloVe Embedding
Need help with Deep Learning for Text Data?
Take my free 7-day email crash course now (with code).
Click to sign-up and also get a free PDF Ebook version of the course.
1. Word Embedding
A word embedding is a class of approaches for representing words and documents using a dense vector representation.
It is an improvement over the more traditional bag-of-words encoding schemes, where large sparse vectors were used to represent each word, or to score each word within a vector, in order to represent an entire vocabulary. These representations were sparse because the vocabularies were vast and a given word or document would be represented by a large vector comprised mostly of zero values.
Instead, in an embedding, words are represented by dense vectors where a vector represents the projection of the word into a continuous vector space.
The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.
The position of a word in the learned vector space is referred to as its embedding.
Two popular methods of learning word embeddings from text are:
- Word2Vec.
- GloVe.
In addition to these carefully designed methods, a word embedding can be learned as part of a deep learning model. This can be a slower approach, but tailors the model to a specific training dataset.
2. Keras Embedding Layer
Keras offers an Embedding layer that can be used for neural networks on text data.
It requires that the input data be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.
The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.
It is a flexible layer that can be used in a variety of ways, such as:
- It can be used alone to learn a word embedding that can be saved and used in another model later.
- It can be used as part of a deep learning model where the embedding is learned along with the model itself.
- It can be used to load a pre-trained word embedding model, a type of transfer learning.
The Embedding layer is defined as the first hidden layer of a network. It must specify 3 arguments:
- input_dim: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.
- output_dim: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.
- input_length: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 1000 words, this would be 1000.
For example, below we define an Embedding layer with a vocabulary of 200 (e.g. integer encoded words from 0 to 199, inclusive), a vector space of 32 dimensions in which words will be embedded, and input documents that have 50 words each.
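A minimal sketch of this layer definition, assuming the tensorflow.keras API of Keras 2.x (where the Embedding layer accepts an input_length argument), might look like the following:

```python
from tensorflow.keras.layers import Embedding

# vocabulary of 200 words, 32-dimensional vectors, 50-word input documents
e = Embedding(input_dim=200, output_dim=32, input_length=50)
```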
The Embedding layer has weights that are learned. If you save your model to file, this will include weights for the Embedding layer.
The output of the Embedding layer is a 2D matrix with one embedding for each word in the input sequence of words (input document).
If you wish to connect a Dense layer directly to an Embedding layer, you must first flatten the 2D output matrix to a 1D vector using the Flatten layer.
Now, let’s see how we can use an Embedding layer in practice.
3. Example of Learning an Embedding
In this section, we will look at how we can learn a word embedding while fitting a neural network on a text classification problem.
We will define a small problem where we have 10 text documents, each with a comment about a piece of work a student submitted. Each text document is classified as positive “1” or negative “0”. This is a simple sentiment analysis problem.
First, we will define the documents and their class labels.
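The exact wording of the documents is not important; a sketch with illustrative one-line comments might look like this:

```python
from numpy import array

# 10 short documents (illustrative examples), each a comment on a student's work
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# class labels: 1 = positive, 0 = negative
labels = array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
```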
Next, we can integer encode each document. This means that as input the Embedding layer will have sequences of integers. We could experiment with other, more sophisticated bag-of-words encodings, like counts or TF-IDF.
Keras provides the one_hot() function that hashes each word to an integer, giving an efficient encoding. We will estimate a vocabulary size of 50, which is much larger than needed, to reduce the probability of collisions from the hash function.
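A sketch of this encoding step, assuming the docs list defined above:

```python
from tensorflow.keras.preprocessing.text import one_hot

# hash each word in each document to an integer, using an estimated vocabulary size of 50
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)
```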
The sequences have different lengths, and Keras prefers inputs to be vectorized and all inputs to have the same length. We will pad all input sequences to a length of 4. Again, we can do this with a built-in Keras function, in this case the pad_sequences() function.
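For example, assuming the encoded documents from the previous step:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

# pad (or truncate) every document to exactly 4 integers
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
```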
We are now ready to define our Embedding layer as part of our neural network model.
The Embedding layer has a vocabulary of 50 and an input length of 4. We will choose a small embedding space of 8 dimensions.
The model is a simple binary classification model. Importantly, the output from the Embedding layer will be 4 vectors of 8 dimensions each, one for each word. We flatten this to a single 32-element vector to pass on to the Dense output layer.
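A sketch of the model definition, assuming the tensorflow.keras API:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

# Embedding (50 words x 8 dimensions) -> Flatten -> Dense output
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```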
Finally, we can fit and evaluate the classification model.
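For example, fitting on the padded documents and evaluating on the same data (the number of epochs is a choice you can tune):

```python
# fit the model on the padded documents
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model on the training data
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy * 100))
```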
The complete code listing is provided below.
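The following consolidated sketch ties the steps above together; the documents are illustrative and the tensorflow.keras API is assumed:

```python
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

# define documents and their class labels (1 = positive, 0 = negative)
docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
        'Weak', 'Poor effort!', 'not good', 'poor work', 'Could have done better.']
labels = array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
# integer encode the documents with a hashing trick
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
# fit and evaluate the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy * 100))
```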
Running the example first prints the integer encoded documents.
Then the padded versions of each document are printed, making them all uniform length.
After the network is defined, a summary of the layers is printed. We can see that as expected, the output of the Embedding layer is a 4×8 matrix and this is squashed to a 32-element vector by the Flatten layer.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Finally, the accuracy of the trained model is printed, showing that it learned the training dataset perfectly (which is not surprising).
You could save the learned weights from the Embedding layer to file for later use in other models.
You could also use this model generally to classify other documents that have the same kind of vocabulary seen in the training dataset.
Next, let’s look at loading a pre-trained word embedding in Keras.
4. Example of Using Pre-Trained GloVe Embedding
The Keras Embedding layer can also use a word embedding learned elsewhere.
It is common in the field of Natural Language Processing to learn, save, and make freely available word embeddings.
For example, the researchers behind the GloVe method provide a suite of pre-trained word embeddings on their website, released under a public domain license (see the GloVe Embedding link in the Further Reading section).
The smallest package of embeddings is 822MB, called “glove.6B.zip“. It was trained on a dataset of 6 billion tokens (words) with a vocabulary of 400 thousand words. There are a few different embedding vector sizes, including 50, 100, 200, and 300 dimensions.
You can download this collection of embeddings and seed the Keras Embedding layer with weights from the pre-trained embedding for the words in your training dataset.
This example is inspired by an example in the Keras project: pretrained_word_embeddings.py.
After downloading and unzipping, you will see a few files, one of which is “glove.6B.100d.txt“, which contains a 100-dimensional version of the embedding.
If you peek inside the file, you will see a token (word) followed by its weights (100 numbers) on each line. For example, the first line of the embedding ASCII text file is the embedding for “the“.
As in the previous section, the first step is to define the examples, encode them as integers, then pad the sequences to be the same length.
In this case, we need to be able to map words to integers as well as integers to words.
Keras provides a Tokenizer class that can be fit on the training data, can convert text to sequences consistently by calling the texts_to_sequences() method, and provides access to the dictionary mapping of words to integers via its word_index attribute.
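A sketch of this preparation, assuming the same illustrative docs and labels as in the previous section:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# build the word -> integer mapping from the training documents
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
# integer encode the documents and pad them to a fixed length of 4
encoded_docs = t.texts_to_sequences(docs)
padded_docs = pad_sequences(encoded_docs, maxlen=4, padding='post')
```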
Next, we need to load the entire GloVe word embedding file into memory as a dictionary of word to embedding array.
This is pretty slow. It might be better to filter the embedding for the unique words in your training data.
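A sketch of loading the file, assuming glove.6B.100d.txt sits in the current working directory:

```python
from numpy import asarray

# map each word to its 100-dimensional GloVe vector
embeddings_index = dict()
with open('glove.6B.100d.txt', mode='rt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))
```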
Next, we need to create a matrix of one embedding for each word in the training dataset. We can do that by enumerating all unique words in the Tokenizer.word_index and locating the embedding weight vector from the loaded GloVe embedding.
The result is a matrix of weights only for words we will see during training.
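For example:

```python
from numpy import zeros

# one row per word in the training vocabulary; row 0 is reserved for padding
embedding_matrix = zeros((vocab_size, 100))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
```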
Now we can define our model, fit, and evaluate it as before.
The key difference is that the Embedding layer can be seeded with the GloVe word embedding weights. We chose the 100-dimensional version, therefore the Embedding layer must be defined with output_dim set to 100. Finally, we do not want to update the learned word weights in this model, therefore we will set the trainable attribute of the Embedding layer to False.
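One way to do this is sketched below; passing weights=[embedding_matrix] to the layer is one option, and using an embeddings_initializer such as Constant(embedding_matrix) is another:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

# seed the Embedding layer with the GloVe weights and freeze it
model = Sequential()
model.add(Embedding(vocab_size, 100, weights=[embedding_matrix],
                    input_length=4, trainable=False))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
```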
The complete worked example is listed below.
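The following consolidated sketch assumes the illustrative documents from earlier, the tensorflow.keras API, and the file glove.6B.100d.txt in the working directory:

```python
from numpy import array, asarray, zeros
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

# define documents and their class labels (1 = positive, 0 = negative)
docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
        'Weak', 'Poor effort!', 'not good', 'poor work', 'Could have done better.']
labels = array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
# prepare the tokenizer and integer encode the documents
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
encoded_docs = t.texts_to_sequences(docs)
padded_docs = pad_sequences(encoded_docs, maxlen=4, padding='post')
# load the whole GloVe embedding into memory
embeddings_index = dict()
with open('glove.6B.100d.txt', mode='rt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = asarray(values[1:], dtype='float32')
print('Loaded %s word vectors.' % len(embeddings_index))
# create a weight matrix for the words in the training documents
embedding_matrix = zeros((vocab_size, 100))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
# define the model with a frozen, pre-seeded Embedding layer
model = Sequential()
model.add(Embedding(vocab_size, 100, weights=[embedding_matrix],
                    input_length=4, trainable=False))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
# fit and evaluate the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy * 100))
```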
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Running the example may take a bit longer, but it demonstrates that the model is just as capable of fitting this simple problem.
In practice, I would encourage you to experiment with learning your own word embedding, with using a fixed pre-trained embedding, and with continuing to train (fine-tune) on top of a pre-trained embedding.
See what works best for your specific problem.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
- Word Embedding on Wikipedia
- Keras Embedding Layer API
- Using pre-trained word embeddings in a Keras model, 2016
- Example of using a pre-trained GloVe Embedding in Keras
- GloVe Embedding
- An overview of word embeddings and their connection to distributional semantic models, 2016
- Deep Learning, NLP, and Representations, 2014
Summary
In this tutorial, you discovered how to use word embeddings for deep learning in Python with Keras.
Specifically, you learned:
- About word embeddings and that Keras supports word embeddings via the Embedding layer.
- How to learn a word embedding while fitting a neural network.
- How to use a pre-trained word embedding in a neural network.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.