Word embeddings are a modern approach for representing text in natural language processing.
Word embedding algorithms like word2vec and GloVe are key to the state-of-the-art results achieved by neural network models on natural language processing problems like machine translation.
In this tutorial, you will discover how to train and load word embedding models for natural language processing applications in Python using Gensim.
After completing this tutorial, you will know:
- How to train your own word2vec word embedding model on text data.
- How to visualize a trained word embedding model using Principal Component Analysis.
- How to load pre-trained word2vec and GloVe word embedding models from Google and Stanford.
Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
Tutorial Overview
This tutorial is divided into 6 parts; they are:
- Word Embeddings
- Gensim Library
- Develop Word2Vec Embedding
- Visualize Word Embedding
- Load Google’s Word2Vec Embedding
- Load Stanford’s GloVe Embedding
Word Embeddings
A word embedding is an approach to provide a dense vector representation of words that captures something about their meaning.
Word embeddings are an improvement over simpler bag-of-words encoding schemes, such as word counts and frequencies, which result in large, sparse vectors (mostly 0 values) that describe documents but not the meaning of the words.
Word embeddings work by using an algorithm to train a set of fixed-length dense and continuous-valued vectors based on a large corpus of text. Each word is represented by a point in the embedding space and these points are learned and moved around based on the words that surround the target word.
Defining a word by the company it keeps is what allows a word embedding to learn something about the meaning of words. The vector space representation of the words provides a projection where words with similar meanings are locally clustered within the space.
The use of word embeddings over other text representations is one of the key methods that has led to breakthrough performance with deep neural networks on problems like machine translation.
In this tutorial, we are going to look at how to use two different word embedding methods called word2vec by researchers at Google and GloVe by researchers at Stanford.
Gensim Python Library
Gensim is an open source Python library for natural language processing, with a focus on topic modeling.
It is billed as:
topic modelling for humans
Gensim was developed and is maintained by the Czech natural language processing researcher Radim Řehůřek and his company RaRe Technologies.
It is not an everything-including-the-kitchen-sink NLP research library (like NLTK); instead, Gensim is a mature, focused, and efficient suite of NLP tools for topic modeling. Most notably for this tutorial, it supports an implementation of the Word2Vec word embedding for learning new word vectors from text.
It also provides tools for loading pre-trained word embeddings in a few formats, and for making use of and querying a loaded embedding.
We will use the Gensim library in this tutorial.
If you do not have a Python environment set up, you can use this tutorial:
Gensim can be installed easily using pip or easy_install.
For example, you can install Gensim with pip by typing the following on your command line:
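```
pip install gensim
```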
If you need help installing Gensim on your system, you can see the Gensim Installation Instructions.
Develop Word2Vec Embedding
Word2vec is one algorithm for learning a word embedding from a text corpus.
There are two main training algorithms that can be used to learn the embedding from text; they are continuous bag-of-words (CBOW) and skip-gram.
We will not get into the algorithms other than to say that they generally look at a window of words for each target word to provide context and in turn meaning for words. The approach was developed by Tomas Mikolov, formerly at Google and currently at Facebook.
Word2Vec models require a lot of text, e.g. the entire Wikipedia corpus. Nevertheless, we will demonstrate the principles using a small in-memory example of text.
Gensim provides the Word2Vec class for working with a Word2Vec model.
Learning a word embedding from text involves loading and organizing the text into sentences and providing them to the constructor of a new Word2Vec() instance. For example:
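```python
from gensim.models import Word2Vec

# a minimal sketch: "sentences" stands in for your own tokenized data,
# i.e. a list of sentences where each sentence is a list of word tokens
sentences = [['this', 'is', 'a', 'sentence'], ['this', 'is', 'another', 'sentence']]

# train a new model on the sentences (min_count=1 so this tiny toy corpus is not filtered out)
model = Word2Vec(sentences, min_count=1)
```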
Specifically, each sentence must be tokenized, meaning divided into words and prepared (e.g. perhaps pre-filtered and perhaps converted to a preferred case).
The sentences could be text loaded into memory, or an iterator that progressively loads text, which is required for very large text corpora.
There are many parameters on this constructor; a few noteworthy arguments you may wish to configure are:
- size: (default 100) The number of dimensions of the embedding, e.g. the length of the dense vector to represent each token (word).
- window: (default 5) The maximum distance between a target word and words around the target word.
- min_count: (default 5) The minimum count of words to consider when training the model; words with an occurrence less than this count will be ignored.
- workers: (default 3) The number of threads to use while training.
- sg: (default 0 or CBOW) The training algorithm, either CBOW (0) or skip-gram (1).
The defaults are often good enough when just getting started. If you have a lot of cores, as most modern computers do, I strongly encourage you to increase workers to match the number of cores (e.g. 8).
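As a rough sketch, a configured constructor call might look like the following (the values are illustrative, and the argument names follow the Gensim 3.x API listed above; in Gensim 4 and later, size is named vector_size):

```python
# hypothetical configuration: 100-dimensional vectors, a window of 5 words,
# keep every word, 8 worker threads, and skip-gram training
model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=8, sg=1)
```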
After the model is trained, it is accessible via the “wv” attribute. This is the actual word vector model against which queries can be made.
For example, you can print the learned vocabulary of tokens (words) as follows:
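```python
# list the learned vocabulary (Gensim 3.x; in Gensim 4+ use model.wv.key_to_index instead)
words = list(model.wv.vocab)
print(words)
```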
You can review the embedded vector for a specific token as follows:
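```python
# access the learned vector for one token (the token must be in the vocabulary)
print(model.wv['sentence'])
```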
Finally, the word vectors of a trained model can be saved to file by calling the save_word2vec_format() function on the word vector model.
The vectors can be saved in a binary format to save space by setting binary=True. For example:
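```python
# save the word vectors in the word2vec binary format ('model.bin' is a placeholder filename)
model.wv.save_word2vec_format('model.bin', binary=True)
```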
When getting started, you can save the learned model in ASCII format and review the contents.
You can do this by setting binary=False when calling the save_word2vec_format() function, for example:
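```python
# save the word vectors in plain-text (ASCII) word2vec format so the contents can be inspected
model.wv.save_word2vec_format('model.txt', binary=False)
```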
A model saved with the save() function can then be loaded again by calling the Word2Vec.load() function. For example:
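```python
# save the complete model and load it back again ('model.bin' is a placeholder filename)
model.save('model.bin')
new_model = Word2Vec.load('model.bin')
```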
We can tie all of this together with a worked example.
Rather than loading a large text document or corpus from file, we will work with a small, in-memory list of pre-tokenized sentences. The model is trained and the minimum count for words is set to 1 so that no words are ignored.
After the model is learned, we summarize it, print the vocabulary, and then print a single vector for the word 'sentence'.
Finally, the model is saved to a file in binary format, loaded, and then summarized.
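The listing below is one way such an example could look; the toy sentences are purely illustrative, and the code follows the Gensim 3.x API used throughout this tutorial.

```python
from gensim.models import Word2Vec

# a small, illustrative pre-tokenized corpus (each sentence is a list of tokens)
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]
# train the model; min_count=1 so that no words are ignored
model = Word2Vec(sentences, min_count=1)
# summarize the trained model
print(model)
# summarize the vocabulary (Gensim 3.x; use model.wv.key_to_index in Gensim 4+)
words = list(model.wv.vocab)
print(words)
# access the vector for one word
print(model.wv['sentence'])
# save the complete model
model.save('model.bin')
# load the model back and summarize it
new_model = Word2Vec.load('model.bin')
print(new_model)
```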
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Running the example prints a summary of the trained model, the learned vocabulary, the embedding vector for the word 'sentence', and then a summary of the model after it has been saved and loaded again.
You can see that with a little work to prepare your text document, you can create your own word embedding very easily with Gensim.
Visualize Word Embedding
After you learn a word embedding for your text data, it can be nice to explore it with visualization.
You can use classical projection methods to reduce the high-dimensional word vectors to two-dimensional plots and plot them on a graph.
The visualizations can provide a qualitative diagnostic for your learned model.
We can retrieve all of the vectors from a trained model as follows:
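```python
# stack the vector for every word in the vocabulary into one array
# (Gensim 3.x; in Gensim 4+ iterate over model.wv.key_to_index instead)
words = list(model.wv.vocab)
X = model.wv[words]
```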
We can then train a projection method on the vectors, such as those offered in scikit-learn, and then use matplotlib to plot the projection as a scatter plot.
Let’s look at an example with Principal Component Analysis or PCA.
Plot Word Vectors Using PCA
We can create a 2-dimensional PCA model of the word vectors using the scikit-learn PCA class as follows.
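```python
from sklearn.decomposition import PCA

# fit a 2-component PCA model and project the word vectors (X, retrieved above) into 2D
pca = PCA(n_components=2)
result = pca.fit_transform(X)
```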
The resulting projection can be plotted using matplotlib as follows, pulling out the two dimensions as x and y coordinates.
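```python
from matplotlib import pyplot

# scatter plot of the 2D projection
pyplot.scatter(result[:, 0], result[:, 1])
pyplot.show()
```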
We can go one step further and annotate the points on the graph with the words themselves. A crude version without any nice offsets looks as follows.
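```python
# label each point with its word (no offsets, so labels may overlap)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
```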
Putting this all together with the model from the previous section, the complete example is listed below.
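Again, the listing is a sketch along those lines, using the same illustrative toy corpus and the Gensim 3.x API as before.

```python
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot

# the same small, illustrative pre-tokenized corpus as in the previous section
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]
# train the model
model = Word2Vec(sentences, min_count=1)
# retrieve all of the word vectors
words = list(model.wv.vocab)
X = model.wv[words]
# project the vectors into 2D with PCA
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection and annotate each point with its word
pyplot.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()
```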
Running the example creates a scatter plot with the dots annotated with the words.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
It is hard to pull much meaning out of the graph given that such a tiny corpus was used to fit the model.
Load Google’s Word2Vec Embedding
Training your own word vectors may be the best approach for a given NLP problem.
But it can take a long time, a fast computer with a lot of RAM and disk space, and perhaps some expertise in finessing the input data and training algorithm.
An alternative is to simply use an existing pre-trained word embedding.
Along with the paper and code for word2vec, Google also published a pre-trained word2vec model on the Word2Vec Google Code Project.
A pre-trained model is nothing more than a file containing tokens and their associated word vectors. The pre-trained Google word2vec model was trained on Google news data (about 100 billion words); it contains 3 million words and phrases and was fit using 300-dimensional word vectors.
It is a 1.53-gigabyte file. You can download it from here:
Unzipped, the binary file (GoogleNews-vectors-negative300.bin) is 3.4 gigabytes.
The Gensim library provides tools to load this file. Specifically, you can call the KeyedVectors.load_word2vec_format() function to load this model into memory, for example:
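```python
from gensim.models import KeyedVectors

# assumes the unzipped Google News vectors file sits in the working directory
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)
```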
On my modern workstation, it takes about 43 seconds to load.
Another interesting thing that you can do is a little linear algebra arithmetic with word vectors.
For example, a popular example described in lectures and introductory papers is:
(king - man) + woman = queen
That is, the word queen is the closest word given the subtraction of the notion of man from king and the addition of the word woman. The “man-ness” in king is replaced with “woman-ness” to give us queen. A very cool concept.
Gensim provides an interface for performing these types of operations in the most_similar() function on the trained or loaded model.
For example:
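```python
# (king - man) + woman = ?; topn=1 returns only the single closest word
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
```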
We can put all of this together as follows.
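```python
from gensim.models import KeyedVectors

# load the pre-trained Google News word2vec model
# (assumes the file is in the working directory)
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
```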
Running the example loads the Google pre-trained word2vec model and then calculates the (king – man) + woman = ? operation on the word vectors for those words.
The answer, as we would expect, is queen.
See some of the posts in the further reading section for more interesting arithmetic examples that you can explore.
Load Stanford’s GloVe Embedding
Stanford researchers also have their own word2vec-like word embedding algorithm called Global Vectors for Word Representation, or GloVe for short.
I won’t get into the details of the differences between word2vec and GloVe here, but generally, NLP practitioners seem to prefer GloVe at the moment based on results.
Like word2vec, the GloVe researchers also provide pre-trained word vectors, in this case, a great selection to choose from.
You can download the GloVe pre-trained word vectors and load them easily with Gensim.
The first step is to convert the GloVe file format to the word2vec file format. The only difference is the addition of a small header line. This can be done by calling the glove2word2vec() function. For example:
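```python
from gensim.scripts.glove2word2vec import glove2word2vec

# the input and output filenames here are placeholders
glove_input_file = 'glove.txt'
word2vec_output_file = 'word2vec.txt'
glove2word2vec(glove_input_file, word2vec_output_file)
```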
Once converted, the file can be loaded just like the word2vec file above.
Let’s make this concrete with an example.
You can download the smallest GloVe pre-trained model from the GloVe website. It is an 822-megabyte zip file containing four different models (50-, 100-, 200- and 300-dimensional vectors) trained on Wikipedia data with 6 billion tokens and a 400,000-word vocabulary.
The direct download link is here:
Working with the 100-dimensional version of the model, we can convert the file to word2vec format as follows:
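```python
from gensim.scripts.glove2word2vec import glove2word2vec

# assumes glove.6B.100d.txt has been unzipped into the working directory
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)
```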
You now have a copy of the GloVe model in word2vec format with the filename glove.6B.100d.txt.word2vec.
Now we can load it and perform the same (king – man) + woman = ? test as in the previous section. The complete code listing is provided below. Note that the converted file is ASCII format, not binary, so we set binary=False when loading.
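A sketch of that listing, assuming the converted file from the previous step is in the working directory:

```python
from gensim.models import KeyedVectors

# load the converted Stanford GloVe model (plain text, so binary=False)
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
```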
Running the example prints the same result of ‘queen’.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Gensim
- gensim Python Library
- models.word2vec gensim API
- models.keyedvectors gensim API
- scripts.glove2word2vec gensim API
Posts
- Messing Around With Word2vec, 2016
- Vector Space Models for the Digital Humanities, 2015
- Gensim Word2vec Tutorial, 2014
Summary
In this tutorial, you discovered how to develop and load word embedding models in Python using Gensim.
Specifically, you learned:
- How to train your own word2vec word embedding model on text data.
- How to visualize a trained word embedding model using Principal Component Analysis.
- How to load pre-trained word2vec and GloVe word embedding models from Google and Stanford.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.