Word2vec is arguably the most famous face of the neural network natural language processing revolution.
Word2vec provides direct access to vector representations of words, which can help achieve decent performance across a variety of tasks machines are historically bad at. For a quick examination of how word vectors work, check out my previous article about them.
Now we’re going to focus on how word2vec works in the first place. We’ll provide historical context explaining why this approach is so different, then delve into the network and how it works.
A Brief History
Natural language processing tasks have changed dramatically over the past few years, largely thanks to machine learning. As linguistics researchers in the ’80s moved away from transformational grammar — a theoretical approach that argues certain hard rules are responsible for generating grammatically-correct sentences — the floor opened for statistical approaches.
Many of these methods had early successes, but they still embraced a rules-based approach at some level. Decision trees, for example, essentially just discover different rules for stringing words together. Markov chains, another early and popular choice for text generation, rely on the probability of the next word given only the few words that came immediately before it. In other words, they create a rule for what word should come after a given sequence.
Naturally, this approach reached its limit. The probabilities it uncovers tell us nothing about what a word actually means. Generally speaking, statistical NLP hit a wall.
Around 2010, neural networks blew open natural language processing with state-of-the-art performance on text generation and analysis. Neural nets can discover non-linear, non-rules-based solutions, so they were able to capture the subtler qualities of human language by representing context. That is precisely what word2vec gives us.
How It Works
There are two main versions of word2vec: skip-gram and CBOW. We won’t worry about CBOW for now, since skip-gram achieves better performance on most tasks. (If you’re still interested, there’s plenty of material detailing that approach.)
Skip-gram attempts to predict the immediate neighbors of a given word. It might look one word ahead and one behind, two words in each direction, or use some other window size. We use the word we want to model as our input X, and the surrounding words as target output Y.
Once we can predict the surrounding words with a fair degree of accuracy, we remove the output layer and use the hidden layer to get our word vectors. It’s a cool little bit of hackery that yields very interesting results.
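To make that concrete, here’s a minimal numpy sketch of that setup, using toy dimensions I made up (a 10-word vocabulary and a 4-dimensional hidden layer); the names W_in and W_out are just labels for this illustration, not anything from the original papers.

```python
import numpy as np

vocab_size, hidden_dim = 10, 4   # toy sizes, purely for illustration
rng = np.random.default_rng(0)

W_in = rng.normal(size=(vocab_size, hidden_dim))    # input -> hidden weights
W_out = rng.normal(size=(hidden_dim, vocab_size))   # hidden -> output weights

def predict_context(word_idx):
    """One forward pass: given an input word, score every word in the vocabulary."""
    hidden = W_in[word_idx]            # a one-hot input times W_in just selects a row
    scores = hidden @ W_out            # one raw score per vocabulary word
    return np.exp(scores) / np.exp(scores).sum()   # softmax -> probabilities

# After training, we throw W_out away; row i of W_in is word i's vector.
fox_vector = W_in[3]
```

The “hackery” is just that last line: once training is done, the input weight matrix is the lookup table of word vectors.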
Let’s see if we can represent that flow graphically!
It’s actually that easy. To recap, we start with a ‘fake’ training task: estimating the probability of observing different words in the same context as our given word. Our input is the word we wish to model. The target output is one of the words that appears next to it. For example, if our context is one word behind and one word ahead, we get the following training examples for “fox” from this sentence:
The quick red fox jumped over the lazy brown dog.
Example 1: (fox, red)
Example 2: (fox, jumped)
This gives us two different training examples. The first one uses a one-hot vector for the word ‘fox’ as its input and a one-hot vector for ‘red’ as its target output. The next example would use one-hot vectors for ‘fox’ and ‘jumped’ for the input and target, respectively.
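Here’s a quick sketch of how those pairs could be generated; the helper names skipgram_pairs and one_hot are hypothetical, purely for illustration.

```python
sentence = "the quick red fox jumped over the lazy brown dog".split()
vocab = sorted(set(sentence))
word_to_idx = {w: i for i, w in enumerate(vocab)}

def skipgram_pairs(tokens, window=1):
    """Build (input_word, context_word) pairs for every position in the sentence."""
    pairs = []
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs

def one_hot(word):
    """One-hot encode a word against the toy vocabulary."""
    vec = [0] * len(vocab)
    vec[word_to_idx[word]] = 1
    return vec

pairs = skipgram_pairs(sentence, window=1)
print([p for p in pairs if p[0] == "fox"])   # [('fox', 'red'), ('fox', 'jumped')]
print(one_hot("fox"))                        # the input vector for both examples
```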
You might have picked up on the fact that we’re only using one word from the context at a time, which seems strange. Wouldn’t we want to model the full context for a given word to really understand what it means?
Not necessarily. That’s what CBOW does, and it does perform better on small datasets. But as our dataset grows, skip-gram outperforms it in most cases.
The tradeoff is that we sometimes penalize the model unfairly. In other words, we tell it to adjust for words that actually appear in the context as if they didn’t. ‘Jumped’ is in the same context as ‘fox’, but the model doesn’t know that when we train on the target word ‘red’, because the one-hot vector for ‘red’ says that nothing else, including ‘jumped’, is present. So training on ‘red’ adjusts the model as if ‘jumped’ weren’t there, pushing down our estimated probability of observing ‘jumped’ when ‘fox’ appears.
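If it helps to see that push in numbers, here’s a small sketch of the softmax-plus-cross-entropy gradient (the scores are made up). The gradient on each output score is the predicted probability minus the one-hot target, so every word other than the target, including ‘jumped’, gets a positive gradient, which means its score is pushed down.

```python
import numpy as np

# Toy output scores over a 5-word vocabulary: [red, jumped, over, quick, lazy]
scores = np.array([1.0, 1.2, 0.3, 0.1, -0.5])
probs = np.exp(scores) / np.exp(scores).sum()

target = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # one-hot target for 'red'

# Gradient of the cross-entropy loss with respect to the scores under softmax
grad = probs - target
print(grad)   # grad[1] > 0: the score for 'jumped' gets nudged downward
```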
The gambit is that we’ll have enough data to make up for those incorrect adjustments, so that eventually we converge to probabilities that are good enough. And if you’re attempting to build a word2vec model, you should probably have a lot of data anyway.
Looking Forward with Word2vec
That’s an idiot’s guide to word2vec. TensorFlow, Gensim, and other Python implementations make it pretty easy to fire up a word2vec model and get cracking with text analysis, so check those out if you’re interested in exploring the topic further.
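As a taste of how little code that takes, here’s a minimal Gensim sketch; it assumes Gensim 4.x (where sg=1 selects skip-gram and the size parameter is called vector_size), and the two-sentence corpus is obviously far too small to learn anything useful.

```python
from gensim.models import Word2Vec

# A real corpus would have millions of sentences; this is only to show the API.
sentences = [
    "the quick red fox jumped over the lazy brown dog".split(),
    "the lazy brown dog slept while the fox ran away".split(),
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context words considered on each side
    sg=1,             # 1 = skip-gram, 0 = CBOW
    min_count=1,      # keep every word, even ones that appear once
)

fox_vector = model.wv["fox"]                       # the learned vector for 'fox'
neighbors = model.wv.most_similar("fox", topn=3)   # nearest words by cosine similarity
```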
Most interestingly, there are many variations on word2vec. In particular, data scientists can use the same approach for a wide variety of tasks completely unrelated to text analysis. We’ll dig into those in a later post, so stay tuned.