The Math Behind Word2Vec Skip-Gram Training

A comprehensive linear algebraic representation of the original algorithm for math-minded developers

Nima PourNejatian
7 min read · Jan 13, 2021

Motivations

(1) While there are some great tutorials explaining the Word2Vec embedding concept and its applications for downstream Natural Language Processing (NLP) tasks, a comprehensive analytical representation of the original work is not readily available online. Many NLP developers who already know the Word2Vec concept still need to derive the formulas for the loss functions and gradient matrices in order to write a computationally efficient implementation on their hardware platforms, such as CPUs and GPUs.

(2) Recently, Transformer-based language models and embeddings such as BERT have shown state-of-the-art performance. Several NLP scientists argue that Transformer-based pre-trained models (which are trained on generic corpora such as Wikipedia and BookCorpus) underperform on some domain-specific downstream NLP tasks, such as Named Entity Recognition or Dialogue Systems for biomedical texts (see the Doctor GPT-3 example in the References section). Therefore, some companies may need to train their own embeddings on their own corpora (likely with billions of tokens). However, not all developers have the supercomputers needed for such expensive model training. Having interacted with several professional NLP developers, the author has observed this problem and noted that some developers may prefer to start by training non-Transformer-based embeddings (e.g., Word2Vec, GloVe) on their domain-specific corpora.

Readers are encouraged to refer to the original paper and online tutorials for an introduction to Word2Vec; this article does not repeat that material.

Scope: Skip-Gram Language Model

The original Word2Vec paper proposed two types of language models for learning word embeddings: (1) Continuous Bag of Words (CBOW); and (2) Skip-Gram. The former predicts the probability of observing the center word given its context words, whereas the latter predicts the probability of observing the context words given the center word. Each model can be implemented in two ways: (1) Naive Softmax; and (2) Negative Sampling (computationally more efficient). In this article, we focus on the Skip-Gram model and cover both implementations.

Reminder: In Word2Vec, we iterate through the corpus and set each word as the center word one at a time. In each iteration, we form a window of “2m” context words surrounding the center word (i.e., m context words on each side of the center word).

The Skip-Gram model learns two (d, 1)-dimensional embedding vectors for each word w in its vocabulary of |V| words: a center word vector v and a context word vector u. Therefore, it ultimately learns two (d, |V|) matrices: U and V. The center word vectors of the Skip-Gram model are generally used as the representation vectors of the words.
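
In matrix form (the column-stacking convention below is this write-up's assumption, consistent with the (d, |V|) shapes above):

$$
V = \big[\, v_1 \;\; v_2 \;\; \cdots \;\; v_{|V|} \,\big] \in \mathbb{R}^{d \times |V|}, \qquad
U = \big[\, u_1 \;\; u_2 \;\; \cdots \;\; u_{|V|} \,\big] \in \mathbb{R}^{d \times |V|}
$$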

Side note: |V| is the cardinality of the vocabulary and an abbreviation for |Vocabulary|; whereas V is the matrix of all center word vectors.

1. Naive Softmax

Definition: The vectorized Softmax function for a (1, n)-dimensional vector z returns a (1, n)-dimensional vector as follows:
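
In standard notation, the i-th element of the output is:

$$
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}, \qquad i = 1, \dots, n
$$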

Skip-Gram uses this function to model discrete probability distributions. Note that Softmax preserves the dimension of its input vector.

1.1 Input(s) and Output(s)

Input: center word’s one-hot vector of size (|V|, 1) with 1 at the center word index in the vocabulary and 0 elsewhere.

Predicted Output: conditional probability vector of size (|V|, 1).

Note: Skip-Gram computes the probability of each word appearing in the context independently of its distance to the center word. So, a single predicted probability vector suffices; we do not need to compute 2m separate probability vectors.

Target Outputs: 2m one-hot vectors of size (|V|, 1) with 1's at the context words' indices in the vocabulary and 0's elsewhere.
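
Putting the pieces together (the symbols x_c for the one-hot input and ŷ for the predicted output are this sketch's notation, following common Word2Vec write-ups), the forward pass for a center word c is:

$$
v_c = V x_c, \qquad \hat{y} = \mathrm{softmax}\!\left(U^{\top} v_c\right), \qquad
\hat{y}_o = P(o \mid c) = \frac{\exp\!\left(u_o^{\top} v_c\right)}{\sum_{w=1}^{|V|} \exp\!\left(u_w^{\top} v_c\right)}
$$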

1.2 Loss function

The loss function is the sum of the cross-entropies between the predicted output and each of the 2m target outputs, which can be simplified to:
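
A sketch of the standard simplification, writing o_j for the j-th context word of center word c and ŷ for the predicted output vector:

$$
J(U, V) = \sum_{\substack{-m \le j \le m \\ j \ne 0}} \mathrm{CE}\!\left(y_{o_j}, \hat{y}\right)
= -\sum_{\substack{-m \le j \le m \\ j \ne 0}} u_{o_j}^{\top} v_c \;+\; 2m \,\log \sum_{w=1}^{|V|} \exp\!\left(u_w^{\top} v_c\right)
$$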

1.3 Gradients

We need to calculate the two gradient matrices of the loss function above (i.e., J, which is a scalar) with respect to the matrices V and U. Each gradient matrix has size (d, |V|). The first gradient matrix is the one with respect to V:
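
In the notation above, only the column of V belonging to the center word c receives a nonzero gradient (a sketch of the standard result; y_{o_j} is the one-hot target of the j-th context word):

$$
\frac{\partial J}{\partial v_c} = \sum_{\substack{-m \le j \le m \\ j \ne 0}} U\left(\hat{y} - y_{o_j}\right), \qquad
\frac{\partial J}{\partial v_w} = 0 \;\; \text{for } w \ne c
$$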

And, the second gradient matrix is the one with respect to U:
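
Again as a sketch in the same notation, every column of U receives a gradient through the Softmax term:

$$
\frac{\partial J}{\partial U} = v_c \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left(\hat{y} - y_{o_j}\right)^{\top} \;\in\; \mathbb{R}^{d \times |V|}
$$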

Now we have all the information needed for a Stochastic Gradient Descent (SGD) optimization to find the U and V matrices that minimize the loss function J. In each step, we iterate through the corpus, pick one center word at a time, and update the U and V matrices accordingly:
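
With learning rate α, each step applies the usual SGD update:

$$
V \leftarrow V - \alpha \,\frac{\partial J}{\partial V}, \qquad
U \leftarrow U - \alpha \,\frac{\partial J}{\partial U}
$$

Below is a minimal NumPy sketch of one such step for a single center word and its 2m context words, assuming the gradient expressions above; the function and variable names are illustrative, not taken from any reference implementation.

```python
import numpy as np

def naive_softmax_sgd_step(V, U, center, context_ids, lr=0.025):
    """One SGD step of Skip-Gram with Naive Softmax.

    V, U        : (d, |V|) center / context embedding matrices
    center      : vocabulary index of the center word
    context_ids : vocabulary indices of the 2m context words
    lr          : learning rate (alpha)
    """
    v_c = V[:, center].copy()                      # (d,)
    scores = U.T @ v_c                             # (|V|,)
    scores -= scores.max()                         # for numerical stability
    y_hat = np.exp(scores) / np.exp(scores).sum()  # softmax, (|V|,)

    # Accumulated error: sum over the 2m context words of (y_hat - y_o)
    err = len(context_ids) * y_hat
    for o in context_ids:
        err[o] -= 1.0

    grad_v_c = U @ err               # dJ/dv_c, shape (d,)
    grad_U = np.outer(v_c, err)      # dJ/dU,   shape (d, |V|)

    V[:, center] -= lr * grad_v_c
    U -= lr * grad_U
    return V, U
```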

1.4 Compute Complexity

The Softmax function, which is called in each SGD iteration to calculate the predicted output vector, involves O(|V|) operations. This makes the Softmax implementation of Skip-Gram computationally inefficient. Remember that, in practical cases, a vocabulary can contain from 30,000 to several million words.

2. Negative Sampling

Unlike the Naive Softmax implementation of Skip-Gram, which uses the Softmax function to model a discrete probability distribution, Negative Sampling uses the vectorized Sigmoid function to model individual binary probabilities:
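
Applied element-wise to a vector z, the Sigmoid is:

$$
\sigma(z)_i = \frac{1}{1 + e^{-z_i}}
$$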

The elements of the vectorized Sigmoid's output vector do not add up to 1 and do not constitute a probability distribution.

For every pair of a center word and a context word, we draw K random words (a.k.a. negative samples) from the vocabulary, excluding the context word. In other words, if we pick a different context word for the same center word, we need to draw K new random negative samples.

Negative Sampling has one extra tweak relative to Naive Softmax: given a center word, it predicts the probability that the observed context word did appear in the center word's context and that each of the K negative samples did not. While this formulation brings a significant computational benefit (see Section 2.4), it introduces some estimation error relative to Naive Softmax.
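
In the usual formulation (o denotes the observed context word and w_1, …, w_K the negative samples; this notation is an assumption of this sketch), the two kinds of binary probabilities being modeled are:

$$
P(\text{$o$ observed in context of } c) = \sigma\!\left(u_o^{\top} v_c\right), \qquad
P(w_k \text{ not observed in context of } c) = \sigma\!\left(-u_{w_k}^{\top} v_c\right), \quad k = 1, \dots, K
$$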

2.1 Input(s) and Output(s)

Input: center word’s one-hot vector of size (|V|, 1) with 1 at the center word index in the vocabulary and 0 elsewhere.

Predicted Output: a vector of individual binary probabilities of size (|V|, 1); as noted above, its elements do not form a probability distribution.

Target Outputs: 2m one-hot vectors of size (|V|, 1) with 1's at the context words' indices in the vocabulary and 0's elsewhere.

2.2 Loss function

The loss function is the sum of the cross-entropies between the probabilities derived from the predicted output vector and each of the 2m target outputs. The final result can be simplified to:
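
A sketch of the standard simplification, for one center word c with context words o_j and K fresh negative samples w_{j,1}, …, w_{j,K} drawn per pair:

$$
J(U, V) = \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ -\log \sigma\!\left(u_{o_j}^{\top} v_c\right) \;-\; \sum_{k=1}^{K} \log \sigma\!\left(-u_{w_{j,k}}^{\top} v_c\right) \right]
$$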

2.3 Gradients

Similarly, we need to calculate the two gradient matrices of the loss function above (i.e., J, which is a scalar) with respect to the matrices V and U. Each gradient matrix has size (d, |V|). The first gradient matrix is the one with respect to V:
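
As before, only the center word's column of V receives a nonzero gradient (a sketch in the notation above):

$$
\frac{\partial J}{\partial v_c} = \sum_{\substack{-m \le j \le m \\ j \ne 0}} \left[ \left(\sigma\!\left(u_{o_j}^{\top} v_c\right) - 1\right) u_{o_j} \;+\; \sum_{k=1}^{K} \sigma\!\left(u_{w_{j,k}}^{\top} v_c\right) u_{w_{j,k}} \right], \qquad
\frac{\partial J}{\partial v_w} = 0 \;\; \text{for } w \ne c
$$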

And, the second gradient matrix is the one with respect to U:
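
For each (center, context) pair, only K+1 columns of U receive a nonzero gradient (sketch):

$$
\frac{\partial J}{\partial u_{o_j}} = \left(\sigma\!\left(u_{o_j}^{\top} v_c\right) - 1\right) v_c, \qquad
\frac{\partial J}{\partial u_{w_{j,k}}} = \sigma\!\left(u_{w_{j,k}}^{\top} v_c\right) v_c, \qquad
\frac{\partial J}{\partial u_w} = 0 \;\; \text{otherwise}
$$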

2.4 Compute Complexity

Unlike the Naive Softmax implementation, in which each SGD iteration updates all |V| vectors of the matrix U, Negative Sampling updates only K+1 vectors per iteration. Therefore, the complexity is O(K), which is much less than O(|V|). Remember that K is usually chosen to be less than 30.
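
To make the O(K) update explicit, here is a minimal NumPy sketch of one SGD step for a single (center, context) pair, assuming the gradient expressions above; the function and variable names are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_sgd_step(V, U, center, context, neg_ids, lr=0.025):
    """One SGD step of Skip-Gram with Negative Sampling for a single
    (center, context) pair and its K negative samples.

    V, U    : (d, |V|) center / context embedding matrices
    center  : vocabulary index of the center word
    context : vocabulary index of the observed context word
    neg_ids : vocabulary indices of the K (distinct) negative samples
    """
    v_c = V[:, center].copy()          # (d,)
    u_o = U[:, context].copy()         # (d,)
    U_neg = U[:, neg_ids]              # (d, K); fancy indexing returns a copy

    g_pos = sigmoid(u_o @ v_c) - 1.0   # scalar coefficient for the context word
    g_neg = sigmoid(U_neg.T @ v_c)     # (K,) coefficients for the negative samples

    # Only one column of V and K+1 columns of U are updated
    V[:, center]  -= lr * (g_pos * u_o + U_neg @ g_neg)
    U[:, context] -= lr * (g_pos * v_c)
    U[:, neg_ids] -= lr * np.outer(v_c, g_neg)
    return V, U
```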

3. Mathematical derivation of the Loss functions and Gradients

To avoid overwhelming readers, I have put all the mathematical proofs for the loss functions and gradients in this separate document, written as a companion to this blog.

Acknowledgement

Some of the notation and discussion in this article is inspired by the notes and materials of the Stanford CS224N course, lectured by Prof. Christopher D. Manning in 2020.

References

[1] Tomas Mikolov, et al. 2013b. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119.

[2] CS224n: Natural Language Processing with Deep Learning, Prof. Christopher D. Manning, Stanford, Winter 2021.

[3] Ashish Vaswani, et al. 2017. Attention is all you need. In Advances in NIPS, pages 6000–6010.

[4] Tomas Mikolov, et al. 2013a. Efficient Estimation of Word Representations in Vector Space. In ICLR Workshop Papers.

[5] Doctor GPT-3: Hype or Reality? 2020. Nabla.

