The Math Behind Word2Vec Skip-Gram Training

A comprehensive linear algebraic representation of the original algorithm for math-minded developers

Motivations

While there are some great tutorials explaining the Word2Vec embedding concept and its applications to downstream Natural Language Processing (NLP) tasks, a comprehensive analytical treatment of the original work is not readily available online. Many NLP developers who already understand the Word2Vec concept still need to derive the formulas for the loss functions and gradient matrices in order to implement them efficiently on their hardware platforms, such as CPUs and GPUs.

Scope: Skip-Gram Language Model

The original Word2Vec paper proposed two types of language models for learning word embeddings: (1) Continuous Bag of Words (CBOW); and (2) Skip-Gram. The former predicts the probability of observing the center word given its context words, while the latter predicts the probability of observing the context words given the center word. Each model can be trained in two ways: (1) Naive Softmax; and (2) Negative Sampling (computationally more efficient). In this article, we focus on the Skip-Gram model and cover both implementations.

1. Naive Softmax

Definition: the vectorized Softmax function, applied to a (1, n)-dimensional vector z, returns a (1, n)-dimensional vector, as shown below.

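Using the standard definition (and the notation of the CS224N notes acknowledged below), the i-th entry of the output vector is:

$$
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}, \qquad i = 1, \dots, n.
$$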

1.1 Input(s) and Output(s)

Input: center word’s one-hot vector of size (|V|, 1) with 1 at the center word index in the vocabulary and 0 elsewhere.

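A minimal sketch of the forward pass under this formulation, assuming (consistent with the (d, |V|) shapes used in Section 1.3) that V and U are the center-word and context-word embedding matrices, x is the one-hot input above, and m is the context window size:

$$
v_c = V x \in \mathbb{R}^{d}, \qquad
\hat{y} = \mathrm{softmax}\!\left(U^{\top} v_c\right) \in \mathbb{R}^{|V|}.
$$

Output: the predicted distribution $\hat{y}$ over the vocabulary, which is compared against the $2m$ one-hot target vectors $y_{o_j}$ of the context words $o_j$, $-m \le j \le m$, $j \ne 0$.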

1.2 Loss function

The loss function is the sum of the cross-entropies between the predicted output distribution and each of the 2m one-hot target outputs, which can be simplified as shown below.

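A standard form of this simplification, with $o_j$ denoting the vocabulary index of the $j$-th context word and $u_w$ the $w$-th column of $U$, is:

$$
J = -\sum_{\substack{-m \le j \le m \\ j \ne 0}} \log \hat{y}_{o_j}
  = -\sum_{\substack{-m \le j \le m \\ j \ne 0}} u_{o_j}^{\top} v_c
  \;+\; 2m \log \sum_{w=1}^{|V|} \exp\!\left(u_w^{\top} v_c\right).
$$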

1.3 Gradients

We need to calculate the two gradient matrices of the loss function above (i.e., J, which is a scalar) with respect to the matrices V and U. Each gradient matrix has size (d, |V|). The first gradient matrix is the one with respect to V:

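A standard way to write it, together with the gradient with respect to U (using the notation above, with $y_{o_j}$ the one-hot target of the $j$-th context word), is:

$$
\frac{\partial J}{\partial v_c}
= U\!\left(2m\,\hat{y} - \sum_{\substack{-m \le j \le m \\ j \ne 0}} y_{o_j}\right),
\qquad
\frac{\partial J}{\partial V} = \frac{\partial J}{\partial v_c}\, x^{\top},
$$

so only the column of $V$ corresponding to the center word is nonzero. The second gradient matrix, with respect to $U$, is

$$
\frac{\partial J}{\partial U}
= v_c \left(2m\,\hat{y} - \sum_{\substack{-m \le j \le m \\ j \ne 0}} y_{o_j}\right)^{\top}.
$$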

1.4 Compute Complexity

The Softmax function, which is called in each SGD iteration to calculate the predicted output vector, involves O(|V|) operations. This makes the Softmax implementation of Skip-Gram computationally inefficient. Remember that, in practice, a vocabulary can contain anywhere from 30,000 to several million words.
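To make the cost concrete, here is a minimal NumPy sketch of a single SGD step for one center word and its 2m context words under this formulation (the function and variable names are illustrative, not from the original implementation); note that both the softmax and the gradient of U touch all |V| columns:

```python
import numpy as np

def naive_softmax_sgd_step(V, U, center_idx, context_idxs, lr=0.025):
    """One SGD step for a single center word and its 2m context words.

    V, U : (d, |V|) center-word and context-word embedding matrices.
    center_idx : vocabulary index of the center word.
    context_idxs : list of the 2m context-word indices.
    """
    v_c = V[:, center_idx]                      # (d,) center-word embedding
    scores = U.T @ v_c                          # (|V|,) -- already O(d * |V|)
    scores -= scores.max()                      # numerical stability
    exp_scores = np.exp(scores)
    y_hat = exp_scores / exp_scores.sum()       # softmax over the whole vocabulary

    # delta = 2m * y_hat - sum_j y_{o_j}  (see the gradient formulas above)
    delta = len(context_idxs) * y_hat
    for o in context_idxs:
        delta[o] -= 1.0

    grad_v_c = U @ delta                        # (d,)     gradient w.r.t. v_c
    grad_U = np.outer(v_c, delta)               # (d, |V|) -- every column of U is updated

    V[:, center_idx] -= lr * grad_v_c
    U -= lr * grad_U
    return V, U
```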

2. Negative Sampling

Unlike the Naive Softmax implementation of Skip-Gram, which uses the Softmax function to model a discrete probability distribution over the vocabulary, Negative Sampling uses the (vectorized) Sigmoid function to model individual binary probabilities:

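In the standard formulation, the Sigmoid function and the binary probability that a (word, center) pair was actually observed in the corpus are:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
P(D = 1 \mid w, c) = \sigma\!\left(u_w^{\top} v_c\right), \qquad
P(D = 0 \mid w, c) = \sigma\!\left(-u_w^{\top} v_c\right).
$$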

2.1 Input(s) and Output(s)

Input: center word’s one-hot vector of size (|V|, 1) with 1 at the center word index in the vocabulary and 0 elsewhere.

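For each of the $2m$ context words $o_j$, the model also draws $K$ negative samples $w_1, \dots, w_K$ from a noise distribution (the unigram distribution raised to the 3/4 power in [1]). The target outputs are binary: $D = 1$ for the observed pairs $(c, o_j)$ and $D = 0$ for the sampled pairs $(c, w_k)$.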

2.2 Loss function

The loss function is the sum of the binary cross-entropies between the predicted probabilities and the binary targets, over each of the 2m context words and their negative samples. The final result can be simplified as shown below.

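A standard form of the simplified loss (following [1], with $o_j$ the $j$-th context word and $w_1, \dots, w_K$ its negative samples) is:

$$
J = \sum_{\substack{-m \le j \le m \\ j \ne 0}}
\left[
-\log \sigma\!\left(u_{o_j}^{\top} v_c\right)
- \sum_{k=1}^{K} \log \sigma\!\left(-u_{w_k}^{\top} v_c\right)
\right].
$$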

2.3 Gradients

Similarly, we need to calculate the two gradient matrices of the loss function above (i.e., J, which is a scalar) with respect to the matrices V and U. Each gradient matrix has size (d, |V|). The first gradient matrix is the one with respect to V:

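A standard way to write it, together with the gradients of the relevant columns of U (same notation as above), is:

$$
\frac{\partial J}{\partial v_c}
= \sum_{\substack{-m \le j \le m \\ j \ne 0}}
\left[
\left(\sigma\!\left(u_{o_j}^{\top} v_c\right) - 1\right) u_{o_j}
+ \sum_{k=1}^{K} \sigma\!\left(u_{w_k}^{\top} v_c\right) u_{w_k}
\right],
\qquad
\frac{\partial J}{\partial V} = \frac{\partial J}{\partial v_c}\, x^{\top},
$$

and, for each context word $o_j$ and each of its negative samples $w_k$,

$$
\frac{\partial J}{\partial u_{o_j}} = \left(\sigma\!\left(u_{o_j}^{\top} v_c\right) - 1\right) v_c,
\qquad
\frac{\partial J}{\partial u_{w_k}} = \sigma\!\left(u_{w_k}^{\top} v_c\right) v_c.
$$

All other columns of $U$ and $V$ have zero gradient.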

2.4 Compute Complexity

Unlike the Naive Softmax implementation, in which each SGD iteration updates all |V| vectors in matrix U, Negative Sampling updates only K+1 vectors per context word in each iteration. The complexity is therefore O(K), which is much smaller than O(|V|). Remember that K is usually chosen to be smaller than 30.
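As a companion to the earlier sketch, here is a minimal NumPy sketch of the Negative Sampling update for one (center, context) pair with K negative samples (again, the names are illustrative); only the K+1 touched columns of U, plus one column of V, are updated:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_sgd_step(V, U, center_idx, context_idx, neg_idxs, lr=0.025):
    """One SGD step for a single (center, context) pair and K negative samples.

    V, U : (d, |V|) center-word and context-word embedding matrices.
    neg_idxs : K distinct indices drawn from the noise distribution
               (assumed distinct from context_idx).
    """
    v_c = V[:, center_idx]                        # (d,)
    u_o = U[:, context_idx]                       # (d,)   true context vector
    U_neg = U[:, neg_idxs]                        # (d, K) negative-sample vectors

    sig_pos = sigmoid(u_o @ v_c)                  # scalar
    sig_neg = sigmoid(U_neg.T @ v_c)              # (K,)

    grad_v_c = (sig_pos - 1.0) * u_o + U_neg @ sig_neg   # only K + 1 dot products
    grad_u_o = (sig_pos - 1.0) * v_c
    grad_U_neg = np.outer(v_c, sig_neg)           # (d, K)

    V[:, center_idx] -= lr * grad_v_c
    U[:, context_idx] -= lr * grad_u_o
    U[:, neg_idxs] -= lr * grad_U_neg             # only K columns of U touched here
    return V, U
```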

3. Mathematical derivation of the Loss functions and Gradients

To avoid overwhelming the reader, I have put all the mathematical proofs for the loss functions and gradients in a separate document written as a companion to this blog.

Acknowledgement

Some of the notation and discussion in this article is inspired by the notes and materials of the Stanford CS224N course, taught by Prof. Christopher D. Manning in 2020.

References

[1] Tomas Mikolov, et al. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS, pages 3111–3119.

