Introduction
Word2vec is an open-source tool released in 2013 by a group of Google researchers led by Tomas Mikolov. It implements several efficient methods for representing words as low-dimensional real-valued vectors, known as word embeddings, which are of great importance in many natural language processing applications. Word2vec also evaluates the quality of the word representations by measuring the ‘similarity’ of words, both syntactically and semantically. In addition, it uses ‘analogy’ tasks to probe this similarity, since algebraic operations can be performed on the word vectors. For example, the original paper gives an illustration of the ‘additive compositionality’ of word vectors:
vector(‘King’) – vector(‘Man’) + vector(‘Woman’) ≈ vector(‘Queen’)
(the result is not exactly the ‘Queen’ vector, but ‘Queen’ is the word whose vector lies closest to it)
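How such an analogy query is answered is worth spelling out: the query vector is obtained by simple vector arithmetic, and the answer is the vocabulary word whose vector is closest to it, excluding the query words themselves. A sketch in formula form, assuming cosine similarity is used as the closeness measure, as in the original tool:

\[
\mathbf{x} = v_{\mathrm{King}} - v_{\mathrm{Man}} + v_{\mathrm{Woman}}, \qquad
w^{*} = \arg\max_{w \in V \setminus \{\mathrm{King},\,\mathrm{Man},\,\mathrm{Woman}\}}
\frac{\mathbf{x} \cdot v_{w}}{\lVert \mathbf{x} \rVert \, \lVert v_{w} \rVert}
\]

Here $v_{w}$ denotes the learned vector of word $w$ and $V$ the vocabulary; with well-trained vectors, $w^{*}$ comes out as ‘Queen’.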
Word Representation
To apply machine learning methods to natural language understanding, one of the most essential issues is how to turn the vast number of words in the real world into something our computers can process. For a long time, words were most commonly represented as long vectors whose dimensionality equals the vocabulary size, with a 1 at the position corresponding to the current word and 0 everywhere else. This is called ‘one-hot representation’. However, when the vocabulary becomes very large, the vectors turn out to be incredibly long and sparse (most of the elements are 0), which also makes our models more likely to suffer from ‘the curse of dimensionality’.
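As a minimal illustration of what a one-hot vector looks like (the toy vocabulary and function below are made up for this example and are not part of Word2vec):

#include <stdio.h>
#include <string.h>

#define VOCAB_SIZE 5   /* toy vocabulary; real vocabularies contain 10^5 to 10^6 words */

/* Toy vocabulary: a word's index is the position of the single 1 in its one-hot vector. */
const char *vocab[VOCAB_SIZE] = {"king", "queen", "man", "woman", "apple"};

/* Fill vec (length VOCAB_SIZE) with the one-hot encoding of word. */
void one_hot(const char *word, float vec[VOCAB_SIZE]) {
  memset(vec, 0, sizeof(float) * VOCAB_SIZE);
  for (int i = 0; i < VOCAB_SIZE; i++)
    if (strcmp(vocab[i], word) == 0) { vec[i] = 1.0f; return; }
}

int main(void) {
  float v[VOCAB_SIZE];
  one_hot("man", v);
  for (int i = 0; i < VOCAB_SIZE; i++) printf("%.0f ", v[i]);   /* prints: 0 0 1 0 0 */
  printf("\n");
  return 0;
}

With a realistic vocabulary size, the same vector would contain hundreds of thousands of zeros and a single one, which is exactly the sparsity problem described above.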
Another method is known as distributed representation, which is what Word2vec actually uses. It represents words as low-dimensional real-valued vectors, and the vectors are obtained through the model training process. One of the benefits of Word2vec (and of distributed representations in general) is that by measuring the ‘distance’ between any two word vectors, we get a clear picture of how closely related the two words are.
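To make ‘measuring the distance between two word vectors’ concrete: the distance tool shipped with Word2vec ranks words by cosine similarity, which can be sketched in C as follows (an illustrative re-implementation, not the tool's own code):

#include <math.h>

/* Cosine similarity of two dense word vectors of dimensionality size.
   Values close to 1 mean the two words tend to appear in similar contexts. */
float cosine_similarity(const float *a, const float *b, int size) {
  float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f;
  for (int i = 0; i < size; i++) {
    dot    += a[i] * b[i];
    norm_a += a[i] * a[i];
    norm_b += b[i] * b[i];
  }
  return dot / (sqrtf(norm_a) * sqrtf(norm_b) + 1e-8f);   /* epsilon guards against division by zero */
}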
Models
Word2vec proposed two kinds of model architectures to compute continuous vector representations of words: CBOW (Continuous Bag-of-Words) Model and Skip-Gram Model.
1. CBOW Model
CBOW is a three-layer architecture that takes the ‘context’ (a fixed number of words both before and after a target word) as input and outputs a prediction of what that word is. ‘Bag-of-words’ means that the model sums up all the input vectors in its projection layer, so the order of the context words does not influence the result.
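A minimal sketch of the projection step just described, assuming the word vectors are stored back to back in one flat array (as syn0 is in word2vec.c); the function name and argument layout here are illustrative:

/* Sum the vectors of the context words into the projection layer neu1.
   Word w occupies syn0[w * layer1_size .. w * layer1_size + layer1_size - 1]. */
void cbow_projection(const float *syn0, const int *context, int context_len,
                     int layer1_size, float *neu1) {
  for (int c = 0; c < layer1_size; c++) neu1[c] = 0.0f;
  for (int i = 0; i < context_len; i++) {
    const float *v = syn0 + (long long)context[i] * layer1_size;
    for (int c = 0; c < layer1_size; c++) neu1[c] += v[c];   /* pure summation: word order does not matter */
  }
}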

2. Skip-Gram Model
In contrast, the Skip-Gram model predicts the context from the current word, even though its structure is similar to CBOW's. Its projection layer is in fact not indispensable and is kept mainly for comparison with CBOW.
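To make the contrast concrete, here is a rough sketch of how Skip-Gram forms its training examples: every word within the window around the current word yields one (current word, context word) pair, with the current word as the input. The function and variable names are illustrative, not taken from word2vec.c:

#include <stdio.h>

/* For the word at position pos in sentence sen (an array of word indices),
   visit every context word within window positions and treat
   (sen[pos] -> sen[c]) as one training pair. */
void skipgram_pairs(const int *sen, int sentence_length, int pos, int window) {
  for (int a = -window; a <= window; a++) {
    int c = pos + a;
    if (a == 0 || c < 0 || c >= sentence_length) continue;
    /* During training, the vector of sen[pos] would be updated to predict sen[c]. */
    printf("pair: center=%d context=%d\n", sen[pos], sen[c]);
  }
}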

3. More about the algorithms
From the source code, we can see that Word2vec provides two training algorithms for both CBOW and Skip-Gram: Hierarchical Softmax and Negative Sampling.
Hierarchical Softmax is a key technique that greatly improves the performance of Word2vec. The output layer corresponds to a binary tree, or more precisely a Huffman tree, built according to the word counts. The algorithm walks down the tree and performs a binary classification at every non-leaf node, each producing a probability via the sigmoid function σ, until it reaches the leaf node corresponding to a word in the vocabulary. In addition, Word2vec uses the Huffman code to label the two branches, treating 0 as the positive class and 1 as the negative class. The optimization objective of the model is very similar to that of logistic regression and can be trained with Stochastic Gradient Descent.
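In formula form, the probability the model assigns to a word is the product of the binary decisions along its Huffman path. The article above does not introduce symbols, so as a sketch let $\mathbf{h}$ be the projection-layer vector, $\boldsymbol{\theta}^{w}_{j}$ the parameter vector stored at the $j$-th internal node on the path of word $w$ (syn1 in the code), $d^{w}_{j}\in\{0,1\}$ the $j$-th bit of $w$'s Huffman code, and $l_w$ the code length; with the convention that 0 marks the positive branch:

\[
p(w \mid \mathrm{context}) = \prod_{j=1}^{l_w}
\big[\sigma(\mathbf{h}^{\top}\boldsymbol{\theta}^{w}_{j})\big]^{1-d^{w}_{j}}
\big[1-\sigma(\mathbf{h}^{\top}\boldsymbol{\theta}^{w}_{j})\big]^{d^{w}_{j}},
\qquad \sigma(x) = \frac{1}{1+e^{-x}}
\]

Maximizing the logarithm of this product over the training corpus is what the SGD updates in the code excerpt further below implement.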

The alternative to Hierarchical Softmax in Word2vec is Negative Sampling. Briefly speaking, it also performs binary classification, but instead of traversing a tree it draws a handful of random negative examples, while the current word serves as the positive example. Readers who are interested in the details can refer to the source code for more information.
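Using the same notation as above (with syn1neg playing the role of the $\boldsymbol{\theta}$ vectors), the per-example objective that Negative Sampling maximizes can be sketched as

\[
\log \sigma(\mathbf{h}^{\top}\boldsymbol{\theta}_{w})
\;+\; \sum_{u \in \mathrm{NEG}(w)} \log\big(1-\sigma(\mathbf{h}^{\top}\boldsymbol{\theta}_{u})\big),
\]

where $w$ is the observed (positive) word and $\mathrm{NEG}(w)$ is a small set of randomly drawn negative words. The corresponding training loop, excerpted from word2vec.c, covers both architectures and both algorithms: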
if (cbow) {  // train the cbow architecture
  // in -> hidden: sum the context word vectors into the projection layer
  for (a = b; a < window * 2 + 1 - b; a++) if (a != window) {
    c = sentence_position - window + a;
    if (c < 0) continue;
    if (c >= sentence_length) continue;
    last_word = sen[c];
    if (last_word == -1) continue;
    for (c = 0; c < layer1_size; c++) neu1[c] += syn0[c + last_word * layer1_size];
  }
  // HIERARCHICAL SOFTMAX
  if (hs) for (d = 0; d < vocab[word].codelen; d++) {
    f = 0;
    l2 = vocab[word].point[d] * layer1_size;
    // Propagate hidden -> output
    for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1[c + l2];
    if (f <= -MAX_EXP) continue;
    else if (f >= MAX_EXP) continue;
    else f = expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))];
    // 'g' is the gradient multiplied by the learning rate
    g = (1 - vocab[word].code[d] - f) * alpha;
    // Propagate errors output -> hidden
    for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
    // Learn weights hidden -> output
    for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * neu1[c];
  }
  // NEGATIVE SAMPLING
  if (negative > 0) for (d = 0; d < negative + 1; d++) {
    if (d == 0) {
      target = word;
      label = 1;
    } else {
      next_random = next_random * (unsigned long long)25214903917 + 11;
      target = table[(next_random >> 16) % table_size];
      if (target == 0) target = next_random % (vocab_size - 1) + 1;
      if (target == word) continue;
      label = 0;
    }
    l2 = target * layer1_size;
    f = 0;
    for (c = 0; c < layer1_size; c++) f += neu1[c] * syn1neg[c + l2];
    if (f > MAX_EXP) g = (label - 1) * alpha;
    else if (f < -MAX_EXP) g = (label - 0) * alpha;
    else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;
    for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
    for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * neu1[c];
  }
  // hidden -> in: propagate the accumulated error back to the context word vectors
  for (a = b; a < window * 2 + 1 - b; a++) if (a != window) {
    c = sentence_position - window + a;
    if (c < 0) continue;
    if (c >= sentence_length) continue;
    last_word = sen[c];
    if (last_word == -1) continue;
    for (c = 0; c < layer1_size; c++) syn0[c + last_word * layer1_size] += neu1e[c];
  }
} else {  // train skip-gram
  for (a = b; a < window * 2 + 1 - b; a++) if (a != window) {
    c = sentence_position - window + a;
    if (c < 0) continue;
    if (c >= sentence_length) continue;
    last_word = sen[c];
    if (last_word == -1) continue;
    l1 = last_word * layer1_size;
    for (c = 0; c < layer1_size; c++) neu1e[c] = 0;
    // HIERARCHICAL SOFTMAX
    if (hs) for (d = 0; d < vocab[word].codelen; d++) {
      f = 0;
      l2 = vocab[word].point[d] * layer1_size;
      // Propagate hidden -> output
      for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1[c + l2];
      if (f <= -MAX_EXP) continue;
      else if (f >= MAX_EXP) continue;
      else f = expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))];
      // 'g' is the gradient multiplied by the learning rate
      g = (1 - vocab[word].code[d] - f) * alpha;
      // Propagate errors output -> hidden
      for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1[c + l2];
      // Learn weights hidden -> output
      for (c = 0; c < layer1_size; c++) syn1[c + l2] += g * syn0[c + l1];
    }
    // NEGATIVE SAMPLING
    if (negative > 0) for (d = 0; d < negative + 1; d++) {
      if (d == 0) {
        target = word;
        label = 1;
      } else {
        next_random = next_random * (unsigned long long)25214903917 + 11;
        target = table[(next_random >> 16) % table_size];
        if (target == 0) target = next_random % (vocab_size - 1) + 1;
        if (target == word) continue;
        label = 0;
      }
      l2 = target * layer1_size;
      f = 0;
      for (c = 0; c < layer1_size; c++) f += syn0[c + l1] * syn1neg[c + l2];
      if (f > MAX_EXP) g = (label - 1) * alpha;
      else if (f < -MAX_EXP) g = (label - 0) * alpha;
      else g = (label - expTable[(int)((f + MAX_EXP) * (EXP_TABLE_SIZE / MAX_EXP / 2))]) * alpha;
      for (c = 0; c < layer1_size; c++) neu1e[c] += g * syn1neg[c + l2];
      for (c = 0; c < layer1_size; c++) syn1neg[c + l2] += g * syn0[c + l1];
    }
    // Learn weights input -> hidden
    for (c = 0; c < layer1_size; c++) syn0[c + l1] += neu1e[c];
  }
}
sentence_position++;
if (sentence_position >= sentence_length) {
  sentence_length = 0;
  continue;
}
}  // end of the per-thread training loop (excerpt from word2vec.c)
What is different about Word2vec?
Compared with previous models for word embedding, such as the NNLM (Neural Network Language Model) of Bengio et al. (2003), Word2vec has several new features and improvements:
1. In the projection layer, NNLM concatenates the n-1 m-dimensional context word vectors end-to-end into a single m(n-1)-dimensional vector, while Word2vec simply sums the word vectors.
2. Word2vec removes the computationally expensive non-linear hidden layer.
3. Word2vec adopts Hierarchical Softmax and Negative Sampling to optimize the computation at the output layer and speed up training, reducing the per-word time complexity to O(log2(|V|)) from the O(|V|) of the traditional softmax (|V| is the vocabulary size; a quick numerical example follows this list).
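A back-of-the-envelope example of what this saving means (the numbers are illustrative, not from the paper): for a vocabulary of one million words,

\[
|V| = 10^{6} \;\Rightarrow\; \log_2 |V| \approx 20,
\]

so each training example needs on the order of 20 binary decisions (inner products) at the output layer instead of a full pass over all $10^{6}$ output units.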

Quick Start
- Download the code at http://word2vec.googlecode.com/svn/trunk/
- Run ‘make’ to compile the Word2vec tool.
- Run ‘./demo-word.sh’.
Conclusion
Word2vec made a splash after its release. It introduces efficient techniques that learn high-quality word representation vectors capturing both the syntactic and semantic similarity of words. Furthermore, rather than relying on complex neural networks, it shows that very simple, shallow models trained on a large corpus can do the job. Both the high-quality word vectors and the insights behind the model architectures will benefit many NLP applications in the future.
Reference
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the ICLR Workshop, 2013. (Link: https://arxiv.org/abs/1301.3781)
Author: Kejin Jin | Localized by Synced Global Team: Junpei Zhong