Introduction
For machines, understanding the meaning of words and sentences is a complex task because it requires considering not only the definition of words, but also their connotation, their relationships with other words and the way they interact with the context. The study of this problem belongs to the field of Natural Language Processing (NLP), which serves many purposes, such as extracting information from a given text. You can freely test an NLP model trained by the CATIE experts.
Natural language processing dates back to the early days of computing, the 1950s, when experts sought to represent words digitally. In the 2010s, the increasing power of computers enabled the popularization of neural networks, leading to the emergence of vector representation (associating a word with a sequence of several hundred numbers). Indeed, most machine learning models use vectors as training data.
Word embedding models aim to capture the relationships between words in a corpus of texts and translate them into vectors. In this article, we will analyze the Word2Vec model to explain how to interpret these vectors and how they are generated.
Word Arithmetic
One way of interpreting word vectors is to think of them as coordinates. Indeed, word embedding models translate the relationships between words into angles, distances and directions. For example, to evaluate the semantic proximity between 2 words, one can simply calculate the cosine of the angle between the 2 corresponding vectors: a value of 1 (angle of 0°) corresponds to synonyms, while a value of -1 indicates antonyms (angle of 180°).
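To make this concrete, here is a minimal sketch of the cosine computation with NumPy; the 3-dimensional vectors are made up for illustration, real word vectors have hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1 for same direction, -1 for opposite."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors (purely illustrative values)
queen = np.array([0.8, 0.3, 0.1])
king  = np.array([0.7, 0.4, 0.1])
car   = np.array([0.1, 0.1, 0.9])

print(cosine_similarity(queen, king))  # close to 1: semantically close words
print(cosine_similarity(queen, car))   # much lower: unrelated words
```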
More complex relationships between words can also be computed. Figure 1 shows the projection of some word vectors into a 3-dimensional space (before projection, vectors have hundreds of dimensions). We can see that the vector from queen to king is quite similar to the vectors from female to male or mare to stallion, suggesting that it characterizes the female-male relationship. Similarly, the relationship between Paris and France is similar to the one between Berlin and Germany:
\[Paris - France = Berlin - Germany\]
which is equivalent to:
\[Paris = Berlin - Germany + France\]
so one may find Canada's capital city by computing:
\[Berlin - Germany + Canada\]
It is possible to try word arithmetic on the Polytechnique website. However, you should keep in mind that no model is perfect, and some results of arithmetic operations may be incorrect.
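As an illustration, the snippet below performs the same kind of arithmetic with the gensim library, assuming pretrained vectors downloaded through its gensim.downloader module (the glove-wiki-gigaword-100 model used here is just one possible choice):

```python
import gensim.downloader as api

# Download a set of pretrained, lowercase word vectors (any Word2Vec/GloVe-style
# KeyedVectors object works the same way).
vectors = api.load("glove-wiki-gigaword-100")

# "Berlin - Germany + Canada" should land near the capital of Canada
print(vectors.most_similar(positive=["berlin", "canada"], negative=["germany"], topn=3))

# Cosine similarity between two words
print(vectors.similarity("queen", "king"))
```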
Word2Vec
Google researchers (Mikolov et al.) developed Word2Vec in 2013, popularizing word embedding technology thanks to its simplicity and effectiveness. Although other word embedding models have since been developed (notably GloVe and FastText), Word2Vec remains widely used and cited in the scientific literature.
Some definitions
Context: Given a text, the context of a word is defined as all the words in its vicinity, at the various points in the text where it appears. The vicinity is associated with a window: a window of size 3 encompasses the 3 words preceding and the 3 words following the target word.
Vocabulary: (Sub)Set of words that appear in a text. For example, given the text "The sister of my sister is my sister", the associated vocabulary would contain at most the following words: "the", "sister", "of", "my", "is".
One-hot encoding: Given a vocabulary of size N, the one-hot encoding of a word in this vocabulary consists in creating a vector of size N containing N-1 zeros and a single 1, placed at the position of the word in the vocabulary. For example, with the vocabulary {"the", "sister", "of", "my", "is"}, the one-hot vector corresponding to "sister" would be [0, 1, 0, 0, 0].
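Both notions can be illustrated with a few lines of Python (the helper below is a toy example, not part of Word2Vec itself):

```python
# Build a vocabulary from the example text and one-hot encode a word.
text = "the sister of my sister is my sister"

# Vocabulary: unique words, in order of first appearance
vocabulary = list(dict.fromkeys(text.split()))
print(vocabulary)  # ['the', 'sister', 'of', 'my', 'is']

def one_hot(word: str, vocabulary: list[str]) -> list[int]:
    """Vector of len(vocabulary) zeros, with a single 1 at the word's position."""
    return [1 if w == word else 0 for w in vocabulary]

print(one_hot("sister", vocabulary))  # [0, 1, 0, 0, 0]
```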
How it works
The concept behind Word2Vec is to use a neural network to solve a "fake task", called a pretext task: the weights obtained after training are not used to infer results, they are the result, i.e. the word vectors. The model comes in 2 (slightly) different versions: CBOW (Continuous Bag Of Words) and Skip-Gram. CBOW tries to predict a word given its context, while Skip-Gram does the opposite (predict the context given a word). We will focus on the Skip-Gram model, as the method is similar for both versions.
Given a text and a window size, the following task is defined: given a word in the text (the input), compute for each other word the probability that it appears in the input's context. To solve this task, we use a neural network (sketched in code after the list) consisting of:
- The input layer, with the word being encoded as a one-hot vector
- A hidden layer, of arbitrary size, fully connected to the input
- The output layer, i.e. a probability vector the size of the vocabulary, fully connected to the hidden layer
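Below is a minimal NumPy sketch of such a network, with made-up sizes (a vocabulary of 12 words, a hidden layer of 3 units) and random, untrained weights; it only illustrates the forward pass, not the training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

V = 12  # vocabulary size (one entry per word of the example text below)
H = 3   # arbitrary size of the hidden layer

# Weights of the two fully connected layers (random values standing in for training)
W_in = rng.normal(scale=0.1, size=(V, H))   # input  -> hidden
W_out = rng.normal(scale=0.1, size=(H, V))  # hidden -> output

def forward(word_index: int) -> np.ndarray:
    """One-hot input -> hidden layer -> probability over the whole vocabulary."""
    x = np.zeros(V)
    x[word_index] = 1.0          # one-hot encoding of the input word
    hidden = x @ W_in            # equivalent to selecting row `word_index` of W_in
    scores = hidden @ W_out
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()       # softmax: probability of each word being in the context

print(forward(0))  # 12 probabilities summing to 1
```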
For example, with the text "Vacations in Nouvelle Aquitaine are dope, we should go to the Futuroscope", and a window of size 1, figure 2 illustrates how the model's training data is produced:
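The pair generation shown in figure 2 can also be sketched in a few lines of Python; the lowercased sentence and the window variable below are illustrative choices:

```python
# (input, context) training pairs for the Skip-Gram pretext task,
# mirroring what figure 2 illustrates.
text = "vacations in nouvelle aquitaine are dope we should go to the futuroscope"
words = text.split()
window = 1

pairs = []
for i, word in enumerate(words):
    for j in range(max(0, i - window), min(len(words), i + window + 1)):
        if j != i:
            pairs.append((word, words[j]))

print(pairs[:4])
# [('vacations', 'in'), ('in', 'vacations'), ('in', 'nouvelle'), ('nouvelle', 'in')]
```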
With the same example, figure 3 represents a neural network trained with the previously generated training data.
Once the model is trained, only the input weights matter: in our case, a matrix with 12 rows (one per word) and 3 columns (the size of the hidden layer), as shown in figure 4. Each row corresponds to a word vector.
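Continuing the toy example, retrieving a word vector then amounts to reading the corresponding row of that matrix; the weights below are random placeholders standing in for trained values:

```python
import numpy as np

# After training, the input weight matrix has one row per vocabulary word
# (12 rows here) and one column per hidden unit (3 here).
vocabulary = ["vacations", "in", "nouvelle", "aquitaine", "are", "dope",
              "we", "should", "go", "to", "the", "futuroscope"]
W_in = np.random.default_rng(0).normal(size=(len(vocabulary), 3))

word_vectors = {word: W_in[i] for i, word in enumerate(vocabulary)}
print(word_vectors["aquitaine"])  # the 3-dimensional vector of "aquitaine"
```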
Note that in our example, the outputs are fairly predictable, because each word appears only once in the text. In reality, the text corpora used comprise at least a few thousand words. There should therefore be a high probability that nouvelle is in the vicinity of aquitaine, as these words are often associated.
The word vectors produced this way are relevant because two similar words end up with two close vectors. Indeed, two synonyms should have a similar context, which translates into almost equal outputs for these two inputs. The model therefore assigns them almost identical weights, resulting in two close vectors.
Applications and limits
As mentioned in the introduction, word embedding models can generate vectors for training more sophisticated NLP models. They can also be used on their own to solve simple tasks, being resource-efficient, easily trainable and explainable. For example, word similarity can be used in a search engine to replace a keyword with another or to extend the keyword list based on context. Thanks to the vectors, it is also possible to study the connotation of words in a text to highlight biases linked to stereotypes; cf. Garg et al. (2018).
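For instance, keyword extension can be sketched as follows with gensim, assuming the same pretrained vectors as in the word-arithmetic example above (the keyword itself is arbitrary):

```python
# Expanding a search keyword with its nearest neighbours in the embedding space.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # reuse if already loaded

keyword = "vacation"
expanded = [word for word, _ in vectors.most_similar(keyword, topn=5)]
print([keyword] + expanded)  # the original keyword plus 5 related terms to search for
```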
Word embedding models also have applications outside the field of language processing. Indeed, instead of vectorizing words using the text from which they are derived as a context, we can vectorize products from a marketplace using users' purchase history as a context, in order to recommend similar products; cf. Grbovic et al. (2015).
The main limitation of word embedding models is that they do not take into account the polysemy of a word: for example, given the text "The bank is located on the right bank", the model will only create a single vector for the word "bank". Another drawback is the corpus pre-processing work to be carried out upstream: we need to define a vocabulary, that is to say remove words that are too frequent ("there", "is", "a"...) and possibly collapse inflected forms (is it desirable for "word" and "words" to each have their own vector?).
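As a rough illustration of this pre-processing step, here is a toy filter with a made-up stop-word list; note that the second sentence also shows the polysemy issue, since both occurrences of "bank" would end up sharing a single vector:

```python
# A minimal pre-processing pass: lowercase the corpus and drop a few
# overly frequent words. Real pipelines use longer stop-word lists and
# often lemmatization as well.
stop_words = {"there", "is", "a", "the", "of", "my"}

def preprocess(sentence: str) -> list[str]:
    return [w for w in sentence.lower().split() if w not in stop_words]

corpus = ["The sister of my sister is my sister",
          "The bank is located on the right bank"]
print([preprocess(s) for s in corpus])
# [['sister', 'sister', 'sister'], ['bank', 'located', 'on', 'right', 'bank']]
```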
The latest language models (GPT, Bloom, Llama...) based on transformers overcome these limitations. They can be trained directly on texts. They also use more sophisticated vectors, which represent a word and its context, enabling them to distinguish the different meanings of a word.
Conclusion
Word embedding techniques have revolutionized NLP technologies by providing simple, inexpensive models with impressive results. While transformers are replacing these models in most applications, there are still some cases where they remain relevant. In a forthcoming article on the Vaniila blog, you will discover a concrete application of word embedding, through a CATIE project that you can try yourself!
References
- Efficient Estimation of Word Representations in Vector Space by Mikolov et al. (2013),
- Word2Vec Tutorial - The Skip-Gram Model by McCormick (2016),
- Word embeddings quantify 100 years of gender and ethnic stereotypes by Garg, Schiebinger, Jurafsky and Zou (2018),
- E-commerce in your inbox: Product recommendations at scale by Grbovic, Radosavljevic, Djuric, Bhamidipati, Savla, Bhagwan and Sharp (2015)