Efficient Estimation of Word Representations in Vector Space

Efficient Estimation of Word Representations in Vector Space – Mikolov et al.

Paper: arXiv:1301.3781 (1301.3781.pdf)

Quick summary
This paper introduced what is now commonly known as word2vec. It shows that a very simple neural network trained with a language modeling objective can learn word embeddings with interesting similarity properties. For example, you can do vector arithmetic to move around in the semantic space: King – Man + Woman = Queen, in the sense that Queen is the word whose vector is closest to the result. The main difference between this work and previous work is that the computational complexity of the model is low, so it can learn efficiently from billions of tokens.
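
As a rough illustration of that vector arithmetic, here is a minimal sketch in plain Python. The 3-dimensional vectors are hand-picked toy values, not learned embeddings, and the helper names `cosine` and `analogy` are my own. Leaving the three query words out of the search is a common convention that I am assuming here.

```python
import math

# Toy 3-dimensional vectors chosen by hand for illustration only.
# Real word2vec embeddings are learned and have hundreds of dimensions.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c).

    The query words a, b and c are excluded from the search (assumed convention).
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", vectors))  # prints "queen" for these toy vectors
```

With real embeddings the search runs over the whole vocabulary, and the nearest word is only approximately equal to the result of the arithmetic.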

Why I liked reading it
This paper took me behind the curtain of the incredibly popular word2vec library. I was surprised to find how simple the models are, given how big their impact on the field of NLP has been.

Core ideas
Distributional similarity : Words that appear in similar contexts tend to have similar meanings.
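
To make "context" concrete, here is a small sketch that lists the words within a fixed window around each word in a sentence. The function name `context_pairs`, the fixed window size and the example sentence are illustrative assumptions, not taken from the paper.

```python
def context_pairs(tokens, window=2):
    """Yield (center, context) pairs for every word within `window` positions."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]

sentence = "the cat sat on the mat".split()
pairs = list(context_pairs(sentence, window=1))
print(pairs[:4])
```

Words that keep showing up with the same neighbours in such pairs are the ones that end up with similar vectors. Pairs like these are roughly what the skip-gram model described below is trained on.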

Detailed Summary

  • This paper builds on the neural network language model (NNLM) proposed by Bengio et al. in 2001. In that model, it is expensive to compute the mapping from the projection layer to the hidden layer and from the hidden layer to the output layer, which makes it prohibitive to run with large vocabularies.
  • The model complexity analysis was interesting and new to me. For the feedforward NNLM, the complexity per training example is N x D + N x D x H + H x V, where N is the number of context words, D is the dimensionality of the projection, H is the hidden layer size and V is the vocabulary size. The first term is for mapping from the input layer to the projection layer, the second term is for mapping from the projection layer to the hidden layer and the third term is for mapping from the hidden layer to the output layer. Using variations of the softmax such as hierarchical softmax, the output term can be reduced from H x V to about H x log2(V), so the dominating term becomes N x D x H.
  • For the RNN LM, the complexity is H x H + H x V, where H x V is the dominating term. With hierarchical softmax, the output term can be reduced to about H x log2(V), so the dominating term becomes H x H.
  • The models proposed in this paper are super simple. The non-linear hidden layer is removed. There are two models :
    • Continuous bag of words (CBOW) : All the context words (both before and after the target word) are projected into an embedding space and averaged. This averaged projection then undergoes a linear transformation to predict the target (middle) word.
    • Continuous skip-gram : Uses the middle word to predict the context words around it.
  • Models were trained on a Google News corpus which contains about 6 billion tokens. The authors varied the dimensionality of the word vectors and the number of tokens used for training, and found that performance improved when increasing either factor, up to a point of diminishing returns.
  • An interesting detail is that the models were trained on a distributed framework called DistBelief, which runs multiple replicas of the same model in parallel. Each replica synchronizes its gradient updates through a centralized server that keeps all the parameters.
  • Both CBOW and skip-gram outperform the NNLM and RNNLM on the Semantic-Syntactic Word Relationship test set.
  • Perhaps the most interesting thing about the results is the types of semantic relationships learned. Some of the interesting examples :
    • Paris – France + Italy = Rome
    • bigger – big + cold = colder
    • scientist – Einstein + Messi = midfielder
    • sushi – Japan + Germany = bratwurst
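
To get a feel for the complexity comparison in the summary above, here is a minimal sketch that plugs numbers into the formulas. The NNLM and RNN LM formulas are the ones quoted above. The CBOW and skip-gram formulas, N x D + D x log2(V) and C x (D + D x log2(V)), come from the paper itself, where C is the maximum distance between the middle word and a context word. The parameter values and function names are illustrative choices of mine.

```python
import math

def nnlm(N, D, H, V, hierarchical=True):
    """Feedforward NNLM: N*D + N*D*H + output term."""
    output = H * math.log2(V) if hierarchical else H * V
    return N * D + N * D * H + output

def rnnlm(H, V, hierarchical=True):
    """RNN LM: H*H + output term."""
    output = H * math.log2(V) if hierarchical else H * V
    return H * H + output

def cbow(N, D, V):
    """CBOW with hierarchical softmax: N*D + D*log2(V)."""
    return N * D + D * math.log2(V)

def skipgram(C, D, V):
    """Skip-gram with hierarchical softmax: C*(D + D*log2(V))."""
    return C * (D + D * math.log2(V))

# Illustrative values only (assumed, not a reproduction of the paper's setup).
N, C, D, H, V = 10, 10, 500, 500, 1_000_000

costs = {
    "NNLM (full softmax)": nnlm(N, D, H, V, hierarchical=False),
    "NNLM": nnlm(N, D, H, V),
    "RNN LM": rnnlm(H, V),
    "Skip-gram": skipgram(C, D, V),
    "CBOW": cbow(N, D, V),
}
for name, cost in costs.items():
    print(f"{name:>20}: {cost:,.0f}")
```

With these values the ordering is CBOW < skip-gram < RNN LM < NNLM, which matches the paper's point that dropping the non-linear hidden layer is what makes training on billions of tokens practical.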