Deep contextualized word representations (aka ELMo)

https://arxiv.org/abs/1802.05365

Quick Summary :
ELMo (Embeddings from Language Models) representations are a type of context dependent word embeddings. This means that the embedding for each word in the input sentence depends on the context, i.e. the words around it. They are able to deal with polysemy (words with different meanings based on context, e.g. play as a noun vs. play as a verb), unlike context independent word embeddings like word2vec and GloVe. It is common to use pretrained context independent word embeddings like word2vec as input to an RNN for tasks like sentence classification, PoS tagging etc. In this process, the hidden layer of the RNN essentially produces a context dependent word representation at each time step. Since this usually takes place in a supervised setting, the amount of labeled data can be a limiting factor. By pretraining these context dependent word representations in an unsupervised setting instead, we can obtain much richer representations that can then be used directly as input for downstream tasks. The downstream task models then use a limited amount of labeled data to convert these into task specific embeddings.
This idea is not a new one. The difference between previous work and ELMo looks small but important: previous approaches use a multi-layer LSTM and then pick the final layer's output as the embedding, whereas ELMo uses a linear combination of all the layers. The weights for combining the layers are learned while training for a downstream task. The reason this works better is that different layers of the LSTM have been shown to capture different linguistic phenomena, e.g. the lower layers are better at predicting PoS tags whereas the higher ones are better at capturing word sense. By combining all layers, we leverage the specific abilities of each one.
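
To make the layer-combination idea concrete, here is a minimal PyTorch-style sketch of how a downstream model might mix the biLM layers. The class name ScalarMix and all the shapes are my own placeholders rather than anything from the paper, but the computation mirrors ELMo's equation: softmax-normalized weights over the layer representations, scaled by a learned scalar.

import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    # Sketch of ELMo's layer combination: softmax-normalized weights s_j
    # over the biLM layer representations, times a learned scalar gamma.
    # Both are trained together with the downstream task model.
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_reps):
        # layer_reps: a list of (batch, seq_len, dim) tensors, one per layer
        # (token layer plus each biLSTM layer, forward/backward concatenated).
        weights = torch.softmax(self.s, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_reps))
        return self.gamma * mixed

# Hypothetical usage with a 2-layer biLM (token layer + 2 biLSTM layers):
mix = ScalarMix(num_layers=3)
layer_reps = [torch.randn(8, 12, 1024) for _ in range(3)]  # fake biLM outputs
elmo_vectors = mix(layer_reps)  # (8, 12, 1024), fed to the task model

In the actual setup, layer_reps would come from the frozen pretrained biLM, and the resulting ELMo vector is typically concatenated with the task model's own token representation before its first layer.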

 

The core idea : 
Generate context dependent word vectors using learned functions of the internal states of a deep bidirectional language model (biLM), which is pretrained on a large text corpus.

 

Why I liked reading it :
This was an introduction to context dependent word embeddings for me. I think the general idea is super cool because it pushes a huge chunk of the learning from the supervised into the unsupervised domain.

 

Detailed Summary :
  • The most commonly used word embeddings today, like word2vec and GloVe, are context independent and fail to capture polysemy. For instance, the word “play” will have only one embedding vector, even though it could have different meanings based on context. This paper proposes a type of context dependent embedding, i.e. the embedding of a word is a function of the entire sentence. My first thought was that there is an infinite number of possible contexts, so it’s impossible to provide pretrained context dependent embedding vectors. Sure enough, it doesn’t look like there’s a dictionary of vectors that one can use in this case. Instead, to use these embeddings, one must run the input sentence through a pretrained network to obtain the embedding of each word.
  • Similar to word2vec, the network is pretrained on a language modeling objective. This is why they are named Embeddings from Language Models (ELMo).
  • The model first runs a CNN over the input word’s characters to obtain a context independent representation of the word and then passes it through L layers of bidirectional LSTMs. The reason for using a bidirectional LSTM is that it captures context both before and after the target word. Each LSTM layer can be thought of as producing a context dependent representation. Including both directions and the context independent representation produced by the CNN, there are 2L + 1 representations for each token. The final ELMo embedding is obtained by taking a linear combination of these (the forward and backward vectors at each layer are concatenated, so the combination is effectively over L + 1 layer representations, as in the sketch above). The weights for the linear combination can be learned by the end task model in a supervised setting.
  • During pretraining, the top layer representation is used to predict the next word via a softmax layer. In the final model, L is set to 2. It’s not clear why 2 was chosen and there is no analysis of larger values of L. A toy sketch of this bidirectional pretraining objective follows this list.
  • To evaluate the embeddings, the authors feed them into state-of-the-art models for 6 different tasks: Question Answering on the SQuAD dataset, Textual Entailment on the SNLI dataset, Semantic Role Labeling (SRL) on the OntoNotes benchmark, Coreference Resolution on the CoNLL 2012 shared task, NER on the CoNLL 2003 shared task, and Sentiment Analysis on the Stanford Sentiment Treebank. The authors find that the ELMo embeddings help achieve SoTA results on all 6 tasks.
  • The authors perform some ablation analysis to show that combining all layers of the LSTM helps performance. They do this by comparing the baseline model, a version that uses only the top layer of the LSTM, and a version that uses a combination of all layers. They also show that the first LSTM layer is better at predicting PoS tags than the final layer (by 1.6 F1 points) and the final layer is better at Word Sense Disambiguation than the first (by 0.5 F1 points). This leads to the conclusion that the first layer captures syntactic information better than the second, while the second captures semantic information better than the first, which explains why it’s better to combine all layers. It would be interesting to see how this behaviour changes with more layers.
  • Another interesting benefit of ELMo embeddings is that they increase the sample efficiency of task models, i.e. models with ELMo need fewer training examples and fewer parameters to achieve SoTA results. No reason for this behaviour is given in the paper.
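
For readers who want the pretraining step in code form, below is a heavily simplified, hypothetical sketch of the bidirectional language modeling objective described in the bullets above. A plain word embedding stands in for the paper’s character CNN and the dimensions are arbitrary, but, as in the paper, the token embedding and the softmax layer are shared between directions while the two LSTMs are separate: the forward LSTM is trained to predict the next token and the backward LSTM the previous one.

import torch
import torch.nn as nn

class TinyBiLM(nn.Module):
    # Toy stand-in for the paper's biLM: shared token embedding (in place of
    # the character CNN), separate stacked LSTMs per direction, shared softmax.
    def __init__(self, vocab_size=1000, dim=64, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.fwd = nn.LSTM(dim, dim, num_layers=num_layers, batch_first=True)
        self.bwd = nn.LSTM(dim, dim, num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)  # shared softmax layer

    def forward(self, tokens):
        x = self.embed(tokens)          # (batch, seq_len, dim)
        h_fwd, _ = self.fwd(x)          # left-to-right states
        h_bwd, _ = self.bwd(x.flip(1))  # right-to-left states (sequence reversed)
        return self.out(h_fwd), self.out(h_bwd.flip(1))

def bilm_loss(model, tokens):
    # Forward LM predicts token t+1 from tokens <= t; backward LM predicts
    # token t-1 from tokens >= t. The two losses are summed.
    logits_fwd, logits_bwd = model(tokens)
    ce = nn.CrossEntropyLoss()
    vocab = logits_fwd.size(-1)
    loss_fwd = ce(logits_fwd[:, :-1].reshape(-1, vocab), tokens[:, 1:].reshape(-1))
    loss_bwd = ce(logits_bwd[:, 1:].reshape(-1, vocab), tokens[:, :-1].reshape(-1))
    return loss_fwd + loss_bwd

# Hypothetical usage on a random batch of token ids:
model = TinyBiLM()
tokens = torch.randint(0, 1000, (4, 20))
loss = bilm_loss(model, tokens)
loss.backward()

After pretraining, the hidden states of every LSTM layer (together with the token representation) are what get exposed to downstream models and mixed with the learned weights discussed earlier.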
