Building a chatbot for ConvAI 2018

I decided to try building a chatbot for the ConvAI NeurIPS challenge 2018. The expectation was not to produce something competitive but to simply learn what it takes to build a chatbot. This post summarizes what I built and the process.

The Personachat dataset was provided for use. Details about the dataset are in this paper. It has about 11k dialogs crowdsourced from Mechanical Turk.

The dataset and some baseline models were made available in Facebook’s ParlAI platform. I decided to use ParlAI for model training and evaluation since it provided built-in integration with the dataset as well as standard evaluation scripts.

The Personachat paper provided baseline models of two main types : (1) retrieval based, i.e. the model picks the top response from a list of known responses, and (2) generative, i.e. the model generates the response token by token. I decided to build a generative model since I had already played around with text generation. I ran a series of experiments to try to beat the baseline models.

Experiment 1 : Language model
I started off using a single hidden layer LSTM with an embedding layer on top of the one hot encoded input and a linear transformation on top of the hidden layer to produce an output token. The ParlAI platform proved to be super unwieldy and I had to make a lot of minor code changes to get good results, e.g. my dictionary was not lowercased by default, which increased the size of the dict by a lot, and my preprocessing settings during testing were different from those during training. I used cross entropy as the loss function and trained until perplexity got worse on the validation set, which resulted in a test perplexity of 90.
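Perplexity is just the exponential of the average per-token cross entropy, so "train until validation perplexity gets worse" amounts to early stopping on the validation loss. Here is a minimal sketch of both pieces. The function names and the validation curve are made up for illustration, and this is not ParlAI's own training loop.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def best_epoch(valid_ppls, patience=1):
    """Return the index of the epoch to keep under simple early stopping.

    Training stops once validation perplexity has failed to improve for
    `patience` consecutive epochs; the best epoch seen so far is returned.
    """
    best, best_idx, bad = float("inf"), -1, 0
    for i, ppl in enumerate(valid_ppls):
        if ppl < best:
            best, best_idx, bad = ppl, i, 0
        else:
            bad += 1
            if bad >= patience:
                break
    return best_idx

# A model that gives probability 1/90 to every correct token has perplexity 90.
uniform_nlls = [-math.log(1 / 90)] * 5
print(round(perplexity(uniform_nlls), 6))

# Made-up validation curve: the first non-improving epoch is epoch 3.
made_up_curve = [150.0, 110.0, 95.0, 97.0, 93.0]
print(best_epoch(made_up_curve))
```

One way to read a perplexity of 90: on average the model is as uncertain as if it were choosing uniformly among 90 tokens at each step.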

Experiment 2 : Language model trained on a different dataset
I tried training on the Cornell dataset instead of the convai2 dataset because I was curious how that would impact performance. I was only able to reach a test perplexity of 400, which was way worse. I realized that when transferring a language model learned on the Cornell dataset to the convai2 task, one problem would be that the vocab might not include key tokens from the convai2 task like person1 and person2.
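A quick way to check for this kind of vocabulary mismatch is to measure how many target-task tokens are missing from the source vocabulary. The sketch below uses two tiny made-up corpora as stand-ins for Cornell and convai2, and the helper names are hypothetical.

```python
from collections import Counter

def build_vocab(sentences):
    """Lowercased whitespace-token vocabulary of a list of sentences."""
    return {tok for s in sentences for tok in s.lower().split()}

def oov_rate(vocab, sentences):
    """Fraction of tokens in `sentences` missing from `vocab`, plus the missing types."""
    tokens = [tok for s in sentences for tok in s.lower().split()]
    missing = Counter(tok for tok in tokens if tok not in vocab)
    return sum(missing.values()) / len(tokens), missing

source = ["How are you doing ?", "I am good ."]            # stand-in for Cornell
target = ["person1 how are you ?", "person2 i am good ."]  # stand-in for convai2

rate, missing = oov_rate(build_vocab(source), target)
print(rate, sorted(missing))
```

In this toy case only the speaker tokens are out of vocabulary, which is exactly the failure mode described above.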

Experiment 3 : Language model with GloVe initialization
I then added GloVe vectors to init my language model. This was based on the realization that my plain language model was essentially trying to learn embeddings during training so pretrained embeddings should only make it better. This turned out to be true and improved the test perplexity to 62.
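Initializing from GloVe boils down to copying a pretrained vector into each row of the embedding table where one exists, and falling back to a small random vector otherwise. Here is a rough sketch in plain Python. It assumes GloVe's text format (a token followed by its space-separated floats on each line), uses a tiny fake file in place of a real download, and the helper names are mine rather than ParlAI's.

```python
import io
import random

def load_glove(handle):
    """Parse GloVe's text format: one token per line followed by its floats."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def init_embeddings(vocab, glove, dim, seed=0):
    """Build an embedding table: GloVe rows where available, random rows otherwise."""
    rng = random.Random(seed)
    table, hits = [], 0
    for token in vocab:
        if token in glove:
            table.append(list(glove[token]))
            hits += 1
        else:
            # Tokens such as "person1" have no pretrained vector.
            table.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return table, hits

# Tiny made-up file standing in for a real GloVe download.
fake_glove = io.StringIO("hello 0.1 0.2 0.3\nyou 0.4 0.5 0.6\n")
glove = load_glove(fake_glove)
vocab = ["hello", "you", "person1"]
table, hits = init_embeddings(vocab, glove, dim=3)
print(hits, table[0])
```

The resulting table would then be used as the initial weights of the model's embedding layer, which can either be frozen or fine-tuned during training.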

To further improve my model, a couple of ideas were using a bidirectional LSTM and attention. While trying to figure out how to use attention, I looked at an example, but it uses a separate encoder and decoder, and there’s a fundamental difference in the way input is fed into a seq2seq model vs the language model. I checked what my current language model does : it just takes a stream of text as input which includes both the person1 and person2 utterances, e.g. “person1 how are you doing ? person2 I am good . How are you ? person1 I am well thanks . What do you do ? person2 I work at a bank ”. This whole stream of text is sent as input and the target is the same stream shifted left by 1 word.

In the seq2seq case on the other hand, we would feed the person1 sentence as input with the person2 sentence as target.
So it doesn’t really make sense to use attention with my language model because attention only really makes sense in the encoder decoder framework.
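The difference between the two framings is easiest to see by building training examples from the same dialog both ways. This is only an illustrative sketch with hypothetical helper names, not what ParlAI does internally.

```python
def lm_example(dialog):
    """Language-model framing: one token stream, target is the input shifted by one."""
    stream = []
    for speaker, utterance in dialog:
        stream.append(speaker)
        stream.extend(utterance.split())
    return stream[:-1], stream[1:]

def seq2seq_examples(dialog):
    """Seq2seq framing: a person1 utterance is the input, the person2 reply is the target."""
    return [
        (src.split(), tgt.split())
        for (src_speaker, src), (tgt_speaker, tgt) in zip(dialog, dialog[1:])
        if src_speaker == "person1" and tgt_speaker == "person2"
    ]

dialog = [
    ("person1", "how are you doing ?"),
    ("person2", "i am good . how are you ?"),
    ("person1", "i am well thanks ."),
]

inputs, targets = lm_example(dialog)
print(inputs[:4], targets[:4])

pairs = seq2seq_examples(dialog)
print(pairs[0])
```

In the language-model framing there is no separate source sequence for a decoder to attend over, which is why encoder-decoder attention doesn't map onto it directly.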

Experiment 4 : Use seq2seq
I simply copied over the for the language model and replaced the language model agent with ParlAI’s built-in seq2seq agent. It just worked out of the box! It was able to consume the convai2 data. I don’t think validation worked though. I looked at what goes in as input to the seq2seq model. I’d thought it would be a person1 utterance as input and a person2 utterance as output but actually, a fixed number of words (100 in this case) of the entire conversation history up to the latest utterance is sent as input. The target is the next utterance.

e.g. input : “do for fun ? ? enjoy you do what . ? makeup about all learning all learning i like to hike and canoe . i am not as busy after graduating . . . chat crazy my for apologize i i yeah this chat is weird ! how old are you ? i am i in my thirties cool , i am in my twenties . what kind of work do you do ? i also take beauty classes . youtube . youtube makeup tutorials . . oh that is so cool ! what do you do since graduating ?”
output : “i work at a restaurant right now , but eventually i would like to teach ! __END__ __NULL__ __NULL__ __NULL__ __NULL__”

This actually made sense because the entire conversation history should inform the next utterance prediction.
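Here is a small sketch of that input construction: flatten the history, keep the most recent tokens, and pad the target. The `__END__` and `__NULL__` tokens are taken from the sample above; the function names and the exact truncation and padding logic are my assumptions and may not match ParlAI's implementation.

```python
def build_seq2seq_input(history, max_tokens=100):
    """Flatten prior utterances and keep only the most recent `max_tokens` tokens."""
    tokens = [tok for utterance in history for tok in utterance.split()]
    return tokens[-max_tokens:]

def pad_target(utterance, length, end="__END__", pad="__NULL__"):
    """Append the end marker, then pad with null tokens up to `length`."""
    tokens = utterance.split() + [end]
    return tokens + [pad] * (length - len(tokens))

history = [
    "how old are you ?",
    "i am in my thirties",
    "cool , i am in my twenties .",
]

# With a budget of 8 tokens only the latest utterance survives.
print(build_seq2seq_input(history, max_tokens=8))
print(pad_target("i work at a restaurant", length=8))
```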

The best validation perplexity during training was 65.7. This is somewhat worse than the language model with GloVe (test perplexity 62) but better than the language model without GloVe (test perplexity 90), with the caveat that this compares a validation number against test numbers.

Conclusion : The best model I ended up with was the LSTM language model with GloVe. Here’s a sample conversation with the model :

Enter Your Message: hello how are you
PERSON1 hello how are you
[LanguageModel]: person2 i ‘m doing well . how are you ?
Enter Your Message: im good thanks . I like your shirt .
PERSON1 im good thanks . I like your shirt .
[LanguageModel]: person2 i love to read . i love to read .
Enter Your Message: that is good to know . What are you reading right now ?
PERSON1 that is good to know . What are you reading right now ?
[LanguageModel]: person2 i ‘m going to school for a new job .
Enter Your Message: that is great ! what program are you in ?
PERSON1 that is great ! what program are you in ?
[LanguageModel]: person2 i ‘m a student . i ‘m a student .
Enter Your Message: what are you studying ?
PERSON1 what are you studying ?
[LanguageModel]: person2 i ‘m a student . i ‘m a teacher .
Enter Your Message: how can you be both ?
PERSON1 how can you be both ?
[LanguageModel]: person2 i ‘m not sure i ‘m not sure

It’s clearly not great but seems to give decent answers to some generic questions. Some things I’d like to work on would be using ParlAI’s integration with Facebook Messenger to test the chatbot and trying a hybrid retrieval and generation based model.

Code :


Efficient Estimation of Word Representations in Vector Space

Efficient Estimation of Word Representations in Vector Space – Mikolov et al.

arXiv:1301.3781

Quick summary
This paper introduced what is now pretty commonly known as word2vec. It shows that a very simple neural network with a language modeling objective can be used to learn word embeddings that have interesting similarity properties e.g. you can do vector arithmetic to move in the semantic space i.e. King – Man + Woman = Queen. The primary differentiator between this work and previous work is that the computational complexity of the model is low and it is able to learn from billions of tokens efficiently.
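The vector arithmetic can be illustrated with a toy example. The 2-dimensional vectors below are hand-made so that one axis loosely means "royalty" and the other "gender". They are not real word2vec embeddings, and the helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The query words themselves are excluded from the candidates.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], query))

# Hand-made toy vectors: [royalty, gender].
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

print(analogy(toy, "king", "man", "woman"))
```

With real embeddings the result is only approximately equal to the target vector, so the answer is taken to be the nearest neighbour by cosine similarity.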

Why I liked reading it
This paper took me behind the curtain of the incredibly popular word2vec library. I was surprised to find how simple the models are given how big their impact on the field of NLP has been.

Core ideas
Distributional similarity : Words that have similar contexts have similar meanings.

Detailed Summary

  • This paper builds on the neural network language model (NNLM) proposed by Bengio et al. in 2001. In that model, it is expensive to compute the mapping from the projection layer to the hidden layer and from the hidden layer to the output layer, which makes it prohibitively expensive to run on large vocabularies.
  • The model complexity evaluation is interesting and novel to me. For the feedforward NNLM, the complexity per training example is N x D + N x D x H + H x V, where N is the number of context words, D the embedding dimension, H the hidden layer size and V the vocabulary size. The first term is for mapping from the input layer to the projection layer, the second term is for mapping from the projection layer to the hidden layer and the third term is for mapping from the hidden layer to the output layer. Using hierarchical variations of the softmax, the output layer term can be reduced to H x log(V), so the dominating term is N x D x H.
  • For the RNN LM, the complexity is H x H + H x V, where H x V is the dominating term, but with hierarchical softmax the output term can be reduced to H x log(V), so the dominating term is H x H.
  • The models proposed in this paper are super simple. The non linear activation function is removed. There are two models :
    • Continuous bag of words (CBOW) : All the context words (both before and after the target word) are projected to an embedding space and averaged. The projection undergoes a linear transformation to predict the target (middle) word.
    • Continuous skip gram : Tries to predict context words based on the middle word.
  • Models were trained on a Google News corpus which contains about 6 billion tokens. The authors varied the dimensionality of the word vectors and the number of tokens used for training and found that performance increased with either factor until it reached a point of diminishing returns.
  • An interesting thing was that the models were trained on a distributed framework called DistBelief by running multiple replicas of the same model in parallel where each replica synchronizes its gradient updates through a centralized server that keeps all the parameters.
  • Both CBOW and Skip gram outperform the NNLM and RNNLM on the Semantic-Syntactic Word Relationship test set.
  • Perhaps the most interesting thing about the results is the types of semantic relationship learned. Some of the interesting examples :
    • Paris – France + Italy = Rome
    • bigger – big + cold = colder
    • scientist – Einstein + Messi = midfielder
    • sushi – Japan + Germany = bratwurst
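To get a feel for the complexity expressions in the detailed summary, it helps to plug in numbers. The sketch below evaluates the NNLM and RNN LM formulas with and without a hierarchical softmax. For comparison it also evaluates the per-example costs the paper gives for CBOW, N x D + D x log2(V), and skip gram, C x (D + D x log2(V)), where C is the maximum distance of the context words. The specific sizes are illustrative choices of mine, not the paper's experimental settings.

```python
import math

def nnlm_cost(n, d, h, v, hierarchical=True):
    """Feedforward NNLM: N*D + N*D*H + output term (H*V, or H*log2(V) if hierarchical)."""
    out = h * (math.log2(v) if hierarchical else v)
    return n * d + n * d * h + out

def rnnlm_cost(h, v, hierarchical=True):
    """RNN LM: H*H + output term (H*V, or H*log2(V) if hierarchical)."""
    out = h * (math.log2(v) if hierarchical else v)
    return h * h + out

def cbow_cost(n, d, v):
    """CBOW with hierarchical softmax: N*D + D*log2(V)."""
    return n * d + d * math.log2(v)

def skipgram_cost(c, d, v):
    """Skip gram with hierarchical softmax: C*(D + D*log2(V))."""
    return c * (d + d * math.log2(v))

# Illustrative sizes (assumptions, not the paper's settings).
N, D, H, V, C = 10, 500, 500, 1_000_000, 10

for name, cost in [
    ("NNLM, full softmax", nnlm_cost(N, D, H, V, hierarchical=False)),
    ("NNLM, hierarchical", nnlm_cost(N, D, H, V)),
    ("RNNLM, full softmax", rnnlm_cost(H, V, hierarchical=False)),
    ("RNNLM, hierarchical", rnnlm_cost(H, V)),
    ("CBOW", cbow_cost(N, D, V)),
    ("Skip gram", skipgram_cost(C, D, V)),
]:
    print(f"{name:22s} {cost:>15,.0f}")
```

Removing the hidden layer is what eliminates the N x D x H and H x H terms, which is where most of the savings over the NNLM and RNN LM come from.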

Deep contextualized word representations (aka ELMo)

Quick Summary :
ELMo (Embeddings from Language Models) representations are a type of context dependent word embeddings. This means that the embedding for each word in the input sentence depends on the context i.e. the words around it. They are able to deal with polysemy (words with different meanings based on context e.g. play as a noun vs play as a verb) unlike context independent word embeddings like word2vec, GloVe etc. It is common to use pretrained context independent word embeddings like word2vec as input to an RNN for tasks like sentence classification, PoS tagging etc. In this process, the hidden layer of the RNN essentially produces a context dependent word representation at each time step. Since this process usually takes place in a supervised setting, the amount of labeled data can be a limiting factor. By pretraining these context dependent word representations in an unsupervised setting instead, we can obtain much richer representations that can then be used directly as input for downstream tasks. The downstream task models then use a limited amount of labeled data to convert these into task specific embeddings.
This idea is not a new one. The difference between the previous works and ELMo looks like a small but important one — previous works use a multi layer LSTM and then pick the final layer output as the embedding, whereas ELMo uses a linear combination of all the layers. The weights for combining the layers are learned while training for a downstream task. The reason it works better is because it has been shown that different layers of the LSTM are able to capture different linguistic phenomena e.g. the lower layers are able to predict PoS tags better whereas the higher ones are able to learn word sense. So by combining all layers, we are leveraging the specific abilities of each one.


The core idea : 
Generate context dependent word vectors using learned functions of the internal states of a deep bidirectional language model (biLM), which is pretrained on a large text corpus.


Why I liked reading it :
This was an introduction to context dependent word embeddings for me. I think the general idea is super cool because it pushes a huge chunk of the learning from the supervised into the unsupervised domain.


Detailed Summary :
  • The most commonly used word embeddings today like word2vec and GloVe are context independent and fail to capture polysemy. For instance, the word “play” will have only one embedding vector, even though it could have different meanings based on context. This paper proposes a type of context dependent embedding i.e. the embedding of a word is a function of the entire sentence. My first thought was that there are an infinite number of possible contexts so it’s impossible to provide pretrained context dependent embedding vectors. Sure enough, it doesn’t look like there’s a dictionary of vectors that one can use in this case. Instead, to use these embeddings, one must run the input sentence through a pretrained network to obtain the embedding of each word.
  • Similar to word2vec, the network is pretrained on a language modeling objective. This is why they are named Embeddings from Language Models (ELMo).
  • The model first runs a CNN over the input word’s characters to obtain a context independent representation of the word and then passes it through L layers of bidirectional LSTMs. The reason for using a bidirectional LSTM is that it captures context both before and after the target word. Each LSTM layer can be thought of as producing a context dependent representation. Including both directions and the context independent representation produced by the CNN, there are 2L + 1 representations for each token. The final ELMo embedding is obtained by taking a linear combination of these 2L + 1 embeddings. The weights for the linear combination can be learned by the end task model in a supervised setting.
  • During pretraining, the top layer embedding is used to predict the next word using a softmax layer. In the final model, L is set to 2. It’s not clear why 2 was chosen and there’s no analysis on greater values of L.
  • To evaluate the embeddings, the authors feed them into the state of the art models for 6 different tasks – Question Answering on the SQuAD dataset, Textual Entailment on the SNLI dataset, Semantic role labeling (SRL) on the OntoNotes benchmark, Coreference Resolution on the CoNLL 2012 shared task, NER on the CoNLL 2003 shared task and Sentiment Analysis on the Stanford Sentiment Treebank. The authors find that the ELMo embeddings help achieve SoTA results on all 6 tasks.
  • The authors perform some ablation analysis to show that combining all layers of the LSTM helps performance. They do this by comparing the baseline, a model using only the top layer of the LSTM, and a model using a combination of all layers. They also show that the first LSTM layer is better at predicting PoS tags than the final layer (by 0.5 accuracy points) and the final layer is better at word sense disambiguation than the first (by 1.6 F1 points). This leads to the conclusion that the first layer captures syntactic information better than the second and the second captures semantic information better, which explains why it’s better to combine all layers. It would be interesting to see how this behaviour changes with more layers.
  • Another interesting benefit of ELMo embeddings is that they increase the sample efficiency of task models i.e. models with ELMo need fewer training examples and fewer parameter updates to achieve SoTA results. There was no reason given for this behaviour in the paper.
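The layer combination described above can be sketched in a few lines. In the paper, the forward and backward states at each layer are concatenated, the per-layer weights are softmax-normalized, and the result is scaled by a learned scalar gamma. The sketch below follows that form for a single token. The layer vectors and weights are made up, and in a real model the raw weights and gamma would be learned by the downstream task rather than set by hand.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_mix(layer_vectors, raw_weights, gamma=1.0):
    """Weighted sum of per-layer vectors for one token.

    layer_vectors: one vector per layer (character CNN output, then each LSTM layer).
    raw_weights:   one unnormalized scalar per layer, normalized here with a softmax.
    gamma:         overall scale applied to the mixed vector.
    """
    weights = softmax(raw_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(w * vec[i] for w, vec in zip(weights, layer_vectors))
        for i in range(dim)
    ]

# Made-up 2-d representations of one token from three layers.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

uniform_mix = scalar_mix(layers, raw_weights=[0.0, 0.0, 0.0])
print(uniform_mix)  # equal weights: plain average of the layers

# A task that mostly cares about the first LSTM layer would push its weight up.
syntax_mix = scalar_mix(layers, raw_weights=[0.0, 50.0, 0.0])
print(syntax_mix)
```

Using only the top layer is the special case where all the weight sits on the last vector, which is what the ablation compares against.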

Breaking the Softmax Bottleneck: A High Rank RNN Language Model

Quick summary :
Neural language models typically use a softmax on top of a dot product of the hidden layer with word embeddings. This can be seen as a matrix factorization problem where the goal is to find a factorization of the conditional log probability matrix A that corresponds to the true data distribution. The authors show that to effectively express the true data distribution of natural language using the softmax based language model, the dimension d of the word embeddings must be high. Typically, these embeddings are O(100) dimensional and therefore, these language models suffer from what is called the softmax bottleneck. A mixture of softmaxes (MoS) model is proposed as a solution to get rid of the bottleneck. It works because, with a mixture of softmaxes, A becomes a nonlinear function of the context vectors and word embeddings rather than a plain product HW’, so it can be trained to achieve a high rank.

The core idea :
Softmax based language models with distributed (output) word embeddings do not have enough capacity to model natural language.

Why I liked reading it :

  • Grounded my understanding of basic linear algebra concepts in the simple NLP task of language modeling.
  • Uses theoretical math with lemmas and proofs to arrive at a state of the art model.

Detailed summary :

  • Language modeling mostly relies on breaking the joint probability of a sequence of tokens into a product of conditional probabilities of the next word given a context and then modeling these conditional probabilities P(x|c). This is taught even in probability 101 but I didn’t know that it is known as an “auto regressive factorization”.
  • The standard approach in neural language models (NLMs) is to use a recurrent neural network (RNN) to encode the context into a fixed length vector (also called the hidden state), take its dot product with each word embedding and pass the results into a softmax layer to give a categorical probability distribution over the vocabulary. It sounds very similar to the model used in the word2vec paper.
  • The input that goes into a softmax function is called a “logit”. In the class of models being referred to in this paper, the logit is the dot product of a hidden state and a word vector.
  • One of the key concepts in this paper is the matrix A which represents the conditional log probabilities of the true data distribution. Imagine that a language has M tokens and N different contexts. For each token x and context c, we have a true probability P(x|c). Aij is simply equal to the conditional log probability of the jth token given the ith context. For the softmax case, Aij is simply equal to the logit for the ith context and jth token. Another way to think about it that I found helpful is that each row of A is the true log probability distribution over tokens for a given context.
    An interesting feature is that there are an infinite number of possible matrices A that could correspond to the true data distribution because of the shift invariance of the softmax function i.e. softmax(x + c) = softmax(x). The authors show that this property can be used to prove that adding an arbitrary rowwise shift to A will give a matrix that also corresponds to the true data distribution.
  • Our goal then is to find logits A’ij s.t. A’ corresponds to the true data distribution. The authors describe the problem as finding a matrix factorization HW’ (H times the transpose of W) for A’, where H is an N x d matrix consisting of all the possible hidden states and W is an M x d matrix consisting of all word embeddings. (Why is H Nxd? Why are hidden states d dimensional?)
  • For natural language, language modeling is equivalent to trying to factorize the matrix A that corresponds to the true data distribution of natural language into matrices H and W. Using some linear algebra, it is possible to show that a softmax model with embedding dimension d can only express log probability matrices of rank at most d + 1. The final softmax bottleneck statement says that if d < rank(A) – 1, then the model cannot express the true data distribution of natural language.
    The exact inequality is probably not as important as the concept : the dimension d is typically O(10^2), whereas rank(A) for natural language can be O(10^5).
  • The most obvious fix is to increase d but this increases the number of parameters by too much, leading to potential overfitting. Another one is to use a non parametric model like n-grams but this again can lead to overfitting because of the large number of parameters.
  • The proposed solution is a simple one : use a mixture of k softmaxes. It was quite hard to figure out from reading the paper why this method alleviates the bottleneck. The paper states “Because Â_MoS is a nonlinear function (log_sum_exp) of the context vectors and word embeddings, Â_MoS can be arbitrarily high rank. As a result, MoS does not suffer from the rank limitation, compared to Softmax.”
    Because of my lack of linear algebra skills, I was unable to figure out why this is true and didn’t find any answers online. I eventually emailed the authors to clarify this point and Zihang Dai was generous enough to respond :

“As you may know, a linear operation (matrix multiplication) does not change the rank. In contrast, a non-linear operation can (has the capacity) change (not necessarily increase) the rank of a matrix.
However, it is not guaranteed that every non-linear operation will increase the rank. In other words, log_sum_exp may increase the rank of some matrices, but not others.

However, remember that
* (1) the inputs we give to log_sum_exp in MoS are trainable
* (2) the output of log_sum_exp will have better performance if it has a higher rank
* (3) log_sum_exp has the capacity to induce a higher rank for some matrices
Puting these points together, it means MoS has the capacity to induce a higher rank log probability matrix ( i.e. A ^ MoS), and it can be trained to exploit this advantage.”

  • First of all, I just wanted to call out how cool it is that he wrote such a thorough response. While this response cleared up a lot for me, it’s still not quite clear to me how nonlinear operations can change the rank but linear operations can’t. I’m guessing it has something to do with linear operations not having any effect on the linear dependence between vectors of the matrix.
  • I did not dive deep into the results but here’s a bird’s eye view — this technique is able to improve the state of the art for language modeling on the Penn Treebank and Wikitext-2 datasets. The authors also empirically show that the A matrix obtained using MoS is indeed high rank and that as rank increases, performance improves.
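The rank argument can be checked numerically on a toy problem. A product HW’ can never have rank above d because every row of it is a combination of the same d columns of W, and subtracting the per-row log normalizer adds at most one more to the rank. An elementwise nonlinearity such as the log of a sum of softmaxes is not bound by this. The sketch below is a heavy simplification of MoS (random context vectors, uniform mixture weights, nothing trained), and the pure-Python rank helper is only meant for tiny matrices.

```python
import math
import random

def log_softmax(row):
    """Log of the softmax of a list of logits, computed stably."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def matmul_t(h, w):
    """Return H W^T for matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(hr, wr)) for wr in w] for hr in h]

def rank(matrix, tol=1e-8):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    rows, cols, r = len(m), len(m[0]), 0
    for c in range(cols):
        if r == rows:
            break
        pivot = max(range(r, rows), key=lambda i: abs(m[i][c]))
        if abs(m[pivot][c]) < tol:
            continue  # no usable pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, rows):
            f = m[i][c] / m[r][c]
            for j in range(c, cols):
                m[i][j] -= f * m[r][j]
        r += 1
    return r

def random_matrix(rng, n, d):
    return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

rng = random.Random(0)
N, M, d, K = 8, 8, 2, 3  # contexts, tokens, embedding dim, mixture components
W = random_matrix(rng, M, d)

# Single softmax: A = log softmax(H W^T), so rank(A) <= d + 1.
H = random_matrix(rng, N, d)
A_softmax = [log_softmax(row) for row in matmul_t(H, W)]

# Mixture of K softmaxes sharing W, with uniform mixture weights (a simplification).
Hs = [random_matrix(rng, N, d) for _ in range(K)]
component_logps = [[log_softmax(row) for row in matmul_t(Hk, W)] for Hk in Hs]
A_mos = [
    [math.log(sum(math.exp(lp[i][j]) for lp in component_logps) / K) for j in range(M)]
    for i in range(N)
]

rank_softmax, rank_mos = rank(A_softmax), rank(A_mos)
print("softmax rank:", rank_softmax, "(bound is d + 1 =", d + 1, ")")
print("MoS rank:", rank_mos)
```

The softmax rank is capped at d + 1 no matter how H and W are chosen. The MoS matrix has no such cap and will typically come out with a higher numerical rank, though the exact number depends on the random draw and the tolerance.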

Scratchpad (i.e. random thoughts/questions/comments) :

– What is a non parametric model exactly? What is its significance?
– What is the rank of a matrix intuitively?
– Generally, improvement in performance of any ML model can be obtained by :
     – Higher capacity model (i.e. model was underfitting)
     – Better regularization (i.e. model was overfitting because capacity was higher than required to fit the data)
     – Optimization tricks (e.g. adding momentum)
     – More training data
– I read about the idea of a “logit” when learning about logistic regression. I also learned recently that softmax is just a multiclass generalization of logistic regression. So I’m guessing the logit being referred to in this paper is the same idea.
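The connection between softmax and logistic regression in the last point can be checked directly: with two classes, the softmax probability of the first class equals the sigmoid of the difference between the two logits. This follows from the shift invariance of softmax, since subtracting the second logit from both leaves the output unchanged. A small sketch with arbitrary example logits:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function used in logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))

a, b = 2.0, 0.5  # arbitrary logits for a two-class problem

two_class = softmax([a, b])
print(two_class[0], sigmoid(a - b))  # the two values agree

# Shift invariance: adding the same constant to every logit changes nothing.
shifted = softmax([a + 7.0, b + 7.0])
print(max(abs(p - q) for p, q in zip(two_class, shifted)))
```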


Think like a machine

The most exciting problem to me is programming a machine to think and act like a human would. It’s ironic then that today I’m going to talk about how thinking like a machine has helped me become more productive. It has helped me solve one particular problem that has plagued me for years – reacting negatively when I fail to execute well on my plans. In fact, “reacting negatively” is an understatement. I would plunge into a downward spiral of stress eating and crippling self deprecation that would render me unable to do anything productive for days. For instance, if I planned to wake up at 6am every day and missed it a few days in a row. Or if I failed a test or an interview. Or if I dreamt of solving AI but woke up every day and realized that I’m just an average, over-reaching programmer.

Over the years I’ve learned to think of myself as a machine. A machine would continue performing at its peak, ignoring any errors, or, if it’s smart enough, plunge into rapid self improvement to try to prevent itself from making the error again. If you think about it, this is the nature of technology, or rather of any evolutionary process : Try to change something. It might work really well or it might not work as expected. Try changing something else.

It’s the only sensible thing to do.

This might seem obvious but it wasn’t to me and internalizing it has produced huge gains in productivity. If I make plans to accomplish something over a weekend and wake up on Sunday having accomplished nothing, it’s become easier for me to get to work and try to make up for the lost time instead of spending Sunday crying about that lost time. And while making plans for next weekend, I’ll simply look back and try to figure out something small I could change to avoid this problem. This is super important because I like to plan out everything. And plans can be derailed by the tiniest unforeseen occurrences. This is one of the most effective ways I’ve found to deal with these occurrences. I know this isn’t the most actionable idea but if you suffer from a similar problem, I would encourage you to give it a thought.