Building a chatbot for ConvAI 2018

I decided to try building a chatbot for the ConvAI NeurIPS challenge 2018. The expectation was not to produce something competitive but to simply learn what it takes to build a chatbot. This post summarizes what I built and the process.

The Personachat dataset was provided for use. Details about the dataset are in this paper. It has about 11k dialogs crowdsourced from Mechanical Turk.

The dataset and some baseline models were made available in Facebook’s ParlAI platform. I decided to use ParlAI for model training and evaluation since it provided built-in integration with the dataset as well as standard evaluation scripts.

The Personachat paper provided some baseline models, primarily of two types: retrieval-based, where the model picks the top response from a list of known responses, and generative, where the model generates the response tokens from scratch. I decided to build a generative model since I had already played around with text generation. I ran a series of experiments to try to beat the baseline models.

Experiment 1 : Language model
I started off using a single hidden layer LSTM with an embedding layer on top of the one-hot encoded input and a linear transformation on top of the hidden layer to produce an output token. The ParlAI platform proved to be super unwieldy and I had to make a lot of minor code changes to get good results. For example, my dictionary was not lowercased by default, which increased the size of the dictionary by a lot, and my preprocessing settings during testing were different from those during training. I used cross entropy as the loss function and trained until perplexity got worse on the validation set, which resulted in a test perplexity of 90.
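As a rough illustration of the two quantities mentioned above (not the code used in these experiments), perplexity is the exponential of the mean cross-entropy over the correct next tokens, and "train until validation perplexity gets worse" is a simple early-stopping rule. The probabilities and per-epoch numbers below are made up for the example.

```python
import math

def cross_entropy(target_probs):
    """Mean negative log-likelihood (in nats) of the correct next tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def perplexity(target_probs):
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(cross_entropy(target_probs))

def stop_epoch(val_perplexities):
    """Index of the last epoch before validation perplexity first got worse."""
    for i in range(1, len(val_perplexities)):
        if val_perplexities[i] > val_perplexities[i - 1]:
            return i - 1
    return len(val_perplexities) - 1

# Hypothetical probabilities the model assigned to each correct next token.
probs = [0.25, 0.25, 0.25, 0.25]
ppl = perplexity(probs)

# Hypothetical validation perplexities, one per epoch.
best = stop_epoch([120.0, 100.0, 95.0, 97.0])
print(ppl, best)
```

A model that is equally unsure between four tokens at every step has a perplexity of about 4, which is a handy way to read numbers like 90 or 62: lower means the model is "choosing" among fewer plausible next words.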

Experiment 2 : Language model trained on a different dataset
I tried training on the Cornell dataset instead of the convai2 dataset because I was curious how that would impact performance. I was only able to reach a test perplexity of 400, which was way worse. I realized that when transferring a language model learned on the Cornell dataset to the convai2 task, one problem would be that the vocab might not include key tokens from the convai2 task like person1 and person2.
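One way to check for this kind of vocabulary mismatch is to count which tokens in the target data are missing from the vocab built on the source data. This is only a sketch with toy sentences standing in for the two datasets; `build_vocab` and `oov_tokens` are hypothetical helper names, not ParlAI functions.

```python
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Lowercased whitespace-token vocab from a list of sentences."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    return {tok for tok, c in counts.items() if c >= min_count}

def oov_tokens(source_vocab, target_sentences):
    """Count target tokens that the source vocab has never seen."""
    counts = Counter(tok for s in target_sentences for tok in s.lower().split())
    return {tok: c for tok, c in counts.items() if tok not in source_vocab}

# Toy stand-ins for the two corpora (assumed data, for illustration only).
cornell_like = ["How are you doing ?", "I am good ."]
convai2_like = ["person1 how are you doing ?", "person2 i am good ."]

vocab = build_vocab(cornell_like)
missing = oov_tokens(vocab, convai2_like)
print(missing)
```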

Experiment 3 : Language model with GloVe initialization
I then used GloVe vectors to initialize the embedding layer of my language model. This was based on the realization that my plain language model was essentially trying to learn embeddings from scratch during training, so starting from pretrained embeddings should only make it better. This turned out to be true and improved the test perplexity to 62.
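The idea can be sketched as follows, assuming the usual GloVe text format of one word per line followed by its space-separated vector values. Words found in the pretrained vectors get those vectors; the rest fall back to a small random init. The three-dimensional "vectors" and the helper names are made up for the example, and the init range is an arbitrary choice.

```python
import random

def parse_glove_lines(lines):
    """Parse 'word v1 v2 ...' lines into a dict of word -> list of floats."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def init_embeddings(vocab, pretrained, dim, seed=0):
    """Build one embedding row per vocab word; count how many were pretrained."""
    rng = random.Random(seed)
    matrix, hits = [], 0
    for word in vocab:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
            hits += 1
        else:
            # Word not covered by the pretrained vectors: small random init.
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix, hits

# Toy stand-in for a GloVe file (assumed values, not real GloVe vectors).
toy_glove = ["hello 0.1 0.2 0.3", "you 0.4 0.5 0.6"]
vocab = ["hello", "you", "person1"]

pretrained = parse_glove_lines(toy_glove)
matrix, hits = init_embeddings(vocab, pretrained, dim=3)
print(hits, "of", len(vocab), "words initialized from pretrained vectors")
```

The resulting matrix would then be copied into the model's embedding layer before training starts.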

To further improve my model, I had a couple of ideas: using a bidirectional LSTM and adding attention. While trying to figure out how to use attention, I looked at an example, but it uses a separate encoder and decoder, and there’s a fundamental difference in the way input is fed into a seq2seq model vs a language model. I confirmed what my current language model does: it just takes a stream of text as input, which includes both the person1 and person2 utterances, e.g. “person1 how are you doing ? person2 I am good . How are you ? person1 I am well thanks . What do you do ? person2 I work at a bank ”. This whole stream of text is sent as input and the target is the same stream shifted left by 1 word.
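The "shifted left by 1 word" setup can be shown in a few lines. This is a minimal sketch using plain whitespace tokenization, not the ParlAI agent's actual batching code.

```python
def lm_example(stream):
    """Split a text stream into (inputs, targets) for next-word prediction."""
    tokens = stream.split()
    # At position i the model sees tokens[i] and must predict tokens[i + 1].
    return tokens[:-1], tokens[1:]

stream = "person1 how are you doing ? person2 i am good ."
inputs, targets = lm_example(stream)

for inp, tgt in list(zip(inputs, targets))[:3]:
    print(inp, "->", tgt)
```

Note that the speaker markers are just ordinary tokens in the stream, so the model has to learn to predict “person2” after a person1 utterance like any other word.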

In the seq2seq case, on the other hand, we would feed the person1 sentence as input with the person2 sentence as target.
So it doesn’t really make sense to use attention with my language model, because the kind of attention used in that example has the decoder attend over the encoder’s states, which only exist in the encoder-decoder framework.
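For contrast with the single-stream language model input, here is a sketch of how the same conversation would be cut into seq2seq training pairs under the scheme just described. The `seq2seq_pairs` helper and the (speaker, utterance) representation are assumptions made for the example.

```python
def seq2seq_pairs(turns):
    """Pair each person1 utterance with the person2 reply that follows it."""
    pairs = []
    for (spk_a, utt_a), (spk_b, utt_b) in zip(turns, turns[1:]):
        if spk_a == "person1" and spk_b == "person2":
            pairs.append((utt_a, utt_b))  # (encoder input, decoder target)
    return pairs

turns = [
    ("person1", "how are you doing ?"),
    ("person2", "i am good . how are you ?"),
    ("person1", "i am well thanks . what do you do ?"),
    ("person2", "i work at a bank"),
]

pairs = seq2seq_pairs(turns)
for source, target in pairs:
    print("input :", source)
    print("target:", target)
```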

Experiment 4 : Use seq2seq
I simply copied over what I had for the language model and replaced the language model agent with ParlAI’s built-in seq2seq agent. It just worked out of the box! It was able to consume the convai2 data, although I don’t think validation worked. I looked at what goes in as input to the seq2seq model. I’d thought it would be a person1 utterance as input and a person2 utterance as output, but what actually happens is that a fixed number of words (100 in this case) of the entire conversation history, up to the latest utterance, is sent as input. The target is the next utterance.

e.g. input : “do for fun ? ? enjoy you do what . ? makeup about all learning all learning i like to hike and canoe . i am not as busy after graduating . . . chat crazy my for apologize i i yeah this chat is weird ! how old are you ? i am i in my thirties cool , i am in my twenties . what kind of work do you do ? i also take beauty classes . youtube . youtube makeup tutorials . . oh that is so cool ! what do you do since graduating ?”
output : “i work at a restaurant right now , but eventually i would like to teach ! __END__ __NULL__ __NULL__ __NULL__ __NULL__”

This actually made sense because the entire conversation history should inform the next utterance prediction.
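The general shape of this input/target construction can be sketched as below: concatenate the history, keep only the most recent N tokens, and end the target with an end marker plus padding up to a fixed length. This is a simplified illustration and not ParlAI's implementation; it does not try to reproduce the exact token ordering seen in the sample input above, and the helper names are hypothetical. The `__END__` and `__NULL__` tokens are taken from the sample output.

```python
END, NULL = "__END__", "__NULL__"

def build_input(history, max_tokens=100):
    """Flatten the conversation history and keep the last max_tokens tokens."""
    tokens = [tok for utt in history for tok in utt.split()]
    return tokens[-max_tokens:]

def build_target(utterance, length):
    """Append the end marker, then pad with NULL up to `length` tokens."""
    tokens = utterance.split() + [END]
    # If the target is already longer than `length`, no padding is added.
    return tokens + [NULL] * (length - len(tokens))

history = [
    "how old are you ?",
    "i am in my thirties",
    "cool , i am in my twenties .",
]

# A small window (8 instead of 100) so the truncation is visible.
src = build_input(history, max_tokens=8)
tgt = build_target("i work at a restaurant", length=8)
print(src)
print(tgt)
```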

The best validation perplexity during training was 65.7. This is much worse than the language model with GloVe, but I’d guess roughly the same as or perhaps better than the language model without GloVe.

Conclusion : The best model I ended up with was the LSTM language model with GloVe. Here’s a sample conversation with the model :

Enter Your Message: hello how are you
PERSON1 hello how are you
[LanguageModel]: person2 i ‘m doing well . how are you ?
Enter Your Message: im good thanks . I like your shirt .
PERSON1 im good thanks . I like your shirt .
[LanguageModel]: person2 i love to read . i love to read .
Enter Your Message: that is good to know . What are you reading right now ?
PERSON1 that is good to know . What are you reading right now ?
[LanguageModel]: person2 i ‘m going to school for a new job .
Enter Your Message: that is great ! what program are you in ?
PERSON1 that is great ! what program are you in ?
[LanguageModel]: person2 i ‘m a student . i ‘m a student .
Enter Your Message: what are you studying ?
PERSON1 what are you studying ?
[LanguageModel]: person2 i ‘m a student . i ‘m a teacher .
Enter Your Message: how can you be both ?
PERSON1 how can you be both ?
[LanguageModel]: person2 i ‘m not sure i ‘m not sure

It’s clearly not great but seems to give decent answers to some generic questions. Some things I’d like to work on next are using ParlAI’s integration with Facebook Messenger to test the chatbot, and trying a hybrid retrieval and generation based model.

Code :