AI-powered chatbots have been widely adopted by enterprises seeking to streamline their customer service, improve productivity and boost revenue. On e-commerce platforms, chatbots can direct customers to recommended products, track orders, explain how to print a return shipping label, and so on.
Such chatbots, however, struggle with off-target or tangential talk, for example when asked to comment on the latest art or fashion trends. “Meena,” Google AI’s new generative chatbot, has a thing or two to say about that.
Meena is one of a new breed of open-domain chatbots designed to engage in conversations on any topic, and its free and natural conversational abilities are closing the gap with human performance.
Introduced in Google AI’s recent paper Towards a Human-like Open-Domain Chatbot, Meena is built around a seq2seq model with the Evolved Transformer. It was trained on 341GB of text (40B words) mined and filtered from public domain social media conversations. Compared with OpenAI’s language model GPT-2, Meena is 1.7x bigger in model capacity and was trained on 8x more data.
According to the research team, the best Meena model has 2.6B parameters and achieves a test perplexity of 10.2 based on a vocabulary of 8K BPE subwords.
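For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability a model assigns to each token of held-out text, so lower is better. The following is a minimal sketch of that definition only, not the researchers' evaluation code; the function name and the toy probabilities are assumptions made for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sequence, given the probability the model
    assigned to each actual token (hypothetical helper, for illustration)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, in nats.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Exponentiate to get perplexity.
    return math.exp(avg_nll)


# Toy input (made up): a model that is always "choosing" uniformly
# among 8 options assigns probability 1/8 to every token.
uniform_probs = [1 / 8] * 5
print(perplexity(uniform_probs))
```

Read this way, a perplexity of about 10 roughly means the model is, on average, as uncertain as if it were picking uniformly among about ten subword tokens at each step.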
To evaluate Meena’s performance, the researchers proposed a simple human evaluation metric called Sensibleness and Specificity Average (SSA), which considers two fundamental aspects of humanlike conversation: making sense and being specific. The results suggest that the full version of Meena (with a filtering mechanism and tuned decoding) scores 79 percent SSA, a full 23 percentage points higher in absolute SSA than existing state-of-the-art (SOTA) chatbots such as Mitsuku, Cleverbot, XiaoIce, and DialoGPT.
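As the name suggests, SSA can be read as the average of two rates: the share of responses that human raters judge sensible and the share they judge specific. The sketch below illustrates that arithmetic under stated assumptions: the `Label` class, the function name, and the toy labels are hypothetical, and the rule that a response judged not sensible also counts as not specific is an assumption of this sketch rather than something stated in this article.

```python
from dataclasses import dataclass


@dataclass
class Label:
    """One human judgment of a single chatbot response (hypothetical type)."""
    sensible: bool
    specific: bool


def ssa(labels):
    """Return (sensibleness, specificity, SSA) as fractions in [0, 1]."""
    if not labels:
        raise ValueError("need at least one label")
    n = len(labels)
    sensibleness = sum(l.sensible for l in labels) / n
    # Assumption: a response only counts as specific if it is also sensible.
    specificity = sum(l.sensible and l.specific for l in labels) / n
    return sensibleness, specificity, (sensibleness + specificity) / 2


# Toy labels (made up for illustration, not data from the paper).
toy = [
    Label(sensible=True, specific=True),
    Label(sensible=True, specific=False),
    Label(sensible=False, specific=False),
    Label(sensible=True, specific=True),
]
print(ssa(toy))  # (0.75, 0.5, 0.625)
```

Averaging the two rates penalizes a bot that stays safe with vague replies such as "I don't know": such replies can be sensible, but they are not specific.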
Meena is also closing in on humans, whose average SSA score is 86 percent. In a surprising finding, the researchers observed a strong correlation between SSA and perplexity, an automatic metric available to any neural seq2seq model. The experiments demonstrated that the better Meena fit its training data (that is, the lower its perplexity), the more sensible and specific its responses became.
The researchers acknowledge that weaknesses remain in their methodology. For example, the static evaluation dataset is too restricted to capture all aspects and nuances of human conversation.
In future studies, the researchers will explore broadening the metric for humanlikeness while continuing to optimize sensibleness by optimizing test set perplexity and by improving algorithms, architectures, data and compute. They will also consider other attributes such as personality and factuality, with model safety and bias as additional key focus areas.
The paper Towards a Human-like Open-Domain Chatbot is on arXiv. Sample conversations with Meena are on GitHub.
Author: Yuqing Li | Editor: Michael Sarazen