Last month Google and Amazon unveiled mini-sized smart speakers, the Google Home Mini Chalk and the second-generation Amazon Echo Dot, both priced at less than US$50. The virtual assistants inside these discreet tabletop devices can plug into your environment and manage your everyday life, all through a natural human-machine voice interface that used to exist only in science fiction.
When the first chatbot, Eliza, was created by Joseph Weizenbaum at the Massachusetts Institute of Technology in 1966, it could produce appropriate responses, saying for example "I am sorry that you feel depressed." But this functionality involved nothing more than detecting cue words in the input, which triggered pre-programmed responses.
Compared to Eliza, today's smart speakers are significantly more human-like and multi-functional. Microsoft's Cortana is a smart voice assistant known for its sardonic "humor," while Google Home can manage a lighting scheme, order a pizza, or play trivia games with you. Although current smart speakers have not approached the intelligence level of Samantha, the AI operating system in the sci-fi film Her, they are still impressive enough to have prompted over 25 million users to purchase Echo and 5 million users to give Google Home a shot. And this is just the beginning.
Hands-free operation makes smart speakers a user-friendly human-machine interface that opens new possibilities in functionality: this is the interface of the future. But even as we continue to find new things for virtual assistants to do, few understand how they actually do it.
Synced recently sat in on a Silicon Valley roundtable meet-up to discuss the tech inside today’s virtual assistants.
Dr. Junling Hu is the founder and CEO of Question.ai and chair of the AI Frontiers Conference. She introduced six key components in building a smart virtual assistant: speech recognition, speech synthesis, natural language understanding, a dialogue system, a chatbot, and a music recommendation system.
Speech Recognition & Speech Synthesis
A smart speaker must detect and translate human voices into a format readable by computers, aka speech recognition. Researchers struggled with noise for decades before deep learning brought revolutionary changes. In 2012, the deep learning-based AlexNet won the ILSVRC (ImageNet Large-Scale Visual Recognition Challenge) and achieved significant results in image recognition. Deep learning has been successfully applied to speech recognition since 2013.
“Because of deep learning, we are able to do end-to-end speech recognition,” says Dr. Hu.
To support this kind of interaction, smart speakers must perform far-field voice recognition. Echo features a seven-microphone array, enabling it to hear and process voice commands from afar, even in noisy rooms. Echo also applies speaker adaptation to identify the voices of different users.
Smart speakers are awakened by a two-step method called anchored speech detection. First introduced by an Amazon team, it uses distinct recurrent neural networks (RNNs) to recognize the wake-up word and the subsequent user request.
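The two-stage idea can be sketched in a few lines: a cheap, always-on check gates the device, and only after the wake word is heard does the rest of the utterance get forwarded downstream. In this toy version simple string matching stands in for the two RNNs, and the wake word "alexa" is just an example.

```python
from typing import Optional

# Toy sketch of two-stage wake-word handling: a lightweight detector
# watches for the wake word; only then is the remainder of the
# utterance forwarded for full speech recognition.
WAKE_WORD = "alexa"

def detect_wake_word(transcript: str) -> bool:
    """Stage 1: cheap, always-on check for the wake word."""
    return transcript.lower().startswith(WAKE_WORD)

def extract_request(transcript: str) -> str:
    """Stage 2: pass only the words after the wake word downstream."""
    return transcript[len(WAKE_WORD):].strip()

def handle(transcript: str) -> Optional[str]:
    if not detect_wake_word(transcript):
        return None          # device stays asleep
    return extract_request(transcript)
```

Calling `handle("Alexa play jazz")` yields `"play jazz"`, while an utterance without the wake word returns `None` and the device stays asleep.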
Another speech-related technology used by virtual assistants is speech synthesis, which converts words into sounds once a smart speaker decides what to say.
A virtual assistant must first break the text input into graphemes, the written units of the language. Using a lexicon (dictionary), the graphemes are then converted into a phoneme string; a phoneme is the smallest unit of sound that distinguishes one word from another. Once the assistant has the phonemes, it can proceed with prosodic modeling: pitch, length, loudness, intonation, and rhythm. The assistant then sends the phoneme string and prosodic annotation into acoustic synthesis models for conversion into smooth speech.
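The stages above form a pipeline, which can be sketched as plain functions. Everything here is invented toy data: the two-entry lexicon, the phoneme strings, and the prosody rules are placeholders for what a real synthesizer learns or stores.

```python
# Toy text-to-speech pipeline: text -> phonemes (via lexicon) ->
# prosodic annotation -> input for acoustic synthesis.
# The lexicon entries and prosody values are made up for illustration.

LEXICON = {                  # word -> phoneme string (toy entries)
    "hello": "HH AH L OW",
    "world": "W ER L D",
}

def to_phonemes(text: str) -> list:
    """Grapheme-to-phoneme step: look each word up in the lexicon."""
    return [LEXICON[w] for w in text.lower().split() if w in LEXICON]

def add_prosody(phonemes: list) -> list:
    """Prosodic modeling: attach toy pitch/duration annotations."""
    return [{"phonemes": p,
             "pitch": "falling" if i == len(phonemes) - 1 else "level",
             "duration_ms": 300}
            for i, p in enumerate(phonemes)]

def synthesize(text: str) -> list:
    """An acoustic model would turn these frames into a waveform."""
    return add_prosody(to_phonemes(text))
```

Running `synthesize("hello world")` produces annotated phoneme frames, with a falling pitch on the final word, which is the kind of structure an acoustic synthesis model consumes.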
“Our Alexa speaks very well, not like a machine or robot, which is because we’ve researched prosodic modeling,” says Dr. Hu.
Central Component: Natural Language Understanding
One of the steps between speech recognition and speech synthesis is natural language understanding (NLU), which is key to enabling a virtual assistant's reading comprehension. When users say something like "I play the piano," the system does not inherently know what "I" or "piano" refer to. NLU researchers must therefore find methodologies to transform human language into a standard format the machine can process.
A virtual assistant will first pick out proper names such as cities, companies, and song titles. The next step, part-of-speech tagging, classifies words into eight lexical categories: verb, noun, determinative, adjective, adverb, preposition, coordinator, or interjection.
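These two steps can be illustrated with a toy tagger: spot proper names against a small gazetteer, then assign each remaining word a lexical category by dictionary lookup. The gazetteer, tag dictionary, and default tag are all invented for illustration; real systems learn tagging from annotated corpora.

```python
# Toy illustration: proper-name spotting, then part-of-speech lookup.
# The gazetteer and tag dictionary are invented for illustration.

PROPER_NAMES = {"paris", "amazon"}           # cities, companies, titles...
TAG_DICT = {"i": "noun", "play": "verb", "the": "determinative",
            "piano": "noun", "in": "preposition"}

def tag(sentence: str) -> list:
    tagged = []
    for word in sentence.lower().split():
        if word in PROPER_NAMES:
            tagged.append((word, "proper-name"))
        else:
            # Unknown words default to "noun" in this toy version.
            tagged.append((word, TAG_DICT.get(word, "noun")))
    return tagged
```

For "I play the piano in Paris", the tagger marks "paris" as a proper name and assigns the eight-category labels to the rest.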
The last step is parsing, a process that analyzes a string of natural language or computer language symbols in accordance with grammatical rules. However, parsing might not work correctly when users don't follow grammatical rules, so many researchers are now turning away from parsing and looking instead for end-to-end NLU solutions.
A Dialogue System
Virtual assistants require two more steps to initiate a dialogue: detect users’ intentions, and decide how to respond.
The dialogue act step connects user requests (for example, requesting a joke, a song, or pizza delivery) with appropriate system functions. After receiving an input, the virtual assistant captures features such as tense, first word, and predicate, and classifies them into one of the dialogue acts, such as request, statement, yes/no question, or confirmation.
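A toy version of this classification can be written with hand-picked surface features like the first word of the utterance. Real systems learn this mapping from labeled dialogues; the cue-word lists below are invented for illustration.

```python
# Toy dialogue-act classifier using surface features (the first word).
# The cue-word sets are invented; real systems learn them from data.

def dialogue_act(utterance: str) -> str:
    words = utterance.lower().rstrip("?!. ").split()
    first = words[0] if words else ""
    if first in {"do", "does", "is", "are", "can", "will"}:
        return "yes-no-question"
    if first in {"play", "order", "tell", "set"}:
        return "request"
    if first in {"yes", "right", "ok", "okay"}:
        return "confirmation"
    return "statement"
```

So "Play some jazz" is classified as a request, "Do you like pizza?" as a yes/no question, and "I play the piano" falls through to a statement.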
The dialogue policy step follows the dialogue act step and decides what action the system should take next.
As users are inclined to have natural conversations with multiple sessions of interaction with their virtual assistants, a dialogue system needs a state tracker to maintain the current state of the dialogue. This includes the user’s most recent dialogue act and all the information (entire set of slot-filler constraints) users have expressed so far.
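A state tracker can be sketched as a small class that holds the most recent dialogue act and accumulates slot-filler constraints across turns. The slot names ("genre", "artist") are invented examples, not a real system's schema.

```python
# Toy dialogue state tracker: keeps the latest dialogue act and the
# full set of slot-filler constraints expressed so far.
# Slot names are invented for illustration.

class DialogueState:
    def __init__(self):
        self.last_act = None
        self.slots = {}          # accumulated slot-filler constraints

    def update(self, act: str, slots: dict):
        self.last_act = act
        self.slots.update(slots)  # new turns refine, not replace, old ones

state = DialogueState()
state.update("request", {"genre": "jazz"})          # "Play some jazz"
state.update("statement", {"artist": "Miles Davis"})  # "Something by Miles Davis"
```

After the second turn the tracker still remembers the genre constraint from the first, which is exactly what lets multi-turn conversations feel natural.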
Dialogue systems are now being built using reinforcement learning, which is producing virtual assistants that are more astute in responding to user requests.
Chatbot & Recommendation
Clearly, a smart speaker is more than just a dialogue robot; it also requires skills such as a chatbot or music recommendation service to adapt to different users' needs.
A chatbot that is human-like in both text and voice is the key to smooth human-machine interaction. Companies like Amazon and Google are encouraging developers to create device-embedded chatbots on platforms such as Amazon Lex or Google's API.ai.
Music recommendation, another central skill of today's smart speakers, acts as a sort of personalized radio station, leveraging enormous amounts of data and machine learning to play the songs each user is most likely to enjoy. This works in much the same way as product recommendations on Amazon or video recommendations on YouTube.
The music recommendation algorithm learns how to rank songs for a user by using listening features (listen time), user features (income, age, gender, geo-location, etc.), and item features (title, artists, genre, channels, keywords, etc.) as training data.
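Stripped to its essence, such a ranker scores each candidate song from its features and sorts by score. In the sketch below the features and weights are invented for illustration; a real system learns the weights (or a far more complex model) from the training data described above.

```python
# Toy recommendation ranker: score each song as a weighted sum of
# listening/user/item features, then sort by score.
# Feature names and weights are invented for illustration.

WEIGHTS = {"past_listen_time": 0.6, "genre_match": 0.3, "is_new": 0.1}

def score(song: dict) -> float:
    return sum(WEIGHTS[f] * song.get(f, 0.0) for f in WEIGHTS)

def rank(songs: list) -> list:
    return [s["title"] for s in sorted(songs, key=score, reverse=True)]

catalog = [
    {"title": "Song A", "past_listen_time": 0.9, "genre_match": 1.0, "is_new": 0.0},
    {"title": "Song B", "past_listen_time": 0.1, "genre_match": 0.2, "is_new": 1.0},
]
```

Here `rank(catalog)` puts Song A first: its high listen time and genre match outweigh Song B's novelty under these weights.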
In the near future, there won't be much difference between smart speakers in terms of core functionality or performance. The distinction will be based more on the versatility of their skillsets, and both Echo and Google Home are racing to develop and integrate as many skills as they can.
To what extent will tomorrow’s virtual assistants manage our lives? With enough data and the right models there’s no limit:
“Does the Roma restaurant have Moretti on tap?”
“They have Moretti at $5 a pint, on special until 7 p.m. today. But mind your limit, because you are picking Leslie up at 9 p.m. If you have more than two pints I will have to disable your vehicle… And you’ve gained two kilos; do you really want to eat Italian?… Also, the cat brought home a mouse…”
AI Frontiers Conference, a Silicon Valley-based AI conference that gathers leading figures in artificial intelligence, will be held this Friday at Santa Clara Convention Center.
Hurry and sign up at www.aifrontiers.com
Journalist: Tony Peng | Editor: Michael Sarazen