Statistical Spoken Dialogue Systems and the Challenges for Machine Learning

This talk describes the architecture of a spoken dialog system and explains its three main components (understanding, generation, and dialog management), along with the challenges each poses for machine learning.

Slides: — Steve Young




Spoken Dialog System

“Spoken dialog system” is a label that covers a wide range of systems, from simple weather information systems (“say the name of your city”) to complex problem-solving and reasoning applications. What distinguishes simple systems from complex ones is whether they can handle spontaneous speech. There are two prototypical classes of dialog systems, as shown below.

Figure 1: Two prototypical classes of the spoken dialog system.

The best-known command-based system is probably Siri, which answers simple questions according to pre-defined rules. Most current dialog systems belong to this class. The dialog system architecture is a chain of processes: the system takes a user utterance as input and generates a system utterance as output.

Figure 2: The dialog system architecture.


Automatic Speech Recognition (ASR) takes a user’s spoken utterance and transforms it into a textual hypothesis of the utterance. Then, Natural Language Understanding (NLU) parses the hypothesis and generates a semantic representation of the utterance.
A CNN (Convolutional Neural Network), a key component of understanding, is widely used to extract lexical features from the textual hypothesis. It scans each utterance with convolution windows spanning 1, 2, 3, 4, or more words.

Figure 3: Using CNN to extract lexical features.
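To make the window-scanning idea concrete, here is a minimal sketch of n-gram convolution followed by max-over-time pooling, in plain Python. The function name `ngram_conv_features`, the toy embeddings, the random filters, and the ReLU non-linearity are all assumptions made for illustration; the talk does not specify the exact network.

```python
import random

def ngram_conv_features(embeddings, filters_by_width):
    """Slide each filter over every window of `width` consecutive word
    vectors and keep the largest response (max-over-time pooling)."""
    features = []
    for width, filters in sorted(filters_by_width.items()):
        for filt in filters:  # filt is a list of `width` weight vectors
            best = 0.0  # ReLU floor; also the value when no window fits
            for start in range(len(embeddings) - width + 1):
                window = embeddings[start:start + width]
                response = sum(
                    w * x
                    for weight_vec, word_vec in zip(filt, window)
                    for w, x in zip(weight_vec, word_vec)
                )
                best = max(best, response)
            features.append(best)
    return features

# Toy usage: random 3-dimensional "embeddings" stand in for trained ones.
rng = random.Random(0)
dim = 3
utterance = ["i", "want", "cheap", "thai", "food"]
embeddings = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in utterance]
filters_by_width = {
    width: [
        [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(width)]
        for _ in range(2)  # two hypothetical filters per window width
    ]
    for width in (1, 2, 3, 4)
}
features = ngram_conv_features(embeddings, filters_by_width)
print(len(features))  # 4 window widths x 2 filters = 8 features
```

Because each filter contributes a single pooled value, the feature vector has a fixed length regardless of how many words the utterance contains, which is what lets a downstream classifier consume utterances of varying length.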


Natural Language Generation (NLG) takes the semantic representation of a communicative act from the system and generates a textual representation, possibly with prosodic markup, that is then synthesized by a speech synthesizer.
Most dialog systems use actions to decide which sentences to generate. These actions are often abstract, so they must be converted to natural language; this is done by delexicalising the training data and training a Semantically Conditioned Long Short-Term Memory network (SC-LSTM).

Figure 4: Using LSTM to train natural language sentences.
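Delexicalisation itself is easy to illustrate. The sketch below replaces slot values with placeholder tokens before training and puts them back after generation. The `SLOT_<NAME>` token format, the function names, and the example sentence are assumptions for illustration only; real systems typically match values against a domain ontology rather than by plain string replacement.

```python
def delexicalise(sentence, slots):
    """Replace each slot value with a placeholder token.
    Longer values go first so that a short value cannot clobber part
    of a longer one."""
    for slot, value in sorted(slots.items(), key=lambda kv: -len(kv[1])):
        sentence = sentence.replace(value, f"SLOT_{slot.upper()}")
    return sentence

def relexicalise(template, slots):
    """Fill placeholder tokens back in with concrete slot values.
    Longer slot names go first so SLOT_NAME cannot match inside a
    longer token that merely starts with it."""
    for slot in sorted(slots, key=len, reverse=True):
        template = template.replace(f"SLOT_{slot.upper()}", slots[slot])
    return template

slots = {"name": "Seven Days", "food": "Chinese"}
template = delexicalise("Seven Days is a nice Chinese restaurant", slots)
print(template)
print(relexicalise(template, {"name": "Golden House", "food": "Thai"}))
```

Training on templates rather than raw sentences means the generator only has to learn sentence structure once per pattern, instead of once per restaurant name or food type.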


The dialog manager is the key component of a spoken dialog system: it decides how to respond, mapping concepts to actions. The belief state b encodes the state of the dialog, including all relevant history, and b is updated at every turn of the dialog. The policy π determines the best action to take at each turn via a mapping from the belief state b to actions a. Every dialog ends with a reward: positive for success, negative for failure, plus a small negative reward for every turn to encourage brevity. Reinforcement learning is used to find the best policy.

Figure 5: Dialog manager maps belief state to actions.
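The two quantities described above, the per-turn belief update and the end-of-dialog reward, can be sketched in a few lines. This is a simplified illustration, not the system from the talk: the belief update assumes a single slot whose true value does not change during the dialog, and the reward magnitudes (20 for success, -20 for failure, -1 per turn) are hypothetical values chosen only to match the signs given in the text.

```python
def update_belief(belief, likelihood):
    """Bayesian update b'(v) proportional to P(observation | v) * b(v)
    for one slot, assuming the user's goal stays fixed."""
    unnormalised = {v: belief[v] * likelihood.get(v, 0.0) for v in belief}
    total = sum(unnormalised.values())
    if total == 0.0:
        return dict(belief)  # observation ruled out everything; keep prior
    return {v: p / total for v, p in unnormalised.items()}

def dialog_return(num_turns, success,
                  success_reward=20.0, failure_reward=-20.0,
                  turn_penalty=1.0):
    """Final reward for success or failure, minus a small cost per turn."""
    final = success_reward if success else failure_reward
    return final - turn_penalty * num_turns

belief = {"thai": 0.5, "chinese": 0.5}
# Hypothetical confidence scores derived from a noisy ASR/NLU hypothesis.
belief = update_belief(belief, {"thai": 0.8, "chinese": 0.2})
print(belief)
print(dialog_return(num_turns=5, success=True))
print(dialog_return(num_turns=5, success=False))
```

The per-turn penalty is what makes a short successful dialog score higher than a long one, so a policy that maximises expected return is pushed toward brevity.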

  • Policy Representation

There are two ways to represent the policy:

Gaussian Processes: data-efficient, and they provide an explicit confidence on the Q-value. They can support large n, but the size of the action space |A| is limited.
Deep Neural Networks: scale well in both n and |A|, but have no built-in confidence measure and poor convergence properties.
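The "explicit confidence" of a Gaussian Process can be seen in a small example. The sketch below is standard GP regression with an RBF kernel, written in plain Python; it is not the GP-based reinforcement learning algorithm used in the talk, and the function names, kernel choice, noise level, and one-dimensional toy features standing in for (belief, action) pairs are all assumptions.

```python
import math

def rbf(x, y, length_scale=1.0):
    """Squared-exponential kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def gp_q_estimate(train_x, train_q, query_x, noise=0.1):
    """Posterior mean and variance of Q at query_x given observed Q-values."""
    K = [[rbf(a, b) + (noise ** 2 if i == j else 0.0)
          for j, b in enumerate(train_x)]
         for i, a in enumerate(train_x)]
    k_star = [rbf(a, query_x) for a in train_x]
    alpha = solve(K, train_q)   # (K + noise^2 I)^-1 q
    v = solve(K, k_star)        # (K + noise^2 I)^-1 k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(query_x, query_x) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)

train_x = [[0.0], [1.0]]   # toy features of visited (belief, action) pairs
train_q = [1.0, 2.0]       # hypothetical observed Q-values
near = gp_q_estimate(train_x, train_q, [0.0])
far = gp_q_estimate(train_x, train_q, [10.0])
print(near, far)
```

Close to previously visited points the variance is small, while far from them it returns to the prior variance. That variance is the signal a GP policy can use to decide where to explore, and it is what a plain DNN policy lacks.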

  • Training

The use of reinforcement learning improves the performance of both the GP policy and the DNN policy in both interactive settings, especially under higher-noise conditions.

Figure 6: Performance improved by using reinforcement learning.

  • Curse of Dimensionality and Domain Complexity

Handling multiple domains is a complex problem for spoken dialog systems.


Figure 7: Multi-domain and its complexity.

In multi-domain spoken dialog systems, training multiple domains in parallel can achieve better results than training each one in isolation.
Domain complexity is a related problem: getting the time information shown in Fig. 8 involves a time sub-domain nested inside the calendar domain. Hierarchical deep reinforcement learning is used to handle this case.

Figure 8: Getting time in calendar domain.


Figure 9: Hierarchical deep reinforcement learning.
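The control flow of a hierarchical policy can be sketched with the calendar example: a top-level controller picks a sub-goal (such as the time sub-domain) and a low-level controller then picks primitive actions until that sub-goal is complete. In the sketch below, hand-written rules stand in for the two learned policies, and the slot names, the `ask_` action naming, and the simulated user are all hypothetical.

```python
# Hypothetical calendar domain: each sub-goal owns some slots to fill.
SLOTS_FOR_GOAL = {"date": ["day"], "time": ["hour", "minute"]}

def goal_done(state, goal):
    return all(state[slot] is not None for slot in SLOTS_FOR_GOAL[goal])

def choose_subgoal(state, pending):
    """Top-level policy stand-in: take the first unfinished sub-goal."""
    return pending[0]

def choose_action(state, goal):
    """Low-level policy stand-in: ask for the first empty slot."""
    for slot in SLOTS_FOR_GOAL[goal]:
        if state[slot] is None:
            return "ask_" + slot

def apply_action(state, action, simulated_user):
    """Environment stand-in: the simulated user answers the question."""
    slot = action[len("ask_"):]
    new_state = dict(state)
    new_state[slot] = simulated_user[slot]
    return new_state

def run_episode(state, simulated_user, max_steps=20):
    trace = []
    while len(trace) < max_steps:
        pending = [g for g in SLOTS_FOR_GOAL if not goal_done(state, g)]
        if not pending:
            break
        goal = choose_subgoal(state, pending)           # top level
        while not goal_done(state, goal) and len(trace) < max_steps:
            action = choose_action(state, goal)         # low level
            state = apply_action(state, action, simulated_user)
            trace.append((goal, action))
    return state, trace

start = {"day": None, "hour": None, "minute": None}
user = {"day": "Friday", "hour": 10, "minute": 30}
final_state, trace = run_episode(start, user)
print(trace)
```

In a learned version, both `choose_subgoal` and `choose_action` would be replaced by trained value functions, with the low-level one rewarded for completing whichever sub-goal it was handed.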

  • Measuring Success

Task success is not always obvious, owing to prediction errors. On-line reward estimation uses an LSTM to encode the dialog and a GP-based reward estimator to determine the best reward signal. It achieves the best policy optimization results, especially as the number of training dialogs grows.

Figure 10: On-line reward estimation.
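One way to read the on-line scheme is as an active-learning loop: trust the estimator when it is confident and ask the user only when it is not. The sketch below shows just that decision rule. The function name, the `margin` threshold, and the use of distance from 0.5 as the uncertainty measure are assumptions for illustration; the estimator itself (LSTM encoder plus GP) is represented only by its output probability.

```python
def choose_success_label(p_success, ask_user, margin=0.35):
    """Return (label, queried_user).
    p_success is the estimator's predicted probability that the dialog
    succeeded. If it lies within `margin` of 0.5 the estimate is treated
    as uncertain and the user is asked; otherwise the prediction is used."""
    if abs(p_success - 0.5) < margin:
        return ask_user(), True
    return p_success >= 0.5, False

# Hypothetical user who reports that the task succeeded.
print(choose_success_label(0.95, ask_user=lambda: True))
print(choose_success_label(0.55, ask_user=lambda: True))
print(choose_success_label(0.05, ask_user=lambda: True))
```

As the estimator sees more labelled dialogs its predictions move away from 0.5 more often, so the user is interrupted less and less while the policy keeps receiving a reward signal.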


  1. POMDPs and Reinforcement Learning provide a powerful mathematical framework for decision making in intelligent conversational agents.
  2. DNNs provide a flexible building block for all stages of the dialog system pipeline, though training is often problematic.
  3. Unrestricted conversation is challenging but there are several promising approaches to managing complexity.
  4. For commercially deployed systems, the user is a tremendous untapped resource, and Reinforcement Learning provides the framework for exploiting it.


  1. P-H. Su et al (2016). “Continuously Learning Neural Dialogue Management.” arXiv:1606.02689.
  2. M. Gasic et al (2015). “Policy Committee for Adaptation in Multi-domain Spoken Dialogue Systems.” IEEE ASRU 2015, Scottsdale, AZ.
  3. T. Kulkarni et al (2016). “Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation.” arXiv:1604.06057.
  4. P-H. Su et al (2016). “On-line Active Reward Learning for Policy Optimisation in Spoken Dialogue Systems.” ACL 2016, Berlin.


Analyst: Oscar Li | Localized by Synced Global Team: Xiang Chen
