This paper introduces ERICA, an autonomous android system capable of conversational interaction. It features advanced sensing and speech synthesis technologies, and is arguably the most human-like android built to date.
The extreme human-like qualities of ERICA stem from her visual design, facial expressiveness, and highly expressive speech synthesizer. Her sensing technologies are some of the most capable to date, with high-performance speech recognition, the ability to discriminate between sound sources using microphone arrays, and precise tracking of people’s locations and movements.
The ultimate goal for ERICA is to communicate with people in a convincingly human-like way in face-to-face interactions.
Figure 1: Photograph of ERICA.
Background of Current Androids
In recent years, androids have become increasingly visible in both research and the popular media. Android replicas of celebrities and individuals are appearing in the news, and androids are depicted in film and television living and working alongside people in daily life. However, today’s androids are often very limited in their ability to conduct autonomous conversational interactions. Currently, androids can be classified into the following categories:
1. Non-humanoid robots and virtual agents
Highly realistic virtual agents that can interact conversationally have already been created. The Virtual Human Toolkit provides a set of tools for dialog and character design for photorealistic animated graphical avatars. Furhat is a robot that aims to bridge the gap between 2D and 3D: with a movable head and a back-projected face, it is capable of a wide range of facial expressions.
2. Humanoid robots
Several humanoid robots with varying degrees of anthropomorphism have been developed. They are human-like enough to conduct interesting interactions using natural gestures and other social cues. These robots often take mechanical, animal, cartoon, or abstract forms. Leonardo  is an example of a highly-expressive robot designed for human interaction research. Aldebaran’s Nao robot (https://www.aldebaran.com/en/humanoid-robot/nao-robot) is a widely-used platform for human-robot interaction research, and their recently released robot Pepper (https://www.aldebaran.com/en/a-robots/who-is-pepper) is another promising platform for rich interactive human-robot communication.
3. Androids
A variety of lifelike androids have already been developed. Hanson Robotics has produced many highly-expressive human head robots, such as the PKD robot, BINA48, Han, and Jules, some of which have been placed on bodies. These robots exhibit advanced AI techniques and highly articulated facial expressions, but they often look robotic, with metallic parts or exposed wires, and generally lack expressive speech synthesis. The Geminoid series of androids also features a highly human-like appearance and expressiveness.
Here we introduce the ERICA platform architecture:
1. Hardware and Actuation
The mechanical and aesthetic design of ERICA was developed together with the android manufacturer A-Lab (http://www.a-lab-japan.co.jp/en/).
Her facial feature proportions were determined based on principles of beauty theory used in cosmetic surgery, such as the ideal angles and ratios of the so-called “Venus line”, or Baum ratio, which defines the angle of projection of the nose, and the “1/3 rule”, which specifies equal vertical spacing between the chin, nose, eyebrows, and hairline.
ERICA’s body has 44 degrees of freedom (DOF), depicted in Fig. 2, of which 19 are controllable. The skeletal body axes shown in black in Fig. 2 (right) are actuated.
Figure 2: Degrees of freedom in ERICA. Left: Facial degrees of freedom. Right: Skeletal degrees of freedom. Joints marked in black are active joints, and joints drawn in white are passive.
2. Speech Synthesis
ERICA’s speech synthesis is performed using a custom voice designed for Hoya’s VoiceText software (http://voicetext.jp/). Default rendering of most sentences is typically smooth, with intonation determined by grammar. Manual specification of pitch, speed, and intensity is also possible. The generated audio signal from the speech synthesizer is sent back to the robot to generate lip sync and body rhythm behaviors, as shown in Fig. 3.
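The paper does not detail how lip sync is derived from the synthesized audio. One common approach is to drive jaw opening from the short-time amplitude envelope of the speech signal; the sketch below illustrates that idea (the function and parameters are illustrative, not ERICA’s actual implementation):

```python
import numpy as np

def lip_sync_envelope(audio, sample_rate=16000, frame_ms=20, max_open=1.0):
    """Map the short-time RMS energy of a speech signal to a jaw-opening
    value in [0, max_open] per frame (illustrative sketch only)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    openings = []
    for i in range(n_frames):
        frame = audio[i * frame_len:(i + 1) * frame_len]
        openings.append(np.sqrt(np.mean(frame ** 2)))  # RMS per frame
    openings = np.array(openings)
    peak = openings.max()
    if peak > 0:
        openings = openings / peak * max_open  # normalize to actuator range
    return openings

# Example: one second of "speech" (noise) followed by one second of silence
rng = np.random.default_rng(0)
audio = np.concatenate([rng.normal(0, 0.5, 16000), np.zeros(16000)])
jaw = lip_sync_envelope(audio)  # jaw opens during speech, closes in silence
```

In a real system each frame’s opening value would be streamed to the jaw actuator in time with audio playback, which is presumably what the feedback path in Fig. 3 provides.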
3. Sensors

ERICA currently uses external sensors on a wired network for human position tracking, sound source localization, and recognition of speech and prosodic information. The elements of the sensing framework are shown on the left side of Fig. 3.
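The article does not specify which localization algorithm the microphone arrays use. A standard technique for estimating a sound source’s direction from a microphone pair is GCC-PHAT time-delay estimation, sketched here as an assumption about how such a component could work:

```python
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    """Estimate the delay (in seconds) of `sig` relative to `ref` using
    GCC-PHAT cross-correlation (a standard array-processing technique,
    not necessarily the one used in the ERICA system)."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.maximum(np.abs(R), 1e-12)  # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs

def arrival_angle(delay, mic_distance, speed_of_sound=343.0):
    """Convert an inter-microphone delay into a direction-of-arrival angle."""
    x = np.clip(delay * speed_of_sound / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(x))

# Example: a source whose sound reaches mic 1 five samples after mic 0
fs = 16000
rng = np.random.default_rng(1)
src = rng.normal(size=2048)
mic0 = src
mic1 = np.concatenate([np.zeros(5), src[:-5]])  # 5-sample lag
delay = gcc_phat_delay(mic1, mic0, fs)
angle = arrival_angle(delay, mic_distance=0.2)
```

With several such pairwise delays, a full array can triangulate a source position rather than just a bearing.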
Figure 3: System diagram illustrating sensor inputs, internal control logic, and interaction with speech synthesis and motion generation.
4. Control Architecture
The software architecture for the ERICA platform combines a memory model, a set of behavior modules for generating dynamic movements, and a flexible software infrastructure supporting dialog management. The center area of Fig. 3 illustrates the core elements of the interaction logic.
Public Demonstration

In the public demonstration, members of the press and the public were invited on stage to direct questions at ERICA or the researchers using a wireless microphone, as shown in the photo in Fig. 4.
Figure 4: Photo of the public demonstration.
A list of 30 topics was shown on a projection screen, and visitors took turns asking ERICA about those topics. After responding to each question, ERICA asked a question in return, based on the dialog state history. For example (translated from Japanese):
Visitor: How old are you?
ERICA: I’m 23 years old. Even though I was just built, please don’t call me 0 years old. (laughs)
ERICA: Do you think I look older?
Visitor: Yes, I think so.
ERICA: (giggles and smiles) Thanks! People always think I look younger, so I’m happy to hear that.
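The turn-taking pattern above, answering a topic question and then posing a return question while tracking which topics have been discussed, can be sketched as a minimal dialog manager. The topics and phrasings here are illustrative placeholders, not ERICA’s actual content:

```python
# Minimal sketch of topic-based Q&A with a state-tracked return question,
# in the spirit of the demonstration. Data below is illustrative only.
TOPICS = {
    "age": {
        "answer": "I'm 23 years old.",
        "follow_up": "Do you think I look older?",
    },
    "hometown": {
        "answer": "I was built in a robotics lab.",
        "follow_up": "Where are you from?",
    },
}

class DialogManager:
    def __init__(self, topics):
        self.topics = topics
        self.history = []  # dialog state: topics already discussed

    def respond(self, topic):
        entry = self.topics.get(topic)
        if entry is None:
            return "I don't know about that yet."
        self.history.append(topic)
        # Answer first, then ask a return question on the same topic.
        return f"{entry['answer']} {entry['follow_up']}"

dm = DialogManager(TOPICS)
reply = dm.respond("age")  # answers, then asks the visitor a question back
```

The history list stands in for the “dialog state history” the article mentions; a fuller system would use it to avoid repeated follow-ups and to condition later answers.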
ERICA also responded to utterances of the researchers and the MC at different times in the demonstration. The visitor, the MC, and the two researchers each had separate microphones, and each microphone was independently processed for speech recognition and prosodic information. This enabled ERICA to respond to each person in an appropriate way. For example:
Researcher: (Turns to ERICA after answering a visitor’s question). ERICA, you’re the greatest robot ever, aren’t you?
ERICA: (Turns to the researcher and smiles) Yes! (Then, after a short pause, makes a worried expression) Well… actually, we’ll see. That depends on how well my researchers program me.
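Because each microphone was processed independently, the system always knew which person was speaking and could choose a role-appropriate response, as in the exchange above. A sketch of that routing logic (the roles and strategies are illustrative assumptions, not the demonstration’s actual code):

```python
# Sketch of routing independently recognized utterances by microphone so
# that the speaker's role informs the response. Illustrative only.
MIC_ROLES = {0: "visitor", 1: "mc", 2: "researcher", 3: "researcher"}

def handle_utterance(mic_id, text):
    """Tag a recognized utterance with its speaker's role and pick a
    role-appropriate response strategy."""
    role = MIC_ROLES.get(mic_id, "unknown")
    if role == "researcher":
        strategy = "playful"      # e.g. banter with the research team
    elif role == "mc":
        strategy = "concise"      # keep the show moving
    else:
        strategy = "informative"  # answer the visitor's question fully
    return {"role": role, "text": text, "strategy": strategy}

event = handle_utterance(2, "ERICA, you're the greatest robot ever, aren't you?")
```

Keeping one recognizer per microphone also avoids cross-talk: each channel’s speech and prosody are attributed unambiguously to one speaker.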
Achievements and Future Work
At least one news agency (http://mashable.com/2015/08/12/erica-android-japan/) reported on the demonstration with the headline, “Japan’s Erica android isn’t as creepy as other talking robots.” In the future, greater full-body expressiveness and posing will be necessary.
The naturalness and expressiveness of the speech synthesis are quite satisfying. In the future, utterances will be generated along with accompanying gestures and expressions.
1. Explicit Expressions and Gestures
ERICA uses subtle, human-like facial expressions. With ERICA’s hardware configuration, it would be difficult to create very dramatic expressions. But for everyday tasks, subtle expressions would likely be more useful, especially given the modest level of expressiveness in Japanese culture.
2. Implicit Behaviors
During ERICA’s interactions, implicit behavior modules were used to actuate breathing, blinking, gaze, speaking rhythm, and backchannel nodding. In the future, the modules will be improved and formalized for a variety of new implicit behaviors, such as motion control for laughter, unconscious fidgeting, and methods of expressing emotion implicitly through adjustments of gaze and body movement.
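Implicit behaviors like breathing and blinking lend themselves to small independent modules that each emit a joint command for the current time. The sketch below shows one plausible structure; the timing constants are guesses, not ERICA’s actual parameters:

```python
import math
import random

# Sketch of independent "implicit behavior" modules, each queried for a
# joint command at time t. Constants are plausible guesses only.
class BreathingModule:
    def __init__(self, period_s=4.0, amplitude=0.1):
        self.period_s = period_s
        self.amplitude = amplitude

    def command(self, t):
        # Slow sinusoidal chest motion.
        return self.amplitude * math.sin(2 * math.pi * t / self.period_s)

class BlinkModule:
    def __init__(self, mean_interval_s=4.0, blink_s=0.15, seed=0):
        self.rng = random.Random(seed)
        self.blink_s = blink_s
        self.mean = mean_interval_s
        self.next_blink = self.rng.expovariate(1.0 / self.mean)

    def command(self, t):
        # Eyelid closure: 1.0 during a blink, else 0.0. Blink onsets are
        # drawn from an exponential distribution so they look unplanned.
        if t >= self.next_blink:
            if t < self.next_blink + self.blink_s:
                return 1.0
            self.next_blink = t + self.rng.expovariate(1.0 / self.mean)
        return 0.0

breathing, blink = BreathingModule(), BlinkModule()
chest = [breathing.command(t / 10.0) for t in range(40)]  # 4 s sampled at 10 Hz
```

Running such modules concurrently and blending their outputs with explicit gestures is one way to keep the android from ever appearing frozen between utterances.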
3. Multimodal Perception
The capabilities of ERICA’s sensor network were sufficient for this demonstration. In the future, paralinguistic information conveyed by speech will also be collected, which will require robust prosodic feature extraction in noisy environments.
4. Desire and Intention
Currently, ERICA’s application logic is all manually crafted as sequences of utterances. In the future, visual tools such as Interaction Composer will be incorporated to assist the process of interaction design. Eventually, it will be necessary to generate behavior based on representations of semantic meaning and of the robot’s desires and intentions.
ERICA is the most human-like android today thanks to her visual design, facial expressiveness, and highly expressive speech synthesizer. Her sensing technologies are some of the most capable to date, with high-performance speech recognition, the ability to discriminate between sound sources using microphone arrays, and precise tracking of people’s locations and movements. This work will help provide insight into what is possible given the current state of the art and help identify key issues, allowing researchers to understand the next steps on the path to creating truly human-like androids.
- A. Hartholt, D. Traum, S. C. Marsella, A. Shapiro, G. Stratou, A. Leuski, L.-P. Morency, and J. Gratch, “All together now: Introducing the Virtual Human Toolkit,” in Intelligent Virtual Agents, 2013, pp. 368-381.
- S. Al Moubayed, J. Beskow, G. Skantze, and B. Granström, “Furhat: a back-projected human-like robot head for multiparty human-machine interaction,” in Cognitive Behavioural Systems, ed: Springer, 2012, pp. 114-130.
- C. Breazeal, A. Brooks, J. Gray, G. Hoffman, C. Kidd, H. Lee, J. Lieberman, A. Lockerd, and D. Mulanda, “Humanoid robots as cooperative partners for people,” Int. Journal of Humanoid Robots, vol. 1, pp. 1-34, 2004.
- D. Hanson, A. Olney, S. Prilliman, E. Mathews, M. Zielke, D. Hammons, R. Fernandez, and H. Stephanou, “Upending the uncanny valley,” in Proceedings of the national conference on artificial intelligence, 2005, p. 1728.
- S. Nishio, H. Ishiguro, and N. Hagita, Geminoid: Teleoperated android of an existing person: INTECH Open Access Publisher Vienna, 2007.
- C. Becker-Asano and H. Ishiguro, “Evaluating facial displays of emotion for the android robot Geminoid F,” in Affective Computational Intelligence (WACI), 2011 IEEE Workshop on, 2011, pp. 1-8.
- P. M. Prendergast, “Facial proportions,” in Advanced Surgical Facial Rejuvenation, ed: Springer, 2012, pp. 15-22.
- D. F. Glas, S. Satake, T. Kanda, and N. Hagita, “An Interaction Design Framework for Social Robots,” in Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, 2011.
Analyst: Oscar Li | Editor: Joni Zhong | Localized by Synced Global Team: Xiang Chen