The development of powerful pretrained language models and their virtually unlimited potential for application and commercialization have made natural language processing (NLP) one of the hottest research areas in machine learning. Breakthroughs in hardware and data continue to drive performance leaps and leaderboard races across industry and academia.
This would seem an apt time to pause and reflect on the direction of NLP, and to explore how language is studied in the broader AI, cognitive science, and linguistics communities. In the new paper Experience Grounds Language, researchers “posit that the universes of knowledge and experience available to NLP models can be defined by successively larger world scopes: from a single corpus to a fully embodied and social context.” The distinguished group of researchers — including Turing Award winner Yoshua Bengio — hails from Carnegie Mellon University, University of Washington, MIT, Mila, University of Michigan, University of Edinburgh, DeepMind, University of Southern California, Semantic Machines, and MetaOptimize.
The limitations of corpora in covering language and experience have been recognized at various phases in the history of NLP research. In this study, the researchers define the knowledge and experience available to NLP models using what they call “World Scopes,” of which there are five levels: WS1 is the Corpus (our past), WS2 the Internet (our present), WS3 Perception, WS4 Embodiment, and WS5 the Social world.
Corpora have long been the critical component of computer-aided, data-driven language research. The Penn Treebank, for example, is a large annotated corpus comprising over 4.5 million words of American English. It was introduced in 1993 to study representations, on the belief that text and spoken language understanding could be improved by automatically extracting information about language from very large corpora.
Fast-forward to the first decade of the 21st century: new NLP tasks were introduced, and large web crawls became viable. As the researchers note, “we are no longer constrained to a single author or source, and the temptation for NLP is to believe everything that needs knowing can be learned from the written world.” With NLP corpora expanded to include large web crawls (WS2), deep models for learning transferable representations have advanced on a number of NLP benchmarks.
Unlike dictionaries, which define words in terms of other words, human beings understand the essential meanings of many basic concepts directly. For example, we learn what “heavy” and “soft” mean by physically interacting with objects. While current state-of-the-art pretrained language models can generate coherent paragraphs of text, their word and sentence representations often fail to capture such grounded features of words. This is where WS3 (Perception) can help, as this level can include auditory, tactile, and visual inputs.
The researchers propose the nuanced and contextual question “Is an orange more like a baseball or a banana?” to demonstrate World Scope levels, suggesting for example that WS3 can “begin to understand the relative deformability of these objects, but is likely to confuse how much force is necessary given that baseballs are used much more roughly than oranges in widely distributed media.”
Thanks to our rich representation of concepts derived from perception, human beings can approach the question by simply drawing on known facts — that an orange and a baseball share a similar shape, size, and weight; that both oranges and bananas are edible; and so on. The richness of our experience and knowledge might not be communicable through language alone, but it is essential to understanding language. This is what the WS4 (Embodiment) level aims to capture: “This intuitive knowledge could be acquired by embodied agents interacting with their environment, even before language words are grounded to meanings.”
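The effect of representation on such similarity judgments can be sketched with a toy experiment. The feature vectors below are entirely hand-crafted assumptions for illustration — they come from no real model and are not the paper's method — but they show how the answer to “Is an orange more like a baseball or a banana?” flips depending on which grounded features (shape, size, edibility, deformability) the representation encodes:

```python
import math

# Hypothetical, hand-picked feature vectors (illustrative assumptions only).
# Dimensions: [round, graspable_size, edible, fruit, deformable]
vectors = {
    "orange":   [1.0, 1.0, 1.0, 1.0, 0.6],
    "baseball": [1.0, 1.0, 0.0, 0.0, 0.1],
    "banana":   [0.0, 1.0, 1.0, 1.0, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_ball = cosine(vectors["orange"], vectors["baseball"])
sim_banana = cosine(vectors["orange"], vectors["banana"])
print(f"orange ~ baseball: {sim_ball:.3f}")
print(f"orange ~ banana:   {sim_banana:.3f}")
```

Under these particular made-up features the orange lands closer to the banana; drop the edibility and taxonomy dimensions and it would land closer to the baseball — which is precisely why a model trained only on text, without perceptual or embodied grounding, may weight these dimensions very differently than a human would.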
Finally, though, a trained agent needs to be tested. But how? The researchers stress the ultimate importance of coming to understand social context, which includes status, role, intent, and countless other variables. They believe AI agents will not be able to attain social intelligence with only a fixed corpus to draw from. Instead, social interaction will provide a crucial signal for the agents, who will learn by participating.
The study describes an interesting roadmap that tracks the journey of NLP research so far, and points to the contextualization of language in human experiences as a target for future studies.
The paper Experience Grounds Language is on arXiv.
Journalist: Fangyu Cai | Editor: Michael Sarazen