In the ever-expanding world of computer hardware and software, benchmarks provide a robust method for comparing quality and performance across different system architectures. From MNIST to ImageNet to GLUE, benchmarks have also come to play a hugely important role in driving and measuring progress in AI research.
When introducing any new benchmark, it’s generally best not to make it so easy that it will quickly become outdated, or so hard that everyone will simply fail. When new models bury benchmarks, which is happening faster and faster in AI these days, researchers must engage in the time-consuming work of making new ones. Facebook believes that the increasing benchmark saturation in recent years — especially in natural language processing (NLP) — means it’s time to “radically rethink the way AI researchers do benchmarking and to break free of the limitations of static benchmarks.”
Their solution is a new research platform for dynamic data collection and benchmarking called Dynabench, which they propose will offer a more accurate and sustainable way for evaluating progress in AI.
In a blog post, the Facebook researchers identify a number of other problems associated with static benchmarks. They may for example contain inadvertent biases or annotation artifacts, and they may encourage or even force the research community to focus too much on one specific metric or task.
The Dynabench approach puts both humans and state-of-the-art (SOTA) AI models “in the loop” to create challenging new datasets and to measure, for example, how often models make mistakes when humans attempt to fool them. By adapting to model responses, Dynabench can challenge these models in ways that a static test cannot. Consider a student who has merely memorized a large set of facts: that strategy might ace a written exam, but it would be less effective against the probing, unanticipated questions of an oral exam. This is how Dynabench aims to challenge models.
The researchers say anyone can use the platform to find or validate model-fooling examples, and they hope to also provide linguists and other experts in related fields with the tools they need to discover weaknesses in AI systems. Crowd workers will be connected to the platform via Mephisto to generate additional data and to validate examples in the same manner.
As researchers, engineers, computational linguists, and others use Dynabench to gauge the performance of their models, the platform will track which examples fool models and lead to incorrect predictions. This “dynamic adversarial data collection” procedure can be used to improve systems and will contribute to new, more challenging Dynabench datasets and training data for the next generation of models. These new models can in turn be benchmarked with Dynabench, creating a “virtuous cycle” of AI research progress.
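To make the idea concrete, the dynamic adversarial collection loop might be sketched roughly as follows. This is a minimal illustration only; every name here (`predict`, `human_writes_example`, `human_validates`) is a hypothetical placeholder, not the actual Dynabench or Mephisto API:

```python
# Hedged sketch of dynamic adversarial data collection:
# annotators craft inputs, the current model predicts, and
# verified model-fooling examples are kept for the next round.
# All function names are hypothetical placeholders.

def collect_adversarial_round(predict, human_writes_example,
                              human_validates, num_attempts=100):
    """Collect examples that fool the current model.

    predict(text) -> label                        : model under test
    human_writes_example() -> (text, true_label)  : annotator crafts an input
    human_validates(text, label) -> bool          : second annotator confirms
    """
    fooling_examples = []
    for _ in range(num_attempts):
        text, true_label = human_writes_example()
        if predict(text) != true_label and human_validates(text, true_label):
            # The model was fooled on a verified example: keep it for a
            # harder benchmark round and for retraining the next model.
            fooling_examples.append((text, true_label))
    model_error_rate = len(fooling_examples) / num_attempts
    return fooling_examples, model_error_rate
```

Retraining on the collected examples and then running another round is what produces the “virtuous cycle” the researchers describe: each round yields a harder dataset and, ideally, a more robust model.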
The debut Dynabench iteration centers on four official core tasks, focused on NLP because it currently suffers most from rapid benchmark saturation. There will be multiple rounds of evaluation for each task. Facebook has partnered with academic researchers from UNC–Chapel Hill, UCL, and Stanford, who will each be the “owner” of a particular task.
Although Dynabench is currently English-only and focused on language and text, the researchers are open to expanding to other languages and are very interested in opening the project up to other modalities as well.
Dynabench can be considered as a scientific experiment to accelerate progress in AI research. The researchers say they hope it will help the AI community build systems that make fewer mistakes, are less subject to harmful biases, and are more useful and beneficial to people in the real world.
Reporter: Yuan Yuan | Editor: Michael Sarazen
Synced Report | A Survey of China’s Artificial Intelligence Solutions in Response to the COVID-19 Pandemic — 87 Case Studies from 700+ AI Vendors
This report offers a look at how China has leveraged artificial intelligence technologies in the battle against COVID-19. It is also available on Amazon Kindle. Along with this report, we also introduced a database covering an additional 1,428 artificial intelligence solutions across 12 pandemic scenarios.