In response to the COVID-19 pandemic, the White House on Monday joined a number of research groups to announce the release of the COVID-19 Open Research Dataset (CORD-19) of scholarly literature about COVID-19, SARS-CoV-2, and the Coronavirus group. The release came with an urgent call to action to the world’s AI experts to “develop new text and data mining techniques that can help the science community answer high-priority scientific questions related to COVID-19.”
A publicly available and machine-readable dataset, CORD-19 consists of over 29,000 scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses, including over 13,000 with full text.
Total confirmed cases of COVID-19 worldwide had surged to 190,535 as of March 17, according to a Johns Hopkins University map. The World Health Organization (WHO) said yesterday that the total number of cases and deaths outside China had overtaken the total number of cases in China.
Meanwhile, the rapid acceleration in new coronavirus literature has researchers struggling to keep up.
“It’s difficult for people to manually go through more than 20,000 articles and synthesize their findings. Recent advances in technology can be helpful here. We’re putting machine readable versions of these articles in front of our community of more than 4 million data scientists. Our hope is that AI can be used to help find answers to a key set of questions about COVID-19,” said Anthony Goldbloom, Co-Founder and Chief Executive Officer at Kaggle.
The CORD-19 dataset challenge hosted on Kaggle defines 10 tasks based on key scientific questions developed in coordination with the WHO and the National Academies of Sciences, Engineering, and Medicine’s Standing Committee on Emerging Infectious Diseases and 21st Century Health Threats. Questions include, for example, “What is known about transmission, incubation, and environmental stability?” and “What has been published about information sharing and inter-sectoral collaboration?”
The tasks are detailed on Kaggle. Submissions must be contained in a single notebook made public on or before the submission deadline. Participants are free to use other datasets in addition to CORD-19, but those datasets must also be publicly available on Kaggle, Allen.ai, or Semantic Scholar.
The CORD-19 dataset was built by the Allen Institute for Artificial Intelligence, the Chan Zuckerberg Initiative, Georgetown University’s Center for Security and Emerging Technology, Microsoft Research, the National Library of Medicine at the National Institutes of Health, and the Google-owned Kaggle AI platform, in coordination with the White House Office of Science and Technology Policy.
“It’s all hands on deck as we face the COVID-19 pandemic,” said Dr. Eric Horvitz, Chief Scientific Officer at Microsoft. “We need to come together as companies, governments, and scientists and work to bring our best technologies to bear across biomedicine, epidemiology, AI, and other sciences. The COVID-19 literature resource and challenge will stimulate efforts that can accelerate the path to solutions on COVID-19.”
Journalist: Yuan Yuan | Editor: Michael Sarazen