In a boon to AI researchers, the last year witnessed an unprecedented open-sourcing of large datasets by popular AI research projects.
In the conclusion to our year-end series, Synced spotlights ten datasets that were open sourced in 2018 and takes a peek into the papers behind them. We hope this list can provide the AI community with insight into what 2019 might hold in store for big data.
Open Images V4
From Google AI, open-sourced on April 30.
Open Images V4 is a dataset of images with unified annotations for image classification, object detection and visual relationship detection. The dataset contains 15.4M bounding boxes for 600 categories on 1.9M images, making it the largest existing dataset with object location annotations.
The paper Open Images Dataset V4: Unified Image Classification, Object Detection, and Visual Relationship Detection at Scale is on arXiv.
MURA
The Stanford University ML Group, led by Stanford Professor and famed machine learning guru Andrew Ng, released the dataset on May 24.
MURA (musculoskeletal radiographs) is a large dataset of bone X-rays that can be used to train algorithms tasked with detecting abnormalities in X-rays. MURA is believed to be the world’s largest public radiographic image dataset with 40,561 labeled images. After open sourcing MURA, the Stanford research team created a competition to see if community models could perform as well as radiologists on abnormality detection tasks. So far, the best model score on the competition leaderboard (0.843, measured as Cohen’s kappa) has surpassed the best radiologist score (0.778).
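To make the two scores above easier to interpret: to our understanding, the MURA leaderboard reports agreement with the reference labels as Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a minimal illustration of that statistic, not the competition's evaluation code; the function name `cohen_kappa` and the toy labels are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items that received the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical toy labels (1 = abnormal, 0 = normal); not MURA data.
reference = [1, 0, 1, 1, 0, 0, 1, 0]
model = [1, 0, 1, 0, 0, 0, 1, 1]
kappa = cohen_kappa(reference, model)
print(round(kappa, 3))
```

On these toy labels the raw agreement is 75 percent, but half of that is expected by chance, so kappa is lower than the raw figure.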
The paper MURA: Large Dataset for Abnormality Detection in Musculoskeletal Radiographs is on arXiv.
BDD100K
From UC Berkeley BAIR, Georgia Institute of Technology, Peking University and Uber AI Labs. Open-sourced May 30.
BDD100K is a driving dataset which is an order of magnitude larger than previous efforts, comprising videos with diverse kinds of annotations including image level tagging, object bounding boxes, drivable areas, lane markings, and full-frame instance segmentation. The dataset possesses geographic, environmental, and weather diversity, which is useful for training models so that they are less likely to be surprised by new conditions.
The dataset contains over 100k videos of driving experience, each running 40 seconds at 30 frames per second. The total image count is 800 times larger than Baidu ApolloScape (released March 2018), 4,800 times larger than Mapillary and 8,000 times larger than KITTI. The videos were collected from some 50k trips on the streets and highways of New York, the San Francisco Bay Area and other regions, and come with GPS/IMU information illustrating driving paths. They were recorded at different times of the day and in various weather conditions, including sunny, overcast, and rainy.
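As a rough sense of scale, the video figures quoted above can be multiplied out. This is only back-of-the-envelope arithmetic on the article's numbers; it treats "over 100k" as a lower bound and makes no claim about how the comparisons with other datasets were computed.

```python
# Figures quoted in the article; "over 100k" is treated as a lower bound.
NUM_VIDEOS = 100_000
SECONDS_PER_VIDEO = 40
FRAMES_PER_SECOND = 30

frames_per_video = SECONDS_PER_VIDEO * FRAMES_PER_SECOND
total_frames = NUM_VIDEOS * frames_per_video
total_hours = NUM_VIDEOS * SECONDS_PER_VIDEO / 3600

print(f"{frames_per_video:,} frames per video")
print(f"{total_frames:,} frames in total (lower bound)")
print(f"{total_hours:,.0f} hours of video (lower bound)")
```

That works out to 1,200 frames per clip, at least 120 million raw frames, and more than 1,100 hours of footage.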
The paper BDD100K: A Diverse Driving Video Database with Scalable Annotation Tooling is on arXiv.
SQuAD 2.0
From Stanford University Computer Science Department, released June 11.
The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. The answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. SQuAD 2.0 combines the 100,000 questions in SQuAD 1.1 with over 50,000 new, unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
SQuAD 2.0 is a challenging natural language understanding task for existing models. For example, a strong neural system that scores 86 percent F1 on SQuAD 1.1 achieves only 66 percent F1 on SQuAD 2.0.
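For readers unfamiliar with the F1 figures above, here is a minimal sketch of a token-overlap F1 in the style used for SQuAD, including the all-or-nothing treatment of unanswerable questions. It is a simplified illustration rather than the official evaluation script: the function names and example strings are hypothetical, and it scores a single prediction against a single reference.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def token_f1(prediction, gold):
    """Token-overlap F1; an empty string stands for 'no answer'."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # With a no-answer on either side, the score is 1 only if both are empty.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical examples, not drawn from SQuAD.
print(token_f1("the Eiffel Tower", "Eiffel Tower"))
print(token_f1("in Paris, France", "Paris"))
print(token_f1("", ""))        # correctly abstaining
print(token_f1("Paris", ""))   # answering an unanswerable question
```

The last case shows why adversarial unanswerable questions hurt: a model that always extracts some span scores zero on every question that has no answer.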
The paper Know What You Don’t Know: Unanswerable Questions for SQuAD is on arXiv.
CoQA
From Stanford University Computer Science Department, open-sourced August 21.
CoQA is a large-scale dataset for building Conversational Question Answering systems. The CoQA challenge measures the ability of machines to understand a text passage and answer a series of interconnected questions that appear in a conversation.
CoQA contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. The questions are conversational and the answers are free-form text with their corresponding evidence highlighted in the passage.
The paper CoQA: A Conversational Question Answering Challenge is on arXiv.
Spider
From Yale University, open-sourced September 24.
Spider is a large-scale, complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases.
Spider consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains. In Spider 1.0, different complex SQL queries and databases appear in train and test sets.
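To illustrate what a text-to-SQL pair over a multi-table database looks like, the sketch below pairs a natural language question with a SQL query and runs it against an in-memory SQLite database. The schema, rows, question and query are all hypothetical and are not taken from Spider; they only show the kind of join-and-aggregate query such a system would need to generate.

```python
import sqlite3

# Hypothetical two-table schema for illustration; not taken from Spider.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE concert (concert_id INTEGER PRIMARY KEY, singer_id INTEGER, year INTEGER);
INSERT INTO singer VALUES (1, 'Ana', 'Spain'), (2, 'Bo', 'China'), (3, 'Cy', 'Spain');
INSERT INTO concert VALUES (10, 1, 2017), (11, 1, 2018), (12, 2, 2018), (13, 3, 2016);
""")

question = "Which singers performed in more than one concert?"
# The kind of multi-table query a text-to-SQL model would need to produce.
predicted_sql = """
SELECT s.name
FROM singer AS s
JOIN concert AS c ON c.singer_id = s.singer_id
GROUP BY s.singer_id
HAVING COUNT(*) > 1
"""
rows = conn.execute(predicted_sql).fetchall()
print(question, "->", rows)
conn.close()
```

Because the databases in the test set differ from those seen in training, a model cannot memorize table or column names and has to ground the question in whatever schema it is given.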
The paper Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task is on arXiv.
HotpotQA
Carnegie Mellon University, Stanford University, and the Montreal Institute for Learning Algorithms released the dataset September 25.
HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. The dataset is composed of 113,000 QA pairs based on Wikipedia, one of the world’s largest and most reliable general reference websites.
The paper HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering is on arXiv.
Tencent ML – Images
Tencent AI Lab released the dataset on October 18.
Tencent ML – Images is the largest open-source multi-label image dataset, comprising 17,609,752 training and 88,739 validation image URLs annotated with up to 11,166 categories. The release also includes a ResNet-101 model pre-trained on ML-Images, which achieves 80.73 percent top-1 accuracy on ImageNet via transfer learning.
More information on Tencent ML – Images can be found at GitHub.
Tencent AI Lab Embedding Corpus for Chinese Words and Phrases
The Tencent AI Lab dataset was open-sourced October 19.
The dataset aims to provide large-scale and high-quality support for deep learning-based Chinese language NLP research in both academic and industrial applications. The corpus provides 200-dimension vector representations, a.k.a. embeddings, for over 8 million Chinese words and phrases, which are pre-trained on large-scale, high-quality data. These vectors, capturing semantic meanings for Chinese words and phrases, can be widely applied in many downstream Chinese processing tasks (e.g., named entity recognition and text classification) and in further research.
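As a sketch of how such embeddings are typically used downstream, the snippet below parses a word2vec-style text stream and compares entries by cosine similarity. The file layout (a "count dimension" header followed by one token and its values per line) is an assumption about the release format, and the three-dimensional vectors are made up for illustration; the real corpus uses 200 dimensions. The helper names `load_vectors` and `cosine` are hypothetical.

```python
import io
import math

def load_vectors(handle):
    """Parse a word2vec-style text stream: a 'count dim' header line,
    then one 'token v1 v2 ... v_dim' line per entry (assumed layout)."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip lines that do not match the assumed layout
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Made-up 3-dimensional toy vectors; the real corpus uses 200 dimensions.
sample = io.StringIO(
    "3 3\n"
    "北京 0.9 0.1 0.0\n"
    "上海 0.8 0.2 0.1\n"
    "苹果 0.0 0.1 0.9\n"
)
vecs = load_vectors(sample)
city_sim = cosine(vecs["北京"], vecs["上海"])
fruit_sim = cosine(vecs["北京"], vecs["苹果"])
print(city_sim > fruit_sim)
```

With more than 8 million entries, the full corpus would in practice be loaded with a dedicated library or memory-mapped rather than held in a plain dictionary of Python lists.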
The paper Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings was published at NAACL 2018.
fastMRI
On November 26, the NYU School of Medicine and Facebook AI Research released the largest-ever open-source MRI dataset, with the goal of speeding up MRI scans.
The dataset includes more than 1.5 million anonymous MRI images of the knee, drawn from 10,000 scans, and raw measurement data from nearly 1,600 scans. It is the result of a collaboration between the NYU School of Medicine Department of Radiology’s Center for Advanced Imaging Innovation and Research (CAI2R) and Facebook AI Research (FAIR), aimed at sharing open source tools and spurring the development of AI systems to make MRI scans 10 times faster.
The paper fastMRI: An Open Dataset and Benchmarks for Accelerated MRI is on arXiv.
Another huge bonus this year for data-hungry AI researchers was Google’s new “Dataset Search” tool, which enables a quick and easy survey of datasets stored across the Web through simple keyword search. Says Google: “We believe that this project will have the additional benefits of a) creating a data sharing ecosystem that will encourage data publishers to follow best practices for data storage and publication and b) giving scientists a way to show the impact of their work through citation of datasets that they have produced.
“As more dataset repositories use schema.org and similar standards to describe their datasets, the variety and coverage of datasets that users find in Dataset Search, will continue to grow.”
Journalist: Fangyu Cai | Editor: Michael Sarazen