To help make the world’s largest free scientific paper repository even more accessible, arXiv announced yesterday that all of its research papers are now available on Kaggle.
Launched in 1991 by Paul Ginsparg as a preprint physics archive, arXiv is hosted by Cornell University and has become an indispensable platform providing free and open access to research for the computer science and machine learning communities and beyond. Its new collaboration with Kaggle, the world’s largest data science community, provides a free and open pipeline to the machine-readable arXiv dataset of some 1.7 million articles.
The Kaggle dataset mirrors the original arXiv paper data, with each entry including:
id: arXiv ID (can be used to access the paper)
submitter: Who submitted the paper
authors: Authors of the paper
title: Title of the paper
comments: Additional info, such as number of pages and figures
journal-ref: Information about the journal the paper was published in
doi: [Digital Object Identifier](https://www.doi.org)
abstract: The abstract of the paper
categories: Categories / tags in the arXiv system
versions: A version history
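As a rough illustration of how an entry with the fields above might be consumed, the sketch below parses a single JSON record and derives the paper's abstract-page URL from its id. It assumes that each entry is serialized as one JSON object per line and that categories is a space-separated string; neither detail is stated in the announcement. The sample record is a made-up placeholder (its id is deliberately not a valid arXiv identifier), and `parse_record` is a hypothetical helper, not part of any Kaggle or arXiv API.

```python
import json

# Placeholder record using the field names listed above.
# All values are invented for illustration only.
SAMPLE_LINE = json.dumps({
    "id": "1234.56789",
    "submitter": "A. Submitter",
    "authors": "A. Author and B. Author",
    "title": "An  Example\n  Title",
    "comments": "10 pages, 3 figures",
    "journal-ref": None,
    "doi": None,
    "abstract": "A short example abstract.",
    "categories": "cs.LG stat.ML",  # assumed to be space-separated
    "versions": [],
})


def parse_record(line):
    """Turn one JSON line into a small, cleaned-up dictionary."""
    record = json.loads(line)
    return {
        "id": record["id"],
        # Collapse line breaks and repeated spaces inside the title.
        "title": " ".join(record["title"].split()),
        "categories": record["categories"].split(),
        # The id can be used to reach the paper's abstract page.
        "url": f"https://arxiv.org/abs/{record['id']}",
    }


paper = parse_record(SAMPLE_LINE)
print(paper["title"], paper["categories"], paper["url"])
```

For the full dataset, the same function could be applied line by line while streaming the file, which avoids loading all 1.7 million entries into memory at once.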
“By offering the dataset on Kaggle we go beyond what humans can learn by reading all these articles and we make the data and information behind arXiv available to the public in a machine-readable format,” said arXiv Executive Director Eleonora Presani in a press release.
Kaggle is a regular destination for data scientists and machine learning engineers seeking interesting datasets, public notebooks, information on competitions and so on. Researchers can utilize Kaggle’s extensive data exploration tools to share relevant scripts and outputs with others.
As a burgeoning knowledge-sharing platform, arXiv benefits from constant innovation regarding information presentation and interpretation, and Presani believes additional input from Kaggle’s massive user base can help push the limits of this innovation.
It’s hoped the arXiv and Kaggle collaboration will empower new use cases and lead to the exploration of richer machine learning techniques that combine multi-modal features in applications such as trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction, semantic search interfaces and more.
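A very simple starting point for the trend analysis mentioned above is to count how often each category tag appears across entries. The sketch below does this with Python's standard library. The three records are invented placeholders, `category_counts` is a hypothetical helper, and the space-separated format of the categories field is an assumption.

```python
from collections import Counter


def category_counts(records):
    """Count how many entries carry each category tag."""
    counts = Counter()
    for record in records:
        # Assumes categories is a space-separated string of tags.
        counts.update(record["categories"].split())
    return counts


# Invented placeholder entries; only the fields needed here are included.
records = [
    {"id": "1234.00001", "categories": "cs.LG stat.ML"},
    {"id": "1234.00002", "categories": "cs.LG"},
    {"id": "1234.00003", "categories": "hep-th"},
]

counts = category_counts(records)
print(counts.most_common(1))
```

Grouping the same counts by submission date would turn this into a per-period trend, though that would depend on how dates are stored in the versions field.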
The arXiv dataset is now available on Kaggle and will be updated weekly.
Reporter: Yuan Yuan | Editor: Michael Sarazen