To help make the world’s largest free scientific paper repository even more accessible, arXiv announced yesterday that all of its research papers are now available on Kaggle.
Launched in 1991 by Paul Ginsparg as a preprint physics archive, arXiv is hosted by Cornell University and has become an indispensable platform providing free and open access to research for the computer science and machine learning communities and beyond. Its new collaboration with Kaggle, the world’s largest data science community, provides a free and open pipeline to the machine-readable arXiv dataset of some 1.7 million articles.
The Kaggle dataset mirrors the original arXiv paper data, with each entry including:
id: ArXiv ID (can be used to access the paper)
submitter: Who submitted the paper
authors: Authors of the paper
title: Title of the paper
comments: Additional info, such as number of pages and figures
journal-ref: Information about the journal the paper was published in
doi: Digital Object Identifier (https://www.doi.org)
abstract: The abstract of the paper
categories: Categories / tags in the arXiv system
versions: A version history
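To make the field list above concrete, here is a minimal Python sketch of reading such entries. It assumes the metadata is stored as one JSON object per line, which the announcement does not confirm. The `iter_records` helper and the sample record are hypothetical and made up for illustration; only the field names come from the list above.

```python
import io
import json
from typing import Dict, Iterator, TextIO

# Field names taken from the list in the article.
EXPECTED_FIELDS = (
    "id", "submitter", "authors", "title", "comments",
    "journal-ref", "doi", "abstract", "categories", "versions",
)


def iter_records(stream: TextIO) -> Iterator[Dict]:
    """Yield one metadata record per non-empty line.

    Assumption: the file is laid out as JSON Lines (one object per line).
    """
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up sample record, not real arXiv data. The shape of each value
# (for example, "versions" as a list of strings) is a guess.
SAMPLE = json.dumps({
    "id": "0000.00000",
    "submitter": "A. Submitter",
    "authors": "A. Submitter, B. Coauthor",
    "title": "An Example Title",
    "comments": "10 pages, 2 figures",
    "journal-ref": None,
    "doi": None,
    "abstract": "An example abstract.",
    "categories": "cs.LG stat.ML",
    "versions": ["v1"],
}) + "\n"

# An in-memory stream stands in for the downloaded file here.
records = list(iter_records(io.StringIO(SAMPLE)))
first = records[0]
missing = [name for name in EXPECTED_FIELDS if name not in first]
print(first["id"], first["title"], missing)
```

Reading line by line keeps memory use flat, which matters for a dataset of some 1.7 million entries. With a real download, the `io.StringIO` stand-in would be replaced by an opened file handle.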
“By offering the dataset on Kaggle we go beyond what humans can learn by reading all these articles and we make the data and information behind arXiv available to the public in a machine-readable format,” said arXiv Executive Director Eleonora Presani in a press release.
Kaggle is a regular destination for data scientists and machine learning engineers seeking interesting datasets, public notebooks, information on competitions and so on. Researchers can utilize Kaggle’s extensive data exploration tools to share relevant scripts and outputs with others.
As a burgeoning knowledge-sharing platform, arXiv benefits from constant innovation regarding information presentation and interpretation, and Presani believes additional input from Kaggle’s massive user base can help push the limits of this innovation.
It’s hoped the arXiv and Kaggle collaboration will empower new use cases and lead to the exploration of richer machine learning techniques that combine multi-modal features in applications such as trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction, semantic search interfaces and more.
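As a rough illustration of the simplest of these use cases, the sketch below tallies how often each category tag appears, a first step toward trend analysis. It assumes the `categories` field holds space-separated tags in a single string, which the article does not specify. The `count_categories` function and the toy records are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def count_categories(records: Iterable[Dict]) -> Counter:
    """Count how many records carry each category tag."""
    counts: Counter = Counter()
    for record in records:
        # Assumption: tags are separated by whitespace in one string.
        counts.update(record.get("categories", "").split())
    return counts


# Toy records invented for illustration, not real arXiv entries.
toy_records = [
    {"id": "a", "categories": "cs.LG stat.ML"},
    {"id": "b", "categories": "cs.LG"},
    {"id": "c", "categories": "hep-th"},
]

counts = count_categories(toy_records)
print(counts.most_common(1))  # [('cs.LG', 2)]
```

Grouping the same counts by submission date would turn this tally into a per-category trend over time.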
The arXiv dataset is now available on Kaggle and will be updated weekly.
Reporter: Yuan Yuan | Editor: Michael Sarazen