The TL;DR initialism (“too long; didn’t read”) has flooded social media, where it is used to politely or pointedly inform someone that whatever content they’ve posted will not be read because there’s just too much of it. This is also often the case with scientific papers, some of which can bore even a serious researcher to tears.
Now there’s help. A team from the Allen Institute for Artificial Intelligence and the University of Washington this week introduced TLDR generation, a new automatic summarization task for scientific papers. The researchers also provide an associated dataset and propose a multitask learning approach for generating TLDRs using pretrained language models.
Identifying research highlights in a paper while maintaining context and meaning requires expert background knowledge and complex domain-specific language understanding. TLDRs present an extreme summary of a scientific paper and are an alternative to paper abstracts. They leave out nonessential background and methodological details and capture the key aspects of the paper, such as its main contributions.
To support TLDR generation, the researchers introduced SCITLDR — a multi-target dataset comprising 3,935 TLDRs of scientific articles in the Computer Science domain. These include author-written summaries from the OpenReview publishing platform and a test set augmented with human-written summaries derived entirely from papers’ peer review comments.
Most summarization datasets provide a single “gold summary” for a given document, which the research team considered overly simplistic, especially for a high-compression task like TLDR generation of scientific papers. The SCITLDR dataset instead provides multiple gold TLDRs for each paper in the test set: one written by the paper’s authors and one or more derived from peer review comments.
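To make the multi-target idea concrete, here is a minimal sketch of how a generated TLDR might be scored when a paper has several gold TLDRs. The article does not say how the researchers aggregate scores across references, so taking the best match is an assumption, and the simple unigram-overlap F1 below is a stand-in for a real summarization metric, not the team's evaluation code. The function names are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized, lowercased strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_reference_score(candidate: str, references: list[str]) -> float:
    """Score a candidate against every gold TLDR and keep the best match.

    Assumption: max-over-references aggregation; requires at least one reference.
    """
    return max(unigram_f1(candidate, ref) for ref in references)


# Hypothetical example: one author-written and one review-derived gold TLDR.
gold_tldrs = [
    "we introduce a new summarization task for scientific papers",
    "the paper presents a dataset of very short paper summaries",
]
score = best_reference_score("a new summarization task for papers", gold_tldrs)
```

With more than one acceptable reference, a generated summary is not penalized for matching the reviewer-derived wording rather than the authors' wording.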
The researchers also propose a training strategy for adapting pretrained language models. The strategy exploits similarities between TLDR generation and the related task of title generation, and in the team’s tests it outperformed both extractive and abstractive state-of-the-art summarization baselines.
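One common way to train a single model on two related generation tasks is to mix examples from both and mark each input with a task tag. The sketch below illustrates that general pattern for TLDR and title generation. It is an illustration under assumptions, not the researchers' implementation: the tag strings, the `Example` class, and the function name are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Example:
    source: str  # paper text followed by a task tag
    target: str  # the TLDR or the title the model should generate


# Hypothetical task tags; the article does not specify any.
TLDR_TAG = "<TLDR>"
TITLE_TAG = "<TITLE>"


def build_multitask_examples(tldr_pairs, title_pairs, seed=0):
    """Mix (paper text, TLDR) and (paper text, title) pairs into one training set."""
    examples = [Example(f"{text} {TLDR_TAG}", tldr) for text, tldr in tldr_pairs]
    examples += [Example(f"{text} {TITLE_TAG}", title) for text, title in title_pairs]
    # Shuffle with a fixed seed so the two tasks are interleaved reproducibly.
    random.Random(seed).shuffle(examples)
    return examples


# Hypothetical toy data.
mixed = build_multitask_examples(
    tldr_pairs=[("Paper A abstract ...", "Short summary of paper A.")],
    title_pairs=[("Paper B abstract ...", "Title of Paper B")],
)
```

At inference time, the same tag appended to a new paper would tell the model which kind of output to produce.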
The research team is encouraging the broader scientific community to contribute to the project. Potential future research directions include explicitly modelling readers’ background knowledge to create personalized TLDRs, leveraging the wide use of scientific TLDRs on social media, and expanding to other languages and scientific domains.
The paper TLDR: Extreme Summarization of Scientific Documents is on arXiv.
Journalist: Yuan Yuan | Editor: Michael Sarazen