The number of AI-related research papers has skyrocketed in recent years, and the burden this has placed on reviewers at major academic conferences has been well-documented. The trend shows no sign of slowing, and this led a bold Carnegie Mellon University (CMU) team to explore the prospect of using AI to review AI papers.
Yes, the researchers acknowledge that automating paper reviews is a “crazy idea,” but the ReviewAdvisor project introduced in their paper “Can We Automate Scientific Reviewing?” has revealed some interesting things. Here’s a very meta example: a review of the new CMU paper that was generated by the paper-review model itself:
“This paper proposes to use NLP models to generate reviews for scientific papers. The model is trained on the ASAP-Review dataset and evaluated on a set of metrics to evaluate the quality of the generated reviews. It is found that the model is not very good at summarizing the paper, but it is able to generate more detailed reviews that cover more aspects of the paper than those created by humans. The paper also finds that both human and automatic reviewers exhibit varying degrees of bias and biases, and that the system generate more biased reviews than human reviewers.”
As seen above and in the experimental results, the proposed ReviewAdvisor can often capture and explain a paper’s core idea with some precision. Overall, however, the researchers concede that the system also tends to generate non-factual statements in its paper assessments, “which is a serious flaw in a high-stakes setting such as reviewing.” They suggest that such systems could still be useful as a starting point for human reviewers and as a guide for junior reviewers.
The team approached the challenge of automating paper reviews by first defining what a good review is. They referenced guidelines from top academic conferences such as ICML, NeurIPS, ICLR and other resources, summarizing the most frequently mentioned qualities of a good review as follows:
- Decisiveness: A good review should take a clear stance, selecting high-quality submissions for publication and suggesting others not be accepted.
- Comprehensiveness: A good review should be well-organized, typically starting with a brief summary of the paper’s contributions, then following with opinions gauging the quality of a paper from different aspects.
- Justification: A good review should provide specific reasons for its assessment, particularly whenever it states that the paper is lacking in some aspect.
- Accuracy: A review should be factually correct, with the statements contained therein not being demonstrably false.
- Kindness: A good review should be kind and polite in language use.
The researchers then built a dataset, ASAP-Review (Aspect-enhanced Peer Review), comprising machine learning papers from ICLR (2017-2020) and NeurIPS (2016-2019), with each paper’s reviews and decisions also included.
But how to annotate such a dataset, since reviews often involve both objective and subjective aspects?
The team proposed that review generation could be framed as a task of aspect-based scientific paper summarization. Unlike previous works designed to generate automatic summaries of scientific papers, ReviewAdvisor also assesses specific aspects of a paper. Following the ACL (Association for Computational Linguistics) review guidelines, the researchers identified eight aspects for annotators to take into account: Summary (SUM), Motivation/Impact (MOT), Originality (ORI), Soundness/Correctness (SOU), Substance (SUB), Replicability (REP), Meaningful Comparison (CMP) and Clarity (CLA).
Human annotators processed about five percent of the dataset’s more than 20,000 reviews, and an aspect tagger based on a pretrained BERT language model and a multi-layer perceptron annotated the remainder. Because BART, a denoising autoencoder for pretraining sequence-to-sequence models, has shown excellent performance on multiple generation tasks, the team used pretrained BART models in the last step of their workflow to process input papers and generate reviews.
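To make the eight-aspect label set concrete, here is a minimal Python sketch. The aspect codes come from the list above; the `Aspect` enum, the `aspects_in_review` helper and the `(text, code)` span format are hypothetical names invented for illustration, not the schema used in the ASAP-Review dataset.

```python
from enum import Enum


class Aspect(Enum):
    """The eight review aspects named in the article, keyed by short code."""
    SUM = "Summary"
    MOT = "Motivation/Impact"
    ORI = "Originality"
    SOU = "Soundness/Correctness"
    SUB = "Substance"
    REP = "Replicability"
    CMP = "Meaningful Comparison"
    CLA = "Clarity"


def aspects_in_review(tagged_spans):
    """Return the set of aspects found in a list of (text, code) pairs.

    The (text, code) format is an assumption made for this sketch only.
    """
    return {Aspect[code] for _, code in tagged_spans}


# Illustrative spans; the labels were assigned by hand for this example.
example = [
    ("This paper proposes to use NLP models to generate reviews.", "SUM"),
    ("The writing is easy to follow.", "CLA"),
]
found = aspects_in_review(example)
print(sorted(a.name for a in found))  # ['CLA', 'SUM']
```

Representing the aspects as a closed enum means a mistyped code raises a `KeyError` rather than silently creating a ninth category.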
The researchers observed that the system-generated reviews covered more aspects of a paper than those created by humans and also referenced supporting sentences from the paper. They suggest this ability could deliver preliminary templates to help reviewers more quickly locate critical information in papers.
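One simple way to quantify “covering more aspects” is the fraction of the eight aspect codes that appear at least once in a review. The Python sketch below illustrates that idea only; the `aspect_coverage` function is a hypothetical helper, not necessarily the metric the researchers used, and the tag lists are made-up toy data rather than results from the paper.

```python
# The eight aspect codes listed in the article.
ASPECTS = ("SUM", "MOT", "ORI", "SOU", "SUB", "REP", "CMP", "CLA")


def aspect_coverage(review_tags):
    """Fraction of the eight aspects that occur at least once in a review.

    `review_tags` is a list of aspect codes, one per tagged span;
    repeated codes count only once and unknown codes are ignored.
    """
    covered = set(review_tags) & set(ASPECTS)
    return len(covered) / len(ASPECTS)


# Toy inputs invented for illustration (not data from the paper).
human_tags = ["SUM", "CLA", "CLA", "SOU"]
system_tags = ["SUM", "MOT", "ORI", "SOU", "CLA"]

print(aspect_coverage(human_tags))   # 0.375
print(aspect_coverage(system_tags))  # 0.625
```

A score like this says nothing about whether the comments on each aspect are correct, which is why the factuality concerns raised by the researchers still apply.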
Reporter: Fangyu Cai | Editor: Michael Sarazen