In October 1859, a ferocious storm off the coast of Wales sank the clipper ship Royal Charter with the loss of 450 lives. In the wake of the disaster, scientist and Royal Navy officer Robert FitzRoy developed a system of weather charts he called “forecasts,” which for the first time provided advance warning of such conditions to improve the safety of ships at sea.
Whether it involves climate, pandemics or economics, the forecasting of future real-world events remains a crucial yet challenging task. Because effective forecasting is based in large part on dynamic information processing, AI researchers are now exploring the possibility of leveraging powerful large-scale language models to automate these tasks.
In the new paper Forecasting Future World Events with Neural Networks, a research team from UC Berkeley, MIT, the University of Illinois, and the University of Oxford presents Autocast, a dataset containing thousands of forecasting questions and an accompanying date-organized news corpus for measuring neural network models’ automatic forecasting capabilities. The team also curates IntervalQA, a dataset of numerical questions and metrics for calibration.
The team summarizes their main contributions as:
- We introduce Autocast, a dataset for forecasting that covers diverse topics (e.g. politics, economics, society, science) and varying time horizons.
- Part of our dataset is a large news corpus organized by date, allowing us to rigorously evaluate model performance on historical forecasts.
- We show that forecasting is challenging for current language models, with accuracy and calibration far below a strong human baseline.
To build their Autocast dataset, the team collected forecasting questions from three public forecasting tournaments — Metaculus, Good Judgment Open, and CSET Foretell — resulting in a total of 6,707 true/false, multiple-choice, or quantity/date questions covering a wide variety of topics (politics, economics and science) with broad public interest.
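To make the shape of such forecasting questions concrete, the sketch below models one Autocast-style record and a helper that checks whether a question is still open on a given date. The field names and the `is_open` helper are illustrative assumptions, not the dataset’s actual schema.

```python
# Hypothetical sketch of an Autocast-style question record; field names
# are illustrative assumptions, not the dataset's actual schema.
question = {
    "id": "G1",
    "question": "Will the US unemployment rate fall below 4% by December?",
    "qtype": "t/f",                # true/false, multiple-choice, or quantity/date
    "choices": ["yes", "no"],
    "publish_time": "2021-01-04",  # date the question opens
    "close_time": "2021-12-31",    # date the question resolves
    "answer": "no",
}

def is_open(q, date):
    """Return True if the forecasting question is still open on `date`.

    ISO-8601 date strings compare correctly as plain strings, so no
    datetime parsing is needed for this check.
    """
    return q["publish_time"] <= date <= q["close_time"]
```

A retrieval pipeline can use such a check to restrict a model to news published strictly before a question resolves, which is what makes evaluation on historical forecasts rigorous.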
The researchers first evaluated the QA model UnifiedQA-v2 (Khashabi et al., 2022) and the text-to-text framework T5 (Raffel et al., 2020) without retrieval, then evaluated retrieval-based methods to see whether selecting relevant articles from Autocast’s news corpus could improve model performance.
For retrieval, the team used a Fusion-in-Decoder (FiD; Izacard and Grave, 2021) model to encode articles retrieved by the lexical search method BM25 (Robertson et al., 1994; Thakur et al., 2021) with cross-encoder reranking. Between a given question’s open and close dates, the fine-tuned FiD model (with weights frozen) generates an embedding of each day’s top news article; these embeddings are then fed to an autoregressive large language model such as GPT-2.
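The first stage of this pipeline is plain lexical search. As a minimal, self-contained sketch of the Okapi BM25 scoring used there (whitespace tokenization, standard k1/b defaults; the real system would use a production retriever and a cross-encoder reranker on top):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Minimal illustration of the lexical-retrieval stage: higher scores
    mean a better term-overlap match between query and document.
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores

# Pick the best-matching article for one day's news, as the pipeline
# would before reranking and encoding with FiD.
day_articles = [
    "oil prices rise sharply on supply fears",
    "election results announced today",
    "oil output cut announced by producers",
]
scores = bm25_scores("oil prices", day_articles)
top_article = day_articles[scores.index(max(scores))]
```

Because BM25 rewards documents containing more of the query’s (rarer) terms, the first article, which matches both “oil” and “prices,” scores highest here.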
The results show that the retrieval-based methods substantially outperform UnifiedQA-v2 and T5, and that they become more effective as parameter count increases, indicating that larger models are better at learning to extract relevant information from retrieved articles.
Overall, this work shows that language models can be effectively trained on past forecasting questions by retrieving from a large news corpus. Although the results remain below the human expert baseline, increasing model size and enhancing information retrieval can improve performance. The team believes Autocast’s novel approach for enabling large language models to forecast future world events could bring significant practical benefits across a wide range of applications.
Author: Hecate He | Editor: Michael Sarazen