Notes on ForecastBench

Notes from my review of the paper, "ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities."

TL;DR: LLMs are no better than the median human forecaster. For now.


ForecastBench is a benchmark for measuring how accurately machine learning systems answer forecasting questions.

Static evaluation methods using historical data from after a model’s knowledge cutoff have drawbacks:

  1. Benchmarks quickly become obsolete as newer models ship with later knowledge cutoffs.
  2. Knowledge-cutoff dates are rough estimates and are often inaccurate.
  3. Model makers may exaggerate their models’ accuracy on such benchmarks.

Instead, ForecastBench uses a dynamic benchmark that is updated daily as markets are resolved, with new forecasting questions updated every two weeks. This is their data pipeline:

  1. Sample 1,000 forecasting questions from a much larger question bank.
  2. The question bank contains questions from multiple reliable prediction markets and datasets:
    1. Prediction markets: randforecastinginitiative.org, manifold.markets, metaculus.com, polymarket.com
    2. Datasets: ACLED, DBnomics, FRED, Wikipedia, Yahoo Finance. Questions are created from these datasets using predefined templates.
  3. These questions are filtered and enhanced before being added to the question bank:
    1. Removes low-liquidity market questions.
    2. Adds more context to each question using a small LLM.
    3. Uses templates to add questions based on the datasets.
    4. Combines questions ($\binom{N}{2}$ combinations) to form more questions.
  4. The actual questions are sampled from the question bank every two weeks, and benchmarks are run across seven LLM baselines:
    1. Zero-shot prompting.
    2. Prompting with scratchpad instructions.
    3. Scratchpad prompting with retrieved news articles.
    4. Zero-shot prompting with crowd forecasts.
    5. Scratchpad prompting with crowd forecasts.
    6. Scratchpad prompting with retrieved news articles and crowd forecasts.
    7. Aggregated predictions from multiple LLMs.
  5. In addition to these LLM baselines, they use two human baselines — 500 human forecasters and 39 “superforecasters” — who answer a random 200-question subset of the questions.
  6. The results are finalized using the Brier score as a metric. The current implementation utilizes a difficulty-adjusted Brier score to account for variability in question set difficulty.
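The question-combination step above (step 3.4) can be sketched with `itertools.combinations`. The paper doesn’t spell out how the combined questions are phrased, so the joint “both resolve YES” framing and the placeholder questions below are my assumptions:

```python
from itertools import combinations

# Hypothetical single questions (placeholders, not real ForecastBench items).
questions = [
    "Will X happen by 2026?",
    "Will Y happen by 2026?",
    "Will Z happen by 2026?",
]

# Each unordered pair becomes one new combined question, yielding
# N-choose-2 extra questions on top of the originals.
combined = [f"{a} AND {b}" for a, b in combinations(questions, 2)]

print(len(combined))  # 3, i.e. C(3, 2) = N * (N - 1) / 2
```

For the full question bank this grows quadratically, which is how a modest set of base questions becomes a much larger pool to sample from.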

The results show that LLMs (0.122) are only on par with (or slightly worse than) the median human forecaster (0.121), and significantly worse than expert human superforecasters (0.096).
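For reference, the (unadjusted) Brier score behind these numbers is just the mean squared error between probabilistic forecasts and binary outcomes — lower is better. A minimal sketch; the paper’s difficulty-adjusted variant additionally normalizes for question difficulty, which is not reproduced here:

```python
def brier(forecasts, outcomes):
    """Mean squared difference between probabilistic forecasts (0..1)
    and binary outcomes (0 or 1). Lower is better; always answering
    0.5 scores 0.25."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# A confident, correct forecaster beats a hedger:
print(brier([0.9, 0.8], [1, 1]))  # 0.025
print(brier([0.5, 0.5], [1, 1]))  # 0.25
```

This is why small absolute differences (0.122 vs. 0.096) matter: the score is bounded between 0 (perfect) and 1 (confidently wrong), and strong forecasters cluster near the low end.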

This suggests that there’s a lot of value to be gained when someone develops an agent/model that can make more accurate forecasts, at least on a level comparable to that of the superforecasters.

Last updated at 10:52 AM, 17th March 2026