Every three months, participants in the Metaculus forecasting cup try to predict the future for a prize pot of about $5,000. Metaculus, a forecasting platform, poses questions of geopolitical importance such as “Will Thailand experience a military coup before September 2025?” and “Will Israel strike the Iranian military again before September 2025?”
Forecasters estimate the probabilities of the events occurring—a more informative guess than a simple “yes” or “no”—weeks to months in advance, often with remarkable accuracy. Metaculus users correctly predicted the date of the Russian invasion of Ukraine two weeks in advance and gave Roe v. Wade a 90 percent chance of being overturned almost two months before it happened.
Still, one of the top 10 finishers in the Summer Cup, whose winners were announced Wednesday, was surprising even to the forecasters: an AI. “It’s actually kind of mind-blowing,” says Toby Shevlane, CEO of Mantic, the recently announced U.K.-based startup that developed the AI. When the competition opened in June, participants predicted that the top bot’s score would be 40% of the top human performers’ average. Instead, Mantic achieved over 80%.
“Forecasting—it’s everywhere, right?” says Nathan Manzotti, who has worked on AI and data analytics for the Department of Defense and General Services Administration, along with about a half dozen other U.S. government agencies. “Pick a government agency, and they definitely have some kind of forecasting going on.”
Forecasters help institutions anticipate the future, explains Anthony Vassalo, co-director of the Forecasting Initiative at RAND, a nonprofit think tank that does much of its research for the U.S. government. Forecasting also helps them change it. Predicting geopolitical events weeks or months in advance helps “stop surprise” and “assist decision makers in being able to make decisions,” Vassalo says. Forecasters update their predictions as lawmakers enact policies, so they can estimate how a hypothetical policy intervention is likely to change future outcomes. If decision makers are on an undesirable track, forecasters can help them “change the scenario they’re in,” says Vassalo.
But forecasting broad geopolitical questions is notoriously hard. Forecasts from top forecasters can take days and cost tens of thousands of dollars for a single question. For organizations like RAND, which track multiple topics across many geopolitical zones, “it would take months to have human forecasters do an initial forecast on all those questions, let alone update them regularly,” says Vassalo.
Machine learning has long been useful in domains with copious, well-structured data, like weather forecasting or quant fund trading. When forecasting geopolitics or technological advancements, “you will have a lot of complex, interdependent factors that human judgment can be both more accessible and affordable” in predicting, says Deger Turan, CEO of Metaculus.
Large language models work with the same messy information as human forecasters, and are able to simulate this human judgment. They are also improving in much the same way that humans do: by making predictions on many questions, seeing how they play out, and updating their forecasting methods based on the outcomes—on a much larger scale than humans are capable of.
“Our main insight was actually predicting the future tends to be a verifiable problem, because that’s like, how humans learn, right?” says Ben Turtel, CEO of LightningRod, which develops AIs for forecasting that have placed competitively in Metaculus AI tournaments. The company trained a recent model on 100,000 forecasting questions.
The training that AIs receive is showing up in the rankings. In June, the top-ranked bot, built by Metaculus on top of OpenAI’s o1 reasoning model, came 25th in the cup. This time, Mantic is eighth out of 549 contestants—the first time a bot has placed in the top 10 in the competition series.
The result should be taken with a grain of salt, according to Ben Wilson, an engineer at Metaculus who runs comparisons of AIs and humans on forecasting challenges. The contest contains a relatively small sample of 60 questions. Moreover, most of the contestants are amateurs, some of whom predict only a handful of questions in the tournament, leaving them with low scores.
Finally, the machines have an unfair advantage. Participants win points not only for accuracy but also for “coverage”: how early they make predictions, how many questions they predict on, and how often they update their estimates. An AI that is less accurate than its human competitors can still do well in the rankings by constantly updating its estimates in response to emerging news, in a way that is infeasible for humans.
For Vassalo, AIs’ unfair advantage solves his biggest remaining problem: getting high quality forecasts across all of the questions he needs predictions for. “I actually don’t need it to be able to get to the level of a superforecaster,” he says, using the moniker given to the top forecasters. “I need it to be as good as the crowd.”
This is harder than it sounds: the Metaculus Community Prediction, an aggregate of all users’ forecasts on every question, is one of the most consistent performers on the platform. If it were a person, it would rank fourth on the site—such is the wisdom of the crowd. In the Summer Cup, Mantic trailed the Community Prediction by five places.
A reliable AI forecaster could track hundreds of questions simultaneously, allowing Vassalo to deploy top human forecasters only against those questions that the AI deems worthy of further scrutiny.
“The one thing about forecasting, or predictive analytics, is that it’s decision support,” says Manzotti. “A lot of leadership will throw the data out the window if they have a gut feeling in a different direction.” That’s a problem that AI can’t solve.