Will there be substantive issues with Safe AI's claim to forecast better ...
Some way in which the claim is not true in a common-sense way. Things that would resolve this YES (draft):
- We look at the data and it turns out that information was leaked to the LLM somehow
- The questions were selected in a way that chose easy questions
- The date of forecast was somehow chosen so as to benefit the LLM
- This doesn't continue working over the next year of questions (measured against the last year of the Metaculus crowd, i.e. the crowd can't win just by getting more accurate)
- The AI was just accessing forecasts and parroting them
Here is the current state of this discussion using votes from here and LessWrong. Seems like there is a lot we agree on.
Thread finds much worse performance and names a few issues:

The results in "LLMs Are Superhuman Forecasters" don't hold when given another set of forecasting questions. I used their codebase (models, prompts, retrieval, etc.) to evaluate a new set of 324 questions, all opened after November 2023. Findings:
- Their Brier score: .195
- Crowd Brier score: .141
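For reference, the Brier score is the mean squared error between probability forecasts and binary outcomes, so lower is better and a constant 50% forecast scores 0.25. A minimal sketch, not taken from the codebase in question:

```python
import numpy as np

def brier_score(forecasts, outcomes):
    """Mean squared error between probabilistic forecasts and binary outcomes."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((forecasts - outcomes) ** 2))

# Toy values, not the actual question set:
print(brier_score([0.9, 0.2, 0.7], [1, 0, 1]))  # ~0.047
```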
First issue:
The authors assumed that GPT-4o/GPT-4o-mini has a knowledge cut-off date of October 2023. However, this is not correct. For example, GPT-4o knows that Mike Johnson replaced Kevin McCarthy as Speaker of the House.
1. This event happened at the end of October.
2. This also happens to be a question in the Metaculus dataset.
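A probe like this is easy to reproduce. A minimal sketch, assuming the OpenAI Python SDK (v1.x) and an `OPENAI_API_KEY` in the environment; the commenter's exact prompt is not given:

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Without using any tools, who is the current Speaker of the "
                   "US House of Representatives, and who did they replace?",
    }],
    temperature=0,
)
print(resp.choices[0].message.content)
# If the model answers "Mike Johnson, replacing Kevin McCarthy" (an event from
# late October 2023), its effective knowledge extends at least that far.
```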
I made a poll to test the views of this comment section (and possibly LessWrong) so we can figure out ways to go forward. It takes 2 minutes to fill in. Do we want a new market on whether it will beat the crowd on future questions, somehow?
Reflecting on the initial probability, it's important to consider the base rate of any two specific teams meeting in the Super Bowl. Historically, the probability of any two specific teams from the same conference meeting in the Super Bowl is quite low due to the number of variables and potential upsets in the playoffs. The Chiefs and Bills are both top contenders, but the AFC's competitiveness and the single-elimination format of the playoffs reduce the likelihood of both teams making it through. The initial probability of 0.08 (8%) seems reasonable given the strengths of both teams but also the inherent uncertainties and challenges they face. Considering the base rates and the specific strengths and challenges of the Chiefs and Bills, the final probability should be slightly adjusted to account for the competitive nature of the AFC and the playoff structure.
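The base-rate reasoning in this excerpt amounts to multiplying each team's chance of reaching the game in question. A toy illustration with made-up inputs (the excerpt gives no intermediate estimates), under an independence assumption that playoff brackets don't actually satisfy:

```python
# Toy inputs for illustration only; the excerpt does not state the model's
# intermediate estimates, and the two events are not truly independent.
p_chiefs_reach = 0.35  # assumed P(Chiefs reach the game in question)
p_bills_reach = 0.25   # assumed P(Bills reach the game in question)

p_meet = p_chiefs_reach * p_bills_reach  # independence assumption
print(f"{p_meet:.3f}")  # 0.087, the same ballpark as the 8% quoted above
```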
Finally, in the released codebase, we validate the article timestamps again before forecasting (and also reject all articles with unknown time).
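A minimal sketch of what such a validation step could look like; the released codebase may structure it differently:

```python
from datetime import datetime, timezone

def passes_time_check(article: dict, forecast_date: datetime) -> bool:
    """Keep an article only if its latest known timestamp predates the forecast date."""
    # Prefer the post-publication update time when available, else the publish time.
    timestamp = article.get("updated_time") or article.get("publish_time")
    if timestamp is None:
        return False  # reject articles with unknown time
    return timestamp < forecast_date

# Hypothetical article records:
articles = [
    {"publish_time": datetime(2023, 10, 20, tzinfo=timezone.utc), "updated_time": None},
    {"publish_time": None, "updated_time": None},  # unknown time, rejected
]
forecast_date = datetime(2023, 11, 1, tzinfo=timezone.utc)
valid = [a for a in articles if passes_time_check(a, forecast_date)]  # keeps only the first
```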
As an extra check, we also had GPT-4o look over all of the articles we used for each Metaculus forecast, checking whether the publish date or the content of the article leaked information past the forecast date of the model. We also manually looked through several dozen examples. We could not find any instances of contamination from news articles provided to the model. We also couldn't find any instances of prediction market information being included in model sources.
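A minimal sketch of what such a GPT-4o leak check could look like; the authors' actual prompt and pipeline details are not public:

```python
from openai import OpenAI  # assumes OpenAI Python SDK v1.x and OPENAI_API_KEY set

client = OpenAI()

def article_leaks_past_forecast_date(article_text: str, forecast_date: str) -> bool:
    """Ask GPT-4o whether an article reveals information past the forecast date."""
    prompt = (
        f"The forecast date is {forecast_date}. Does the article below mention "
        "events after that date, or include prediction-market or forecaster "
        "probabilities? Answer YES or NO.\n\n" + article_text
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```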
In light of the comments/scrutiny we've received, and the extra checks we did here, I'm much more confident in the veracity of our work. I commit to betting $1K mana on NO at market price over the next 24h to signal this. There were a few other sources of skepticism which we have not directly checked, such as claims that GPT-4o has factual knowledge which extends beyond its pretraining cutoff, though I am skeptical that a phenomenon like this will turn out to exist in a way which would significantly affect our results. The fact that our forecasts are retroactive always opens up the possibility for issues like this, and the gold standard is of course prospective forecasting, but I think we've managed to sanity-check and block sources of error to a reasonable degree.
I'm not sure what standard this market will use to decide whether/how stuff like Platt scaling/scoring operationalizations might count as a "substantive issue," but I'm decently confident that the substance of our work will hold up to historical scrutiny: scaffolded 2024-era language models appear to perform at/above the human crowd level on the Metaculus distribution of forecasting questions, within the bounds of the limitations we have described (such as poor model performance close to resolution). We also look forward to putting a more polished report with additional results on the arXiv at some point.
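For readers who haven't met the term: Platt scaling fits a logistic curve that maps a model's raw probabilities onto resolved outcomes, which is why applying it post hoc can be debated as a substantive issue. A toy sketch with made-up data, not the authors' operationalization:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logit(p):
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

# Made-up raw forecasts and resolutions, purely for illustration:
raw = np.array([0.2, 0.4, 0.6, 0.7, 0.9, 0.3])
outcomes = np.array([0, 0, 1, 1, 1, 0])

# Fit sigmoid(a * logit(p) + b) to the outcomes, then recalibrate the forecasts.
calibrator = LogisticRegression().fit(logit(raw).reshape(-1, 1), outcomes)
calibrated = calibrator.predict_proba(logit(raw).reshape(-1, 1))[:, 1]
```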
Long has mentioned that he's happy to answer further questions about the system by email (to [email protected]), but we're not expecting to post further clarifications here in the interest of time.
I appreciate you being willing to bet here. If you would like to nominate a trusted mutual party as arbitrator, let me know.
We may have a shared misunderstanding of the magnitude of his offer to bet. I thought he meant he would spend $1K USD; it turns out he means 1,000 mana, i.e. $1 😂, which he has already bet since making this comment.
We're pretty sure that contamination from news articles was not an issue in our reported Metaculus evals.
Here are the guardrails we used:
- We inject `before:{date}` into search queries (see the sketch after this list)
- We use news search instead of standard search; this excludes websites like Wikipedia, and news articles are more time-bound, with post-publication edits clearly marked
- We publicly forked newspaper4k to look for the updated time of each article, not merely its creation time, to make sure we filtered correctly on the updated time
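A minimal sketch of the query-level guardrail in the first bullet, assuming a Google-style search backend that honors the `before:` date operator:

```python
def build_news_query(question_keywords: str, forecast_date: str) -> str:
    """Restrict search results to articles indexed before the forecast date."""
    return f"{question_keywords} before:{forecast_date}"

print(build_news_query("Speaker of the House vote", "2023-10-01"))
# -> 'Speaker of the House vote before:2023-10-01'
```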