The ARC Prize Foundation just released ARC-AGI-3. It is the first interactive reasoning benchmark that tests whether AI can actually adapt to novel problems in real time.
The results:
- Gemini 3.1 Pro: 0.37%
- GPT-5.4: 0.26%
- Claude Opus 4.6: 0.25%
- Humans: 100%
Read those numbers again. The most advanced AI systems on Earth, the ones we are told are approaching human-level intelligence, scored less than half a percent on tasks that every human participant solved completely.
What ARC-AGI-3 actually tests
Previous benchmarks test what AI is good at: pattern matching against data similar to what it was trained on. ARC-AGI-3 tests something different: can the system adapt, in real time and through interaction, to a problem it has never seen before?
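To make the difference concrete, here is a toy sketch of what "interactive" means, in contrast to a static question-and-answer benchmark. Everything here is illustrative, not the real ARC-AGI-3 API: the `Env` and agent interfaces are invented, and the task is deliberately trivial. The point is the loop: the agent gets no task description, only observations, and must act, observe, and adapt within a budget of interactions.

```python
# Hypothetical sketch of an interactive evaluation loop.
# Env, evaluate, and sweep_agent are illustrative names, not ARC-AGI-3's actual interface.

class Env:
    """A toy task: the agent must discover which action 'solves' the environment."""
    def __init__(self, target):
        self.target = target

    def step(self, action):
        # The only feedback is whether the last action worked.
        return {"solved": action == self.target}

def evaluate(agent, env, max_steps=10):
    obs = {"solved": False}
    for _ in range(max_steps):
        action = agent(obs)      # agent sees only observations, never a task description
        obs = env.step(action)
        if obs["solved"]:
            return True          # adapted within the interaction budget
    return False                 # ran out of interactions

def sweep_agent():
    """A trivial agent that cycles through actions 0..3 until one works."""
    state = {"i": -1}
    def act(obs):
        state["i"] += 1
        return state["i"] % 4
    return act

print(evaluate(sweep_agent(), Env(target=2)))  # → True
```

A static benchmark asks one question and grades one answer; here the score depends on what the system does with feedback over successive steps. Memorizing a lookup table of question-answer pairs does not help when the answer only emerges through interaction.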
This is what “reasoning” actually means. Not retrieving a memorized pattern. Not applying a learned heuristic. Genuinely figuring something out for the first time.
Humans do this effortlessly. We solve novel problems constantly. We adapt. We improvise. We look at something we have never seen and figure out what to do.
Every frontier model fails at this. Not partially. Almost completely.
Why this matters more than any benchmark you have seen
Most AI benchmarks are gameable. Train on enough data, and the model will have seen something similar to the test. Scores go up. Press releases go out. “Human-level performance achieved.”
ARC-AGI-3 was designed to be ungameable. The problems are novel by construction. You cannot memorize your way to a solution. You have to think.
And the models cannot think. Not in the way that word actually means.
The $850,000 question
There is an $850,000 prize pool for anyone who can build a system that scores 100%, with $700,000 going to the winner. The fact that nobody is close to claiming it tells you where we actually are, versus where the marketing says we are.
What I think about it
I use AI every day. It is extraordinarily useful. It writes code, summarizes research, generates ideas, handles tasks that would take me hours. I am not an AI skeptic.
But I am an honest builder. And the honest truth is: these systems do not reason. They pattern-match at a scale and speed that feels like reasoning. When the pattern is in the training data, the results are magical. When it is not, they score 0.25%.
The gap between what AI can do and what we claim it can do has never been wider. Not because the technology is bad. Because the marketing is too good.
$850K is on the table if you can build a system that actually reasons. arcprize.org