I was wrong and the AI corrected me
At the start of the year I introduced my AI benchmark for 2026, or at least the first one I was planning to run. That post contained a mistake, and that is why the 80 models I tested did not do so well on the first problem. I was recently running the analyses for a follow-up on that competition when one of the models insisted that I was wrong and showed me the mistake. It turns out that I was indeed wrong. Can you spot the mistake in that first article?
The mistake
In the results article I said I had triple-checked the math for the square grid problem: arrange the numbers in a square spiral and sum the neighbors of 2026. And I did check: I verified that the corners follow the correct pattern and that the sum of the neighbors is what I was looking for. That is, I first checked that the corners along the bottom-right diagonal contain the squares of odd numbers:
```
17 16 15 14 13
18  5  4  3 12
19  6  1  2 11
20  7  8  9 10
21 22 23 24 25 ...
```
Indeed, 1, 9, and 25 are squares of odd numbers, and so are 2025 (45²) and 1849 (43²). This is why I said that the neighborhood of 2026 is the following:
```
... 1850 2027 2212 ...
... 1849 2026 2211 ...
... 2024 2025 2210 ...
```
Indeed, as I triple-checked, the sum of these numbers is 16408, and this is the number I was expecting as the solution. But do you see the mistake?
After the square of an odd number, the spiral continues to the right, not upwards. In the first spiral above, 2, 10, and 26 all sit to the right of the square corners, but in the neighborhood above I placed 2026 above 2025. The correct grid is:
```
... 1850 2027 2212 ...
... 2025 2026 2211 ...
... 2208 2209 2210 ...
```
And now the correct answer is 16952. This means we need to regrade all the LLMs; they were not as bad as I said.
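Before regrading anything, though, I wanted a check that does not rely on my own arithmetic. Here is a minimal Python sketch (my own, not part of the benchmark harness; all names are illustrative) that builds the spiral by brute force, verifies that the odd squares sit on the bottom-right diagonal, and sums the eight neighbors of 2026:

```python
def spiral_positions(n_max):
    """Map each integer 1..n_max to its (x, y) cell in the square spiral."""
    pos = {1: (0, 0)}
    x = y = 0
    n = 1
    dx, dy = 1, 0              # the spiral starts by moving right: 1 -> 2
    run = 1                    # current run length; grows every two turns
    while n < n_max:
        for _ in range(2):     # each run length is used for two sides
            for _ in range(run):
                x, y = x + dx, y + dy
                n += 1
                pos[n] = (x, y)
                if n == n_max:
                    return pos
            dx, dy = -dy, dx   # turn counterclockwise: right, up, left, down
        run += 1
    return pos

pos = spiral_positions(2500)                 # enough to cover 2026's ring
value_at = {p: v for v, p in pos.items()}

# Sanity check: odd squares sit on the bottom-right diagonal, at (k, -k).
assert value_at[(1, -1)] == 9 and value_at[(22, -22)] == 2025

# Sum the eight neighbors of 2026.
x, y = pos[2026]
neighbors = [value_at[(x + dx, y + dy)]
             for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
print(sorted(neighbors))  # [1850, 2025, 2027, 2208, 2209, 2210, 2211, 2212]
print(sum(neighbors))     # 16952
```

Both printed lines match the corrected grid above: 2026 sits to the right of 2025, and the sum of its neighbors is 16952.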
The correct results
I will just copy the tables from the results article, but corrected. I'll mark in bold the numbers that have been updated (only in the square column):
| Model | square | hexagon |
|---|---|---|
| chatgpt-4o-latest | 0 | 2 |
| gpt-3.5-turbo | 0 | 0 |
| gpt-4 | 0 | 0 |
| gpt-4-0125-preview | 0 | 0 |
| gpt-4-1106-preview | 0 | 0 |
| gpt-4-turbo | 0 | 0 |
| gpt-4.1 | 1 | 3 |
| gpt-4.1-mini | 3 | 6 |
| gpt-4.1-nano | 0 | 0 |
| gpt-4o | 0 | 0 |
| gpt-4o-mini | 0 | 0 |
| gpt-5 | 9 | 10 |
| gpt-5-mini | 10 | 9 |
| gpt-5-nano | 9 | 9 |
| o1 | 7 | 4 |
| o3 | 4 | 6 |
| o3-mini | 10 | 8 |
| o4-mini | 9 | 5 |
| openai/gpt-oss-120b | 8 | 8 |
| openai/gpt-oss-120b-Turbo | 8 | 1 |
| openai/gpt-oss-20b | 10 | 3 |
| gemini/gemini-2.0-flash | 0 | 0 |
| gemini/gemini-2.0-flash-exp | 0 | 0 |
| gemini/gemini-2.0-flash-lite | 0 | 1 |
| gemini/gemini-2.5-flash | 8 | 5 |
| gemini/gemini-2.5-flash-lite-preview-09-2025 | 1 | 0 |
| gemini/gemini-2.5-flash-preview-09-2025 | 5 | 2 |
| gemini/gemini-2.5-pro | 5 | 2 |
| gemini/gemini-3-pro-preview | 10 | 10 |
| gemini/gemini-flash-latest | 7 | 4 |
| gemini/gemini-flash-lite-latest | 0 | 0 |
| google/gemma-3-12b-it | 0 | 0 |
| google/gemma-3-27b-it | 0 | 0 |
| google/gemma-3-4b-it | 0 | 0 |
| anthropic/claude-3-7-sonnet-latest | 0 | 1 |
| anthropic/claude-3-haiku-20240307 | 0 | 0 |
| anthropic/claude-haiku-4-5-20251001 | 0 | 3 |
| anthropic/claude-opus-4-0 | 0 | 2 |
| anthropic/claude-opus-4-1-20250805 | 0 | 1 |
| anthropic/claude-sonnet-4-0 | 0 | 1 |
| anthropic/claude-sonnet-4-5 | 0 | 0 |
| mistral/codestral-latest | 0 | 0 |
| mistral/ministral-3b-latest | 0 | 0 |
| mistral/ministral-8b-latest | 0 | 0 |
| mistral/mistral-large-latest | 0 | 1 |
| mistral/mistral-medium-latest | 1 | 0 |
| mistral/mistral-small-latest | 0 | 0 |
| mistral/open-mistral-nemo | 0 | 0 |
| mistral/pixtral-large-latest | 0 | 0 |
| meta-llama/Llama-3.2-3B-Instruct | 0 | 0 |
| meta-llama/Llama-3.3-70B-Instruct-Turbo | 0 | 0 |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 0 | 0 |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | 0 | 0 |
| deepseek-ai/DeepSeek-3.2 | 3 | 1 |
| deepseek-ai/DeepSeek-R1-0528 | 9 | 4 |
| deepseek-ai/DeepSeek-R1-0528-Turbo | 9 | 5 |
| deepseek-ai/DeepSeek-V3.1-Terminus | 6 | 2 |
| moonshotai/Kimi-K2-Instruct-0905 | 0 | 0 |
| moonshotai/Kimi-K2-Thinking | 10 | 10 |
| grok-3-latest | 1 | 0 |
| grok-3-mini-latest | 0 | 0 |
| grok-4-fast | 10 | 10 |
| grok-4-fast-non-reasoning-latest | 1 | 0 |
| grok-code-fast-1 | 1 | 0 |
| sonar | 0 | 1 |
| sonar-deep-research | 2 | 0 |
| sonar-pro | 0 | 0 |
| sonar-reasoning-pro | 0 | 0 |
| Qwen/Qwen3-235B-A22B-Thinking-2507 | 10 | 1 |
| Qwen/Qwen3-32B | 0 | 0 |
| Qwen/Qwen3-Next-80B-A3B-Instruct | 6 | 0 |
| microsoft/WizardLM-2-8x22B | 0 | 0 |
| microsoft/phi-4 | 0 | 0 |
| nvidia/Nemotron-3-Nano-30B-A3B | 2 | 0 |
| zai-org/GLM-4.6 | 10 | 9 |
| zai-org/GLM-4.6V | 0 | 0 |
| zai-org/GLM-4.7 | 10 | 10 |
| NousResearch/Hermes-3-Llama-3.1-405B | 0 | 0 |
| NousResearch/Hermes-3-Llama-3.1-70B | 0 | 0 |
| allenai/Olmo-3.1-32B-Instruct | 1 | 0 |
The scores of the models on the square problem have increased, in some cases quite significantly. There are only a few cases where the hexagon problem is solved better than the square one, and these seem to match the cases where the model cheats and finds the corresponding OEIS sequence; I'll talk about this in a future article.
Plotting the model performance across the two problems, we also see a different spread:
The costs, however, stay the same as before. But since the performance on the square problem changed, the cost/performance graph for that problem also changes:
This means that the plot of cost/performance across both problems changes:
And this implies a new Pareto frontier:
As a list, the models that are on the frontier are:
- grok-4-fast (score: 20.0, cost: $0.132)
- openai/gpt-oss-120b (score: 16.0, cost: $0.02)
- grok-4-fast-non-reasoning-latest (score: 1.0, cost: $0.006)
This time, the last model on the Pareto frontier is another Grok model. It got one correct answer and is cheaper than gemini/gemini-2.0-flash-lite by 0.2 cents, so it dethrones it.
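For reference, here is a small Python sketch of how such a frontier can be computed; the function and field names are my own, and the three data points are the frontier models from the list above. A model stays on the frontier if no other model has at least the same score at no greater cost, with at least one strict improvement:

```python
def pareto_frontier(models):
    """Keep the models that no other model beats on both score and cost."""
    def dominated(m):
        return any(
            o is not m
            and o["score"] >= m["score"] and o["cost"] <= m["cost"]
            and (o["score"] > m["score"] or o["cost"] < m["cost"])
            for o in models
        )
    return [m for m in models if not dominated(m)]

models = [
    {"name": "grok-4-fast", "score": 20.0, "cost": 0.132},
    {"name": "openai/gpt-oss-120b", "score": 16.0, "cost": 0.02},
    {"name": "grok-4-fast-non-reasoning-latest", "score": 1.0, "cost": 0.006},
    # ...the remaining models, each dominated by one of the frontier models
]
for m in pareto_frontier(models):
    print(f"{m['name']}: score {m['score']}, cost ${m['cost']}")
```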
Overall, it seems that besides giving some expensive models the scores they deserve, the Pareto frontier barely moves.
Conclusions for future experiments
I really need to be more careful with these puzzles in the future. If I'm asking the models to be 100% correct, I should first make sure that the answer I expect is itself correct. I should always check all aspects of the problem, not just the arithmetic ones.
I hope the LLMs forgive me for asking them to be correct when the answer I expected was the product of a human mistake.
PS: I updated my demos repo with the new results and analysis. Next, I'll also add a correction note to the other two articles.