Mihai's page

Introducing my first benchmark of AI for 2026

Exactly a year ago, I started an AI competition where I tested multiple LLMs on several math problems, with various prompting guides. You can see how each of these models performed by reading all 11 posts, or by jumping straight to the conclusion.

This year, I’m planning to run more than a single benchmark. One lesson I learned last year is that these tests must be easy to evaluate if I want to run more of them, so that is a central design point for this series.

Are you curious to see how the AIs have evolved since last time? Let’s go on.

The problems

One thing I observed last year was that the LLMs really had issues with puzzles that required a 2D grid of numbers. So, a big criterion in designing this benchmark was to exploit this fact, to see if there has been any progress since last year. Since last year I also posted several articles focused on 2D tilings and linear algebra, I wanted this to be the focus of these problems.

Thus, the task I devised is simple: lay out the natural numbers in a spiral on a grid, locate 2026, and do something with its neighborhood. The simpler problem is:

The natural numbers are written on a spiral laid out on a square grid:

… 5 4 3 …
… 6 1 2 …
… 7 8 9 …

What is the sum of all 8 neighbors of the number 2026?

For diversity, I’m also adding a slightly more complex problem where I use a hexagonal grid:

The natural numbers are written on a spiral laid out on a hexagonal grid:

… 14 4 3 10 …
… 15 5 1 2 9 …
… 16 6 7 8 …

What is the sum of all 6 neighbors of the number 2026?

Give it a try: answer these questions without using AI. In fact, before turning to the AIs, I’ll do that myself.

The solutions

To solve these, you have to spot patterns in the rings of each spiral and in their corners. For example, in the square grid, the lower right diagonal contains the squares of the odd numbers.

But wait, 2025 is a perfect square, which was the notable property of last year, giving rise to the saying “have a perfect year, just like it’s a perfect square”. So, 2025 is \(45^2\), which puts it on this diagonal. The next number, 2026, starts a new ring immediately to the right of 2025, so we can quickly locate it:

... 1850 2027 2212 ...
... 2025 2026 2211 ...
... 2208 2209 2210 ...

Here 1850 is \(43^2 + 1\), the first number of the previous ring, and 2209 is \(47^2\), the last number of the ring that 2026 starts. From this, the answer is 16952.
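As a sanity check on the ring pattern, the square spiral can also be generated by brute force. This is a minimal Python sketch, not the code behind the benchmark: the names `square_spiral` and `neighbor_sum` are made up for illustration, and it assumes the spiral steps right from 1 and then turns counter-clockwise, as in the 3×3 excerpt of the problem statement.

```python
import math


def square_spiral(limit):
    """Map each number from 1 to at least `limit` to its (x, y) cell.

    Assumption: the walk starts at the origin, steps right first and then
    turns counter-clockwise (x grows to the right, y grows upwards).
    """
    pos = {1: (0, 0)}
    x = y = 0
    n = 1
    dirs = [(1, 0), (0, 1), (-1, 0), (0, -1)]  # right, up, left, down
    d = 0
    run = 1  # straight run lengths go 1, 1, 2, 2, 3, 3, ...
    while n < limit:
        for _ in range(2):
            dx, dy = dirs[d]
            for _ in range(run):
                x, y = x + dx, y + dy
                n += 1
                pos[n] = (x, y)
            d = (d + 1) % 4
        run += 1
    return pos


def neighbor_sum(target):
    """Sum the 8 numbers surrounding `target` on the square spiral."""
    # Neighbors lie at most one ring further out, so a square whose side is
    # isqrt(target) + 4 is always large enough to contain all of them.
    side = math.isqrt(target) + 4
    pos = square_spiral(side * side)
    number_at = {cell: n for n, cell in pos.items()}
    x, y = pos[target]
    return sum(
        number_at[(x + dx, y + dy)]
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )


print(neighbor_sum(1))     # 44, the sum of 2 through 9
print(neighbor_sum(2026))
```

The small case is easy to verify by hand against the excerpt, which gives some confidence in the turn order before trusting the result for 2026.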

Moving on to the hexagonal puzzle, here we actually have to do a little bit more work. Looking carefully, we observe that the top left corner of ring \(k\) is one more than 3 times a square number (that is, \(3k^2 + 1\)), and 2029 has this form, since \(3 \cdot 26^2 + 1 = 2029\). So, we can build the following grid starting from that corner:

   ...  2188  2187  2186  2185  2184  ...
...  2189  2029  2028  2027  2026  2025 ...
   ...  2030  1876  1875  1874  1873 ...

And from here we can determine that the sum we need is 12168.
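The hexagonal case can be checked the same way. The sketch below is only an illustration under stated assumptions: it uses axial coordinates (a common way to index hexagonal grids, not something the problem prescribes), the names `STEPS`, `hex_spiral`, and `hex_neighbor_sum` are hypothetical, and the walking order is inferred from the excerpt in the problem statement (2 to the right of 1, then counter-clockwise).

```python
# Six neighbor offsets in axial coordinates (a, b): a cell (a, b) lies
# a steps east plus b steps north-east of the centre. They are listed in the
# order a ring is walked counter-clockwise: NE, NW, W, SW, SE, E.
STEPS = [(0, 1), (-1, 1), (-1, 0), (0, -1), (1, -1), (1, 0)]


def hex_spiral(rings):
    """Map each number in the first `rings` rings to its axial cell."""
    pos = {1: (0, 0)}
    n = 1
    for k in range(1, rings + 1):
        a, b = k, -k  # south-east corner of ring k, where the ring ends
        for da, db in STEPS:
            for _ in range(k):  # every side of ring k has k cells
                a, b = a + da, b + db
                n += 1
                pos[n] = (a, b)
    return pos


def hex_neighbor_sum(target):
    """Sum the 6 numbers surrounding `target` on the hexagonal spiral."""
    # Ring k ends at 3k(k+1) + 1; find the ring holding `target`, then
    # generate one extra ring so that every neighbor exists.
    k = 0
    while 3 * k * (k + 1) + 1 < target:
        k += 1
    pos = hex_spiral(k + 1)
    number_at = {cell: n for n, cell in pos.items()}
    a, b = pos[target]
    return sum(number_at[(a + da, b + db)] for da, db in STEPS)


print(hex_spiral(26)[2029])    # (-26, 26), the top left corner of ring 26
print(hex_neighbor_sum(1))     # 27, the sum of 2 through 7
print(hex_neighbor_sum(2026))  # 12168
```

Because the six ring directions are also the six neighbor offsets, the same `STEPS` list serves both for walking the spiral and for summing the neighborhood.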

As follow-up questions: can you see why the hexagonal grid sum has to be smaller? Are there any numbers where the square sum is smaller?

Now, let’s see how the AI performs.

The performance of LLMs

The initial plan was to use the llm library and write code to test each of the AIs, just like I did last year. But I just saw an announcement from the Kaggle team that community benchmarks have launched. So, I decided to do this test on Kaggle instead.

Without further ado, here is a notebook where I tested these problems.

Only a few LLMs are tested here, and each one is prompted only once per problem. Only 4 LLMs provide a correct answer, and none gets a correct answer on both problems. Take this as a preliminary round.
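For reference, a single-prompt check of this kind could be sketched with the llm library roughly as follows. This is not the code from the Kaggle notebook: the helper names `extract_answer`, `grade`, and `ask` are hypothetical, the rule of taking the last integer in the reply is an assumption about how to grade, and any model ID passed to `ask` depends on which plugins and API keys are configured locally.

```python
import re

HEX_PROMPT = """The natural numbers are written on a spiral laid out on a hexagonal grid:

… 14 4 3 10 …
… 15 5 1 2 9 …
… 16 6 7 8 …

What is the sum of all 6 neighbors of the number 2026?"""


def extract_answer(reply):
    """Return the last integer mentioned in a reply, or None if there is none.

    Assumption: the model states its final answer last. Thousands separators
    such as "12,168" are tolerated by dropping commas first, which would
    misread a comma-separated list written without spaces.
    """
    numbers = re.findall(r"\d+", reply.replace(",", ""))
    return int(numbers[-1]) if numbers else None


def grade(reply, expected):
    """True when the extracted answer matches the expected sum."""
    return extract_answer(reply) == expected


def ask(model_id, prompt):
    """Send one prompt to one model through the llm Python API."""
    import llm  # imported lazily so the grading helpers work without it

    model = llm.get_model(model_id)
    return model.prompt(prompt).text()


# Grading works on any string, so it can be tried without calling a model:
print(grade("Adding the six neighbors gives 12,168.", 12168))  # True
```

Calling `grade(ask(model_id, HEX_PROMPT), 12168)` for each configured model would then give one pass or fail per model, which keeps the evaluation as easy as the series intends.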

Turns out, I actually have to do the larger work and use the llm library instead. I will do that over the next few days. For now, I will stop here, given that today is both Eminescu’s birthday and mine – which is why I got the name that rhymes with “it’s me, hi”. But, back to AIs.

Please post, in the comments or on any social media where this gets shared, which LLMs you would like me to test this on. I cannot promise I will get all of them tested, but I will try. See you next time!

