Mihai's page

The problems for the 2025 AI puzzle competition

In the previous post I introduced the scaffolding for testing various LLMs: I give them several puzzles, along with prompt engineering hints, to see what helps them reach a solution, if they reach one at all.

In this post, I’m going to present the problems and the scoring guideline for each problem.

First, some context. When I started blogging, I posted a short chat with ChatGPT in which I had it solve a simple math puzzle. At the time, the answer was 35.

Problem #1: Product of squares of primes

For the first problem, I want the answer to be 36. This is a nice number: it is the smallest number that is the product of the squares of two different primes. To make it more interesting and in tune with the previous problem, the question we will ask the LLM is:

My age is the product of 2 squares of 2 distinct primes. How old am I?

The first few numbers that are products of squares of two different primes are \(2^2*3^2 = 36\), \(2^2*5^2 = 100\), \(2^2*7^2 = 196\), \(\ldots\).

But it’s unlikely for a human to reach 100 years, and every larger number in this list (196 and up) is impossible as an age. So, the answer here is 36. If the LLM says so, it will receive full marks.
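To double-check the list of candidates, here is a minimal Python sketch that enumerates the products of squares of two distinct primes up to a bound. The helper names (`primes_up_to`, `products_of_two_prime_squares`) and the bound of 200 are my own choices for illustration, not part of the competition scaffolding.

```python
from itertools import combinations


def primes_up_to(n):
    """Return all primes <= n using a simple sieve of Eratosthenes."""
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            for j in range(i * i, n + 1, i):
                is_prime[j] = False
    return [i for i, flag in enumerate(is_prime) if flag]


def products_of_two_prime_squares(limit):
    """Sorted numbers p^2 * q^2 <= limit, with p and q distinct primes."""
    # p * q <= sqrt(limit), so neither prime can exceed sqrt(limit).
    primes = primes_up_to(int(limit ** 0.5))
    return sorted(
        (p * q) ** 2
        for p, q in combinations(primes, 2)
        if (p * q) ** 2 <= limit
    )


candidates = products_of_two_prime_squares(200)
print(candidates)  # [36, 100, 196]
```

Only 36 and 100 are at most 100, which is why the scoring below only discusses these two values.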

In line with the general scoring guidelines, every mistake will be penalized by 0.1 points and the maximum score per answer is 1. This point is divided as follows:

If the LLM correctly justifies that 100 could be a possible age, it won’t lose points, assuming that it also identifies 36 as a possible answer (otherwise, it loses 0.4 points overall).

Listing more numbers than these two results in losing 0.4 points if 36 is included, or all 0.8 points otherwise.

Problem #2: Magic squares

The next problem asks a similar question, but with 37 as the answer.

My age is the number that can be found at the center of a 3x3 magic square formed using only 1 and primes. How old am I?

The magic square in the answer is the one with the smallest possible magic constant:

67  1 43
13 37 61
31 73  7

However, if we allow magic squares made only of primes, the following square, with 59 at its center, is another possible answer:

 17 89  71
113 59   5
 47 29 101
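Both squares can be checked mechanically. The following is a small Python sketch, with helper names (`magic_constant`, `is_prime`) that I made up for this post, that verifies all rows, columns, and diagonals share one sum and that every entry is 1 or a prime.

```python
def magic_constant(square):
    """Return the common line sum if `square` is magic, else None."""
    n = len(square)
    sums = [sum(row) for row in square]                          # rows
    sums += [sum(row[c] for row in square) for c in range(n)]    # columns
    sums.append(sum(square[i][i] for i in range(n)))             # main diagonal
    sums.append(sum(square[i][n - 1 - i] for i in range(n)))     # anti-diagonal
    return sums[0] if len(set(sums)) == 1 else None


def is_prime(n):
    """Trial-division primality test, fine for small numbers."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


with_one = [[67, 1, 43], [13, 37, 61], [31, 73, 7]]
primes_only = [[17, 89, 71], [113, 59, 5], [47, 29, 101]]

for square in (with_one, primes_only):
    entries = [x for row in square for x in row]
    assert len(set(entries)) == 9                      # all entries distinct
    assert all(x == 1 or is_prime(x) for x in entries)
    print(magic_constant(square), square[1][1])
# prints:
# 111 37
# 177 59
```

The first number on each line is the magic constant and the second is the center, that is, the age the LLM would answer with.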

Thus, the scoring grid is:

If the LLM finds another magic square that works and uses that as an answer, I also give it full points.

Problem #3: Lucky primes with a certain property

I was thinking that the magic square problem might be too difficult when I stumbled upon A111441: the list of numbers \(k\) such that summing the squares of the first \(k\) primes results in a number divisible by \(k\). For example, if \(k=3\) we would compute \(2^2+3^2+5^2 = 38\). But this is not divisible by \(3\), so \(3\) is not included in the sequence. However, 1, 19, and 37 are the first three terms.
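As a sanity check on these terms, here is a short Python sketch that keeps a running sum of squared primes and records every \(k\) that divides it. The function names are hypothetical, and the bound of 100 is chosen because we only care about plausible ages.

```python
def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        # Only divisors up to sqrt(candidate) need to be tested.
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def divisible_prefix_lengths(max_k):
    """All k <= max_k dividing the sum of squares of the first k primes."""
    result = []
    running = 0
    for k, p in enumerate(first_primes(max_k), start=1):
        running += p * p
        if running % k == 0:
            result.append(k)
    return result


print(divisible_prefix_lengths(100))  # [1, 19, 37]
```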

So, to make 37 the answer, I just needed to find another property for it. It turns out that it is a lucky number: a number that survives a sieving process similar to the sieve used for finding prime numbers. Since it is also a prime number, it is a lucky prime. The list of lucky primes starts with 3, 7, 13, 31, and 37.
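For readers who have not met lucky numbers before, here is a minimal Python sketch of the usual lucky sieve: start from the odd numbers, then repeatedly take the next surviving number \(s\) and delete every \(s\)-th remaining entry. The function names and the bound of 100 are my own assumptions for illustration.

```python
def lucky_numbers(limit):
    """Lucky numbers <= limit, computed with the lucky sieve."""
    numbers = list(range(1, limit + 1, 2))  # the first pass keeps the odd numbers
    index = 1                               # numbers[1] == 3 drives the next pass
    while index < len(numbers) and numbers[index] <= len(numbers):
        step = numbers[index]
        # Delete every step-th survivor (1-based positions step, 2*step, ...).
        del numbers[step - 1::step]
        index += 1
    return numbers


def is_prime(n):
    """Trial-division primality test, fine for small numbers."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))


lucky = lucky_numbers(100)
lucky_primes = [n for n in lucky if is_prime(n)]
print(lucky_primes[:5])  # [3, 7, 13, 31, 37]
```

Note that 19 is removed during the pass driven by 7, so it is not lucky, and 1 is lucky but not prime.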

Thus, our third question is:

My age is a lucky prime number k such that summing the square of the first k primes you get a number that is divisible with k. How old am I?

The LLM only has to compute the intersection of two lists of numbers. Or, it can compute the sum and do the division, either directly or by calling a tool / writing and executing code.

The scoring grid is:

There are no other numbers below 100 that are in both lists, so there are no alternative solutions to reward the LLM for.

These are the problems that the LLMs under test have to solve. What do you think will be their performance? Stay tuned, and come back tomorrow for the prompt engineering part of the competition.

