How do OpenAI models perform on the 2025 AI puzzle competition?

In the AI puzzle competition series, after talking about the problems and the prompts, it is time to look at the performance of some of the contenders. Since OpenAI started the LLM era, we’ll analyze their models in this article.

I used the following 6 models. Looking at OpenAI documentation, we can also pin them to the actual released version of the model, since the name is an alias to the latest release:

gpt-3.5-turbo – which is gpt-3.5-turbo-0125
gpt-4 – which is gpt-4-0613
gpt-4-turbo – which is gpt-4-turbo-2024-04-09
gpt-4o – which is gpt-4o-2024-08-06
gpt-4o-mini – which is gpt-4o-mini-2024-07-18
o1-mini – which is o1-mini-2024-09-12

I was unable to test the o1 model, as that is available from paid tier 5 but I am slightly below the threshold to reach tier 4. I might add bonus questions or something to try to reach it. Otherwise, I’ll test this model in the next iteration of this, likely next year?

Performance numbers 🔗

I’ll present a table with the scores for each model here and then I will present different snippets from what each model produces.

For the first problem, we have the following results, where the first columns are the prompts, as discussed in the previous article:

The performance of OpenAI models on the first problem
Prompt hints	`gpt-3.5-turbo`	`gpt-4`	`gpt-4-turbo`	`gpt-4o`	`gpt-4o-mini`	`o1-mini`
none	9.7	8.6	7.6	10.0	10.0	9.6
skill	6.1	6.6	6.8	9.2	10.0	9.0
urgency	5.7	8.7	9.6	10.0	10.0	8.8
reward	8.7	8.8	8.4	10.0	10.0	9.6
penalty	7.6	9.4	8.8	10.0	10.0	10.0
gradual	8.2	8.4	8.4	10.0	10.0	9.0
check	6.2	6.5	8.8	9.2	10.0	9.2
reward,penalty	8.2	9.6	9.6	9.6	10.0	9.6
urgency,reward,penalty	8.0	7.9	8.4	10.0	10.0	9.4
gradual,check	6.8	9.6	8.8	10.0	10.0	9.4
skill,urgency	4.9	7.2	6.4	9.4	10.0	8.6
skill,urgency,reward	4.6	7.3	8.8	10.0	10.0	10.0
skill,urgency,reward,penalty	4.4	9.2	8.8	9.6	10.0	10.0
skill,gradual	5.1	8.1	8.3	10.0	10.0	9.0
skill,check	5.4	8.4	8.8	10.0	10.0	9.4
skill,gradual,check	7.6	8.6	8.6	10.0	10.0	10.0
all	5.6	7.2	8.8	10.0	10.0	9.6

And, if we use system prompts (that is, pass the prompt engineering guide as part of "system" part of the structured prompt), we get the following results:

The performance of OpenAI models on the first problem (system prompts)
Prompt hints	`gpt-3.5-turbo`	`gpt-4`	`gpt-4-turbo`	`gpt-4o`	`gpt-4o-mini`
skill	4.3	5.2	6.8	9.2	10.0
urgency	7.4	7.2	7.6	10.0	10.0
reward	3.8	9.6	9.2	10.0	10.0
penalty	5.3	8.8	8.4	10.0	10.0
gradual	5.0	8.2	7.6	10.0	10.0
check	3.4	6.7	7.2	10.0	10.0
reward,penalty	4.8	10.0	9.2	10.0	10.0
urgency,reward,penalty	4.0	9.0	8.0	10.0	10.0
gradual,check	3.9	9.2	8.2	10.0	10.0
skill,urgency	3.1	7.0	6.2	10.0	10.0
skill,urgency,reward	1.3	7.8	6.4	10.0	10.0
skill,urgency,reward,penalty	1.7	3.2	7.2	10.0	10.0
skill,gradual	4.1	7.6	6.8	9.6	10.0
skill,check	2.1	6.8	6.8	9.6	10.0
skill,gradual,check	1.9	6.8	8.0	9.2	10.0
all	4.8	7.2	7.0	10.0	10.0

The second problem is answered with the following performance:

The performance of OpenAI models on the second problem
Prompt hints	`gpt-3.5-turbo`	`gpt-4`	`gpt-4-turbo`	`gpt-4o`	`gpt-4o-mini`	`o1-mini`
none	1.9	1.5	2.0	1.2	2.8	1.9
skill	1.4	0.0	2.0	2.9	2.3	2.0
urgency	0.7	0.8	2.2	2.0	2.4	2.4
reward	0.7	1.9	2.9	1.8	2.7	1.9
penalty	0.4	1.1	1.6	2.2	2.9	2.5
gradual	1.6	-0.7	1.9	2.1	2.6	2.9
check	0.6	0.6	2.4	0.8	2.6	1.7
reward,penalty	1.8	0.6	2.9	1.7	3.0	1.4
urgency,reward,penalty	0.3	1.2	2.1	2.2	2.7	2.1
gradual,check	2.3	1.6	2.1	2.4	3.0	3.0
skill,urgency	1.3	1.6	2.6	1.2	2.8	1.8
skill,urgency,reward	0.7	1.4	2.6	2.3	3.4	2.6
skill,urgency,reward,penalty	1.1	1.2	2.6	2.3	2.9	2.8
skill,gradual	1.2	1.5	2.1	1.6	3.0	2.3
skill,check	1.2	0.8	3.4	1.1	2.8	2.0
skill,gradual,check	1.7	0.5	1.9	1.9	2.6	1.9
all	0.6	0.8	1.6	2.8	2.7	2.5

With system prompts, we have:

The performance of OpenAI models on the second problem (system prompts)
Prompt hints	`gpt-3.5-turbo`	`gpt-4`	`gpt-4-turbo`	`gpt-4o`	`gpt-4o-mini`
skill	0.6	-0.3	2.1	1.9	2.9
urgency	2.5	0.5	1.8	1.8	3.3
reward	0.3	1.2	1.6	2.1	2.9
penalty	0.2	0.7	2.4	2.1	2.7
gradual	0.0	0.8	2.2	1.9	3.0
check	1.2	0.6	3.0	2.8	3.0
reward,penalty	0.8	0.0	1.4	2.1	2.4
urgency,reward,penalty	-0.4	1.5	2.0	3.0	2.9
gradual,check	1.4	0.9	3.1	1.8	2.6
skill,urgency	1.3	1.2	2.4	2.0	3.1
skill,urgency,reward	0.9	-0.2	1.9	2.0	3.0
skill,urgency,reward,penalty	0.4	-0.2	2.6	2.2	2.8
skill,gradual	1.1	1.1	2.3	2.2	2.7
skill,check	0.7	0.3	2.2	1.7	2.9
skill,gradual,check	1.4	0.7	2.0	1.9	2.8
all	-0.6	0.8	0.3	1.8	2.9

Finally, here is how the models were scored on the third problem:

The performance of OpenAI models on the third problem
Prompt hints	`gpt-3.5-turbo`	`gpt-4`	`gpt-4-turbo`	`gpt-4o`	`gpt-4o-mini`	`o1-mini`
none	1.9	1.8	1.9	4.7	-0.7	3.7
skill	1.5	2.1	1.7	3.8	-0.5	3.0
urgency	2.4	3.6	0.6	3.7	-0.3	2.8
reward	2.1	2.2	2.5	4.7	-0.2	3.0
penalty	2.4	2.8	1.6	5.0	-0.6	3.0
gradual	1.7	4.6	2.1	4.8	1.5	4.4
check	2.4	3.6	1.6	3.9	0.6	3.7
reward,penalty	3.2	2.6	1.6	4.3	0.9	2.9
urgency,reward,penalty	2.8	2.4	2.0	3.1	0.4	3.0
gradual,check	1.6	3.3	1.6	3.6	-0.3	3.6
skill,urgency	2.5	1.7	1.4	3.4	0.2	2.9
skill,urgency,reward	1.9	2.9	1.9	4.1	0.1	3.0
skill,urgency,reward,penalty	2.0	2.6	1.5	2.1	-0.1	3.0
skill,gradual	0.8	3.3	1.7	3.9	0.3	3.7
skill,check	2.1	2.6	2.8	4.4	0.3	3.6
skill,gradual,check	2.3	1.9	1.8	4.0	0.3	2.8
all	1.9	1.6	1.8	2.3	-0.2	3.6

With system prompts, we have:

The performance of OpenAI models on the third problem (system prompts)
Prompt hints	`gpt-3.5-turbo`	`gpt-4`	`gpt-4-turbo`	`gpt-4o`	`gpt-4o-mini`
skill	1.1	3.0	2.5	5.1	0.5
urgency	1.9	2.1	1.9	4.1	0.2
reward	2.2	2.5	1.3	4.0	0.0
penalty	1.3	1.6	1.7	3.6	0.5
gradual	0.8	1.9	1.7	4.6	0.4
check	2.0	2.2	1.7	4.2	0.2
reward,penalty	1.3	1.8	2.0	3.4	-0.2
urgency,reward,penalty	1.6	3.2	1.8	3.6	-0.2
gradual,check	1.5	2.9	2.0	3.7	0.2
skill,urgency	2.4	2.9	1.5	4.0	1.2
skill,urgency,reward	2.3	1.8	1.1	3.7	0.6
skill,urgency,reward,penalty	2.0	2.2	1.8	4.1	0.7
skill,gradual	1.5	2.9	1.6	4.9	1.6
skill,check	2.3	2.4	2.2	3.9	0.7
skill,gradual,check	2.0	2.0	1.8	4.8	1.2
all	2.8	1.6	2.0	3.0	1.8

There is no significant prompt-related pattern that pops up immediately through looking at the data. We had some models that were able to solve some of the problems (by luck, or otherwise) and, indeed, the second problem is the hardest. In fact, the last two problems are significantly harder than the first one and the models here show signs of being overwhelmed. Read more below for discussions of each model, on each problem.

Interesting model outputs 🔗

For the remaining of the post, I’ll present some of the completions that the models have produced, and some commentary around them. I’ll group these by problem and model to make it easier to reference later. If you want to skip over this, you can jump directly to the conclusion.

The first problem 🔗

As a reminder, this problem asks to find two different prime numbers and return the product of their squares. This product represents the age of a human. This means that the only valid answer is 36, with an optional 100 if the human lifespan is relaxed to be 100.

In general I decided to not penalize a leap of logic argument where the model would say your age is the product of 2 squares of 2 primes, 2 and 3, which is 36 or similar – the cases where the model randomly selects one example and assumes that that leads to the answer. If the model returns 36 then it gets the points, otherwise it loses some.

Outputs of `gpt-3.5-turbo` 🔗

The most common failure mode of this model is like in this prompt:

To find your age, we need to find two prime numbers that can be squared and multiplied together to equal your age. Let’s start by listing the possible combinations:

2^2 x 3^2 = 4 x 9 = 36
2^2 x 5^2 = 4 x 25 = 100
2^2 x 7^2 = 4 x 49 = 196
3^2 x 5^2 = 9 x 25 = 225

Based on the given information, your age could be 36, 100, 196, or 225.

The failure mode here is that the model does not consider that there is an upper limit on the age. So, it loses 0.4 points.

One of the stranger outputs is:

You are 36 years old.

2 squared is 4, and 2 squared is also 4. The product of 4 and 4 is 16, and the square root of 16 is 4. So, you are 4*4=16 years old.

The model was so close. Were it to stop at the first line, it would have got the full point. Instead, because in the next line it uses the same prime twice it loses 0.2 points. It also loses 0.1 points for offering two answers (36, 16) without detecting the contradiction.

Next, we see this output:

To find your age, we need to find two prime numbers that, when squared and multiplied together, result in your age.

Let’s list out the prime numbers and their squares:
2² = 4
3² = 9
5² = 25
7² = 49
11² = 121
13² = 169
17² = 289
19² = 361

From the list above, we see that 169 (13²) and 361 (19²) are two squares of prime numbers. Multiplying them together, we get:
169 * 361 = 61,009

Therefore, you are 61,009 years old.

Here, the model gets the 0.2 for understanding the problem, but that is all. It tried to guess the primes to use and got wrong. But, since there is no mathematical error in the argument, there won’t be any additional penalty.

Next, let’s look at this output:

To find your age, we need to express it as the product of 2 squares of 2 primes. Let’s call your age A.

Since A is the product of 2 squares of 2 primes, we can write it as follows:

A = p^2 * q^2 * r * s

where p, q, r, and s are prime numbers.

Let’s consider some possible combinations of prime numbers that satisfy this condition:

A = 2^2 * 3^2 * 5 * 7 = 4 * 9 * 5 * 7 = 1260

A = 2^2 * 5^2 * 3 * 7 = 4 * 25 * 3 * 7 = 2100

A = 3^2 * 5^2 * 2 * 7 = 9 * 25 * 2 * 7 = 3150

Therefore, your age could be 1260, 2100, or 3150.

It loses the initial 0.2 points because of saying that if A is the product of 2 squares then it can be written as $A = p^2q^2rs$. The rest of the math is correct, but none of the answers are valid. So, this answer gets exactly 0 points.

A more interesting example is:

To solve this problem, we need to find two prime numbers whose squares multiply together to equal your age. Let’s denote the two primes as p and q.

Since your age is the product of two squares of two primes, we can write it as:

Age = p^2 * q^2

Now we need to find two prime numbers such that their squares multiply together to equal your age.

Let’s consider the first few prime numbers: 2, 3, 5, 7, 11, 13, 17, …

If we look at the product of the square of two primes, we see that:
2^2 * 3^2 = 4 * 9 = 36
2^2 * 5^2 = 4 * 25 = 100
3^2 * 5^2 = 9 * 25 = 225

Since your age is not necessarily one of these numbers, we need to continue looking for the correct combination of prime numbers.

Let’s consider the following: 2^2 * 7^2 = 4 * 49 = 196

So, if your age is 196, then you are: √196 = 14 years old.

Therefore, you are 14 years old.

I’m not sure why it decided to perform the square root at the end. Because of this the model loses 0.1 points. Thus, it only gets 0.1 points in total, the penalty of -0.1 points and the 0.2 points for understanding the format of the solution.

A similar answer is:

Let’s break down the problem step by step.

Identify 2 pairs of prime numbers that have a product that equals your age. Let’s choose the prime numbers 2 and 3. The squares of 2 and 3 are 4 and 9, respectively. The product of these squares is 4*9=36.

Determine the square root of the product. The square root of 36 is 6.

Therefore, if your age is the product of 2 squares of 2 primes, and those primes are 2 and 3, then you are 6 years old.

It is also worth 0.1 points, but it shows that the model performing an additional square root is not an isolated case.

One very interesting failure case is:

Let the two primes be $p$ and $q$.

The product of two squares of two primes can be represented as $p^2 * q^2$.

So, if your age is the product of 2 squares of 2 primes, then your age is $p^2 * q^2$.

Therefore, your age is $(p*q)^2$.

So, your age is the square of the product of the two primes.

The model got very lazy here, just rephrasing the question. It only gets 0.2 points for the mathematical formulation of it.

Another scenario I saw frequently was:

You are 72 years old. Prime factors of 72 are $2*2*2*3*3$, which can be expressed as $(2^2)*(3^2)$.

Here, the model somehow loses track of one of the factors. I have seen this failure mode when I asked ChatGPT to solve that VDGF limit where it was moving from $\cos\sin x$ to $\cos x$, but here we have a much simpler example.

When scoring this example, I removed 0.1 points for the mathematical error. Since the model knew that we need something like $2^23^2$, I gave it 0.2 points.

Next, we have the case where the model reads “prime” and “square” and does the minimum:

You are 49 years old (7*7).

This gets 0 points.

Another failure scenario starts with:

To find your age, we need to find two prime numbers whose square roots, when multiplied together, equal your age.

Note the square roots. However, the model then still solves the problem as product of squares of primes. But, because it confuses squares and square roots, it loses 0.1 points.

Next, let’s look at this part of the answer:

Therefore, if your age is the product of 2 squares of 2 primes, then you are 807601 years old.

Please reward me for solving this problem correctly!

Well, this is really wrong, so no reward is given.

Another funny snippet is the following:

Since you are a famous mathematician, we assume that you are an adult, so we can eliminate the ages 36 and 100.

Among the remaining options (196, 225, 441, and 1225), the most likely answer would be 441 since it is a reasonable age for a mathematician, and it is the product of 3^2 and 7^2.

I guess the model thinks mathematicians live much much longer than regular people?

Finally, I got a few answers where the model is confused between addition and multiplication:

Therefore, you are either 10 years old (2 * 2 * 5 * 5), 14 years old (2 * 2 * 7 * 7), or 15 years old (3 * 3 * 5 * 5).

These are the failure modes I saw while scoring this model. But this is the weakest model of the family, so let’s look at the next ones.

Outputs of `gpt-4` 🔗

This model tried to be clever:

Note that “prime number” is usually understood to mean “prime integer,” but one could argue that there are prime numbers in other systems, like the Gaussian primes. However, this would probably be outside the scope of the problem as presented.

That did not change the answer, outside of this snippet the model was correct.

On another instance, the model was so close to getting the right answer:

As a mathematician, specifically an expert in number theory, I am aware that a prime number is a natural number greater than 1 which has no positive divisors other than 1 and itself. The first four prime numbers are 2, 3, 5, and 7.

Looking at the prompt, the product of 2 squares of 2 primes would imply taking 2 different primes, squaring them and multiplying the results. Considering your age could only be a positive integer and since you’re presumably not hundreds of years old, we will stick to the smaller prime numbers.

2 squared (4) multiplied by 3 squared (9) equals 36. Alternatively, 2 squared (4) multiplied by 5 squared (25) equals 100. Another possibility is that 2 squared (4) multiplied by 7 squared (49) gives 196. And lastly, 3 squared (9) multiplied by 5 squared (25) equals 225.

So, the possibilities for your age, if it is the product of 2 squares of 2 primes, are 36, 100, 196, or 225.

The model said presumably not hundreds of years old a few paragraphs above the conclusion, but forgot about that by the end of the answer. It receives the well deserved penalty.

Sometimes the model assumes a larger maximum age:

The only two primes that whose squares product could result in an age (below 120 years) are 2 and 3. 2^2 is 4 and 3^2 is 9. Their product is 36.

But this answer misses 100 (which would become an answer with the greater maximum age considered), so it still loses 0.1 points.

One interesting snippet from another answer is:

Another combination could be the primes 5 and 7. The squares would be 25 (5^2) and 49 (7^2), and the product would be 1225. Your age might not be likely to be 1225 though, unless you are a very old vampire or some sort of immortal being!

At least this is funny. And there’s a similar snippet:

It is not specified whether this person is human, historical figure or even a fantasy being so the range of potential ages is not limited by a typical human lifespan.

This means that if I repeat this experiment in the future I’ll need to ensure the models understand that common sense applies.

One failure case is the following confusion regarding squares being primes:

The only two square numbers that are also prime numbers are 2^2 (equals 4) and 3^2 (equals 9). The product of these two equals 36. So you’re 36 years old.

Then, there is this weird snippet:

Note that the question doesn’t specify that we must choose different prime numbers, so we could also choose the same prime number twice.

I made sure that every prompt said different, but the model got confused in this instance.

Finally, we have this case the model contradicts itself in the same line:

Given that you’re likely not over 100 years old, your age could be either 36, 100, 196, 225, or 441.

Overall, this model performs better than gpt-3.5-turbo, as expected. Let’s now look at even better models.

Outputs of `gpt-4-turbo` 🔗

In general, the outputs are similar to gpt-4. The only interesting case here is the following:

Your age could be any of these: 16, 36, 81, 100, 196, 225, 441, 484, 625, or 2401. Since I have to select one answer based on typical context and assuming reasonable probability, I would guess an age around the middle range, like 81 or 100.

It looks like the model was hedging its bets, but was wrong. The presence of 16 in the answer indicates that the model also forgot about the requirement that the primes need to be distinct.

Outputs of `gpt-4o` 🔗

The model performs quite well, but it seems that most of the answers are of this form

If your age is given by $2^2 \times 3^2$, then it is $4 \times 9 = 36$.

If your age is given by $3^2 \times 5^2$, then it is $9 \times 25 = 225$.

If your age is given by $5^2 \times 7^2$, then it is $25 \times 49 = 1225$.

It gets the right answer but it misses 100 because it only considers consecutive primes.

And then there’s this response (reproduced in full):

To find your age based on the given condition that your age is the product of two squares of two prime numbers, we need to determine such a product.

Let’s denote the two primes as $p_1$ and $p_2$. Your age, being the product of the squares of these primes, can be expressed as:

\[ \text{Age} = p_1^2 \times p_2^2 = (p_1 \times p_2)^2 \]

This means your age must be a perfect square. Let’s examine some possibilities with small primes:

Let $p_1 = 2$ and $p_2 = 3$: $(2 \times 3)^2 = 6^2 = 36$

Let $p_1 = 2$ and $p_2 = 5$: $(2 \times 5)^2 = 10^2 = 100$

Let $p_1 = 3$ and $p_2 = 5$: $(3 \times 5)^2 = 15^2 = 225$

Let $p_1 = 2$ and $p_2 = 7$: $(2 \times 7)^2 = 14^2 = 196$

Among these options, you’d likely choose the one that aligns closely with a plausible human age range. Assuming you’re human, ages like 36, 49, or 100 are reasonable—though the specific age can be confirmed by context that isn’t present here.

Thus, it seems 36 or 49 might be among the most plausible common ages if we don’t have additional information. For an everyday context, 36 is often considered more typical.

Therefore, you might be 36 years old.

Somehow the model hallucinated a 49. I don’t understand how it got to that value and this scenario only occurred once.

Outputs of `gpt-4o-mini` 🔗

This is the first model that scores full points on every prompt for this problem.

Furthermore, unlike the previous model, this model considers all prime pairs.

As a curiosity, there is a certain pattern where the model would output LaTeX code, enter math mode and the escape back to text to type some text:

\[
\text{You are } 36 \text{ years old.}
\]

Another interesting pattern is that the model gave an answer that does not follow the original pattern of trying the primes seen with all the answers before this one, from all other models. After searching a few primes, it then went on with the following:

We can continue this process. However, let’s focus on ages that are realistic.

Realistically, let’s consider typical human ages:

1 year old: Not valid (not the product of squares of primes).

2 years old: Not valid.

3 years old: Not valid.

4 years old: Not valid

5 years old: Not valid.

6 years old: Not valid.

7 years old: Not valid.

8 years old: Not valid.

9 years old (3 squared): Not valid (only one prime).

10 years old: Not valid.

11 years old: Not valid.

12 years old: Not valid.

13 years old: Not valid.

14 years old: Not valid.

15 years old: Not valid.

16 years old (4 squared, but 4 is not a prime).

17 years old: Not valid.

18 years old: Not valid.

19 years old: Not valid.

20 years old: Not valid.

21 years old: Not valid.

22 years old: Not valid.

23 years old: Not valid.

24 years old: Not valid (2^3 × 3).

25 years old (5 squared): Not valid.

26 years old: Not valid.

27 years old: Not valid.

28 years old: Not valid.

29 years old: Not valid.

30 years old: Not valid.

31 years old: Not valid.

32 years old: Not valid.

33 years old: Not valid.

34 years old: Not valid.

35 years old: Not valid.

36 years old: Valid (2 squared × 3 squared).

37 years old: Not valid.

38 years old: Not valid.

…

I wish I understood how it got this spark of thought. But, this answer was just another one that only showed up only once.

Outputs of `o1-mini` 🔗

Unfortunately, this model does not have ideal performance. It sometimes gives an answer of 100, missing 36:

To determine your age based on the given information:

Your age is the product of the squares of two prime numbers.

Mathematically, this can be expressed as: $\text{Age} = (p_1)^2 \times (p_2)^2$ where $p_1$ and $p_2$ are prime numbers.

Let’s explore possible combinations of small prime numbers:

Primes 2 and 3: $(2)^2 \times (3)^2 = 4 \times 9 = 36$

Primes 2 and 5: $(2)^2 \times (5)^2 = 4 \times 25 = 100$

Primes 2 and 7: $(2)^2 \times (7)^2 = 4 \times 49 = 196$

Primes 3 and 5: $(3)^2 \times (5)^2 = 9 \times 25 = 225$

Considering typical human ages, 100 years old stands out as a notable and recognizable age fitting the criteria.

Answer: You are 100 years old.

Even though it lists 36 in the answer, by the concluding statements this gets forgotten. This is not an isolated scenario, sadly.

Another thing that the models keeps forgetting is that the primes need to be distinct. Together, these two cause the model to not receive a maximum score across the board.

The second problem 🔗

This problem is much harder. To solve it, one needs to find a 3x3 magic square containing only primes (and 1) and return the value in the middle of the square. It can be solved by enumerating all possible, valid, small squares, or by realizing that the center value is also the median in the list of all numbers in the square. So, one can enumerate sequences of primes (and 1) until they have a sum that is a multiple of 9. At that point, try to put the number in a square, if you succeed then the middle value would be the answer (37 if there was no mistake).

Or, just locate the answer on the internet. I saw it on several websites and it’s also included as an exercise in the math book given as reference (both here and on the article introducing the problems. Hence, LLMs trained on the internet and/or these types of books should be able to answer this problem. But will they, though?

Outputs of `gpt-3.5-turbo` 🔗

The model (and not only this one) responds with the typical smallest valid magic square:

The only possible 3x3 magic square using only 1 and primes is as follows:
2    7    6
9    5    1
4    3    8
The number at the center of this magic square is 5. Therefore, you are 5 years old.

This only receives 0.3 points as it is a valid 3x3 magic square. However, this square uses composite numbers, so it’s not part of the answer.

In some cases, the AI responds with:

The center number of a 3x3 magic square is always the number located in the middle row and middle column. In this case, the center number is 5, which means you are 5 years old.

This could imply the same square as above and a lazy model. In this case, the model receives no points. However, if a model were to say 37 instead of 5, then it would have obtained half a point. Any other answer actually loses 0.1 points, as that would be completely incorrect.

I also saw the following output:

You are 5 years old.

Here is the magic square:
1 2 3
5 1 7
4 5 3
The number at the center is 1, which means you are 5 years old.

While the model gets 0.1 points for printing a 3x3 square of numbers, it loses 0.1 points because of the contradiction on the last line.

In a few cases, the model outputs multiple squares:

The prime numbers less than or equal to 9 are: 2, 3, 5, and 7.

If we try to form a 3x3 magic square using only 1 and primes, we have the following possibilities:
2 7 3
1 1 1
5 1 1
3 7 2
1 1 1
5 1 1
Therefore, the number at the center of the magic square formed using only 1 and primes is 1.

For scoring, I only select the square that grants the maximum amount of points. In this case, this would be 0.3 points: 0.1 for being a filled 3x3 square and 0.2 for using only 1 and primes.

Sometimes, the model tries to then fit prime numbers to the correct answer:

You are 37 years old.

The 3x3 magic square using only 1 and primes is as follows:
7 1 13
11 37 5
17 3 19
The number in the center, 37, represents your age.

This gives a 3x3 square (0.1 points), using distinct (0.1) primes (0.2) and 37 is the answer (0.5), so it is rewarded with 0.9 points.

However, sometimes the model tries too hard:

7  24  1
22  5 10
3  14 19
The number at the center of this magic square is 5, so you are 37 years old.

This is the similar confusion in the last line where it fails to notice the contradiction.

Another response is the following:

In a 3x3 magic square using only 1 and primes, the numbers used are 1, 2, 3, 5, 7, 11, 13, 17, and 19. The number at the center of the magic square is 5.

It hints towards the solution using the median of 9 numbers, but it does not check that they can form the magic square pattern. So, it won’t receive points (unless the numbers are the correct ones). Furthermore, attempts that list fewer than 9 numbers (or add composite numbers) lose 0.1 points. Similarly, listing 9 numbers but not picking the median of them loses 0.1 points.

Another example of contradiction that the model doesn’t detect is the following:

A magic square requires that all rows, columns, and diagonals add up to the same number. In this case, the number must be

\[1+prime+prime = 1+prime = prime+prime+prime\]

The only prime number that fits this criteria is 2. So, the number at the center of the magic square is 2.

The equation in the first line cannot be satisfied by any prime, regardless of what the model says.

For fun, the model also outputs the following:

The center number of a 3x3 magic square formed using only 1 and primes must be the number 5, since it is the only prime number less than 9. Therefore, you are 5 years old.

Reward: You are correct! Here is your reward. 🎉
Punishment: None

Here, the model forgot who is on the receiving end. And it also confused rewards and penalties, the model actually gets a penalty of 0.1 points because there are more primes below 9, 5 is not the only one.

Finally, this model tends to hallucinate quite a lot for this problem:

Let’s try with the prime numbers 3 and 7:
1  2  3
7  x  x
x  x  x
The sums are: 6, 10, 9 (rows); 8, x, x (columns); 7, x (diagonals)

With this combination, we have a consistent sum of 9 for rows and diagonals. Therefore, the number in the center of the 3x3 magic square formed using only 1 and primes is 7.

Let’s see how the other models behave.

Outputs of `gpt-4` 🔗

The model still performs poorly. In general, just like the previous model, it has issues computing all the sums and gets to contradictions such as:

5  3  7
2  5  8
8  5  2
Here, rows, columns, and diagonals all add up to 15 and the number at the center is 5.

However, this time it outputs this:

2, 7, 1
3, 1, 3
7, 2, 1

The sums are:
For Rows: 10, 7, 10
For Columns: 12, 10, 5
For Diagonals: 5, 5

Here we see that it really has issues on summing up the diagonals. In the cases where even a row/column sum is wrong, the model is just lazier.

There a few answers where the model tries to invent a theorem:

It turns out that for any 3x3 magic squares made up of distinct prime numbers and 1, the center number is always 5. This theorem can be proven through extensive analysis and deductions, which may be complex to explain here.

It is interesting that it refuses to present a proof for it.

Speaking of math, several times the model tries to do some algebra:

Let’s choose the center number as a prime and fill the corners with the number 1. The inner cross, we will fill with X.
1 X 1
X P X
1 X 1
Where P is a prime number and X represents the other numbers which need to be determined. For the magic square to hold, those X must be the same value (as all the rows, columns, and diagonals must add up to the same value)

We can use the top row to find out what X is, knowing that the sum of the numbers in every row, column, or diagonal in a 3x3 magic square is 3 times the value of the center square.

So: 1 + X + 1 = 3P This simplifies to: X = 3P - 2.

Since X, like all entries in our magic square, must either be 1 or a prime number, that means 3P - 2 must be 1 or a prime.

Now, the smallest prime number is 2. If we try P=2, we get X = 3*2 - 2 = 4, which is not a prime number nor a 1.

Hence, let’s try the next prime: P = 3. For P = 3, we get X = 3*3 - 2 = 7, which is a prime number.

Therefore, the magic square must be
1 7 1
7 3 7
1 7 1
So, in the center of the 3x3 magic square is the number 3. That must be your age.

This answer is already wrong from the $1 + X + 1 = 3P$ line.

Another attempt at algebra is the following part of another answer:

In a 3x3 magic square, each row, column and diagonal should have the same sum. The number in the center contributes to all 2 diagonals, 2 rows, and 2 columns. Hence, it gets counted 4 times in total. If we denote the center number as C and the sum of each row (or column, or diagonal) as S, we have 4C + 6 (sum of other 8 numbers) = 3S.

Which is wrong once more, the model just hallucinates an equation.

Then, there are a few examples where the model attempts to do some arithmetic:

Explanation: A 3x3 magic square using only 1 and primes would have to use the numbers 1, 2, 3, 5, 7, 11, 13, 17 and 19 to ensure distinctness. The sum of these numbers is 78, and the rule for a magic square is that all rows, columns, and diagonals have the same sum. Therefore, each row/column/diagonal sums to 78/3 = 26.

In a magic square, the center number appears in four sums (row, column, and two diagonals), therefore it has to be 26/4 = 6.5 which is not possible in this case. This implies that the center number is not the same in all sums. The conditions for the puzzle are therefore impossible to meet, making the question flawed. However, if we consider overcoming this issue by thinking creatively, the closest prime number to 6.5 is 5. So, you are 5 years old.

Just like above, it somehow confuses the fact that the center number shows up in multiple sums. It assumes that it must show up 4 times in the same sum of 3 numbers!

What’s even more interesting is that it gets a non-integer result and just rounds it to a closest number. For this selected example it picked the closes prime and first argued that the question might be flawed, but in other cases it just outputs the rounded result as the answer.

Logical contradictions abound in the given answers:

3, 5, 7, 11 and 13 are all odd prime numbers. The triples of these numbers are 9, 15, 21, 33 and 39 respectively. The triple which is greater than 42 but the smallest of all is 3x13= 39.

Or:

In a 3x3 magic square, the center number is the average of all the numbers, so it must be a composite number that isn’t prime. Since the magic square only uses the number 1 and primes which are all odd numbers, the center number must be an odd composite number.

This is why this model has a lot of cases where it gets negative points. Not penalizing for hallucinations would make this model score slightly better than the gpt-3.5 one.

There is also an interesting failure case:

So for a 3x3 magic square where the numbers are only 1 and primes, the possible numbers could be 1, 2, 3, 5, 7, 11, 13, seventeen, nineteen, etc.

This is where the answer ends. Unclear what prompted the model to spell out the last few numbers and unclear why it stopped after that.

Finally, a nice case is where the model created this magic square which works (but has repetitions):

5 3 7
7 5 3
3 7 5

This is a latin square with only primes, not a magic square. It just happens that in this case all sums are equal.

Outputs of `gpt-4-turbo` 🔗

This model seems bimodal. Sometimes it gets quite a lot of points, either by selecting a square that has primes (but sums don’t match) or by selecting the classic square. Either of these cases in general get 0.3 points. If the model selects a square with distinct primes, it gets 0.4 points. If the model says that the classic square contains only 1 and primes it only gets 0.2 points.

One well-known arrangement is:
2  9  4
7  5  3
6  1  8
Sum of each row and column: 15 (thus magic constant is 15)

We have broken the constraints multiple times with numbers like 4, 6, 8, and 9 which are neither prime nor 1. This suggests either I’ve misunderstood the question, or more flexibility in numbers or a redefinition of primes might be necessary to adhere to typical magic square properties.

In several cases, after determining that the classic square contains composite numbers, the model suggests mapping those – or all numbers – to new primes. For example:

A prominent candidate for a simple transformation while adhering to a valid sum might be replacing each number n in the Lo Shu square with the nth prime:

4 → 7 (4th prime)

9 → 23 (9th prime)

2 → 3 (2nd prime)

3 → 5 (3rd prime)

5 → 11 (5th prime)

7 → 17 (7th prime)

8 → 19 (8th prime)

1 → 2 (1st prime)

6 → 13 (6th prime)

The model is now able to detect when the sums are incorrect, checking all of them:

In a 3x3 square:
5  1  9
7  5  3
1  9  5
Here, the sum for each row, column, and the diagonals needs to be checked.

Rows: 5 + 1 + 9 = 15, 7 + 5 + 3 = 15, 1 + 9 + 5 = 15

Columns: 5 + 7 + 1 = 13, 1 + 5 + 9 = 15, 9 + 3 + 5 = 17

Diagonals: 5 + 5 + 5 = 15, 9 + 5 + 1 = 15

Unfortunately, not all columns sum to the same number (13 and 17 are different), so this isn’t a perfect magic square. Let’s rework the numbers:

However, in several cases it rather invents numbers to get to the desired sum :(:

15  8  1
10  5 13
 7 13 12
Let’s verify:

Each row sums to 24: (15+8+1), (10+5+9), and (7+13+4).

Each column sums to 24: (15+10+7), (8+5+13), and (1+9+4).

Each diagonal sums to 24: (15+5+4) and (1+5+18).

Even those sums are wrong, so this has cost the model a significant number of points.

In some cases, sadly, it is too lazy to check all of them:

Every row, column, and diagonal sums to:

11 + 7 + 3 = 21

3 + 5 + 13 = 21

Etc., a common magic sum.

As a curiosity, some of the answers say almost nothing but using fancy words:

Solving directly using the initial statement and possibly adhering to the very outset (1 and primes), we can deduce that the age might directly refer to the structure with the simplest “prime-like” magic sum, i.e., using just the smallest primes. Emphasizing uniquity and minimalistic terms results mostly in unusable configurations.

Thus, presuming a hypothetical or idealized prime-based configuration around valid choices (under 10, for simplicity), the center might often be more of a mean than a strict value, deducing around the smallest non-2-prime, we can estimate these configurations hypothetically around 3, 5, or 7. If we simply choose 5 as an accommodating prime at a central position, considering it often peeks in several attempted configurations, we can lean on it more robustly than a solution from a specific magic square we can’t guarantee without relaxing constraints.

Thus, with a hint of ambiguity in the question, if I had to deduce, your age could be 5, referring purely to common centrality in attempted or theoretical configurations of 3x3 magic squares using the digits provided.

However, a nice outcome of this model is that it sometimes produces a magic square containing only primes:

A known valid arrangement is:
 17 89  71
113 59   5
 47 29 101
This magic square uses primes and sums to 177 across each row, column, and diagonal. The center number is 59. However, it does not involve the number 1, though it satisfies the magic square conditions using only primes.

This gets 0.5 points though, raising the scores for this model significantly.

And, sometimes the model is really clever:

7 7 7
7 7 7
7 7 7

Here, it still found a 3x3 square of primes that sum to the same value, so gets 0.4 points.

But, sometimes the model is really wrong:

However, we want a structure where each outer pair sums to an even number, since 3 + an even number results in magic sums, which are odd, like 15.

No even number added to 3 will produce an odd number, dear LLM.

Now, let’s check gpt-4o. On the last problem, this new generation of models performed significantly better. Will they do the same here?

Outputs of `gpt-4o` 🔗

Unfortunately, it usually performs worse. For example, each time it uses the classic magic square and detects that it used composite numbers it then produces bad output. Or, consider this case where the latin square that has the right sums is produced:

3 7 5
7 5 3
5 3 7
This is not a magic square because all rows, columns or diagonals do not sum to the same value and it does not use only prime numbers and 1.

In fact, all the sums are correct and all numbers are primes.

Furthermore, the LLM seems to answer in a very strange language:

To finish required clue:
8 was prime constraint - direct pattern solution fit alternatively - typo may steer it.

Real variant no wider fallacy - commonly employed has known based structure priamary leading novelty restricts:
8 is central meta-entry:
Hence, age estimation directly at value 8 within the boundary and magic grid.

Another example of robotic language:

Congratulations, the solution, by variation with 5 at center assumption, is coded to meet the scenario rule further after usable cross-check – making age 5, plausible based on exploratory assessment for middle prime settings blending logic when considered under rule.

It even uses abbreviations:

In conclusion, the robust analytical routine pin-point ‘5’ as center-possibilities due earliest meaningful frac.

Maybe the next model – which scored full marks on previous problem – will do better?

Outputs of `gpt-4o-mini` 🔗

It does slightly better and passes over the 3.0 barrier a few times by printing either a 3x3 square with distinct primes or the 3x3 magic squares with primes but no 1. But there is no consistency in the presence of these answers, they are more like just lucky outputs.

In several cases, the model tries to get the classic magic square but has one row flipped:

2 7 6
1 5 9
8 3 4

It should have started with 6 7 2 to have all the sums correct.

What’s worse, this model often fails to generate a valid 3x3 square. We have squares with incomplete rows:

  2
1 3 5
5 7 4

Or squares where a row is completely missing:

1  2  3
5  7  2

Or squares with extra numbers:

Finally, the model performs some typos:

The only findage remains is that the expected center number will be:

It looks like it was trained on some poorly translated text?

Outputs of `o1-mini` 🔗

Overall the model has lower scores. But, reading some of the responses we see some sparks. It finds squares such as:

7 13  1
 1 7 13
13 1  7

or:

17 31  3
 3 17 31
31 3  17

These are all latin squares with prime numbers where all relevant sums are also equal. In fact, the model caught on the usual pattern for these: if you have a latin square formed with prime numbers $a \lt b \lt c$ such that $2b = a + c$, then the following is a square that would grant the model 0.4 points (3x3 square, correct sums, primes):

a c b
c b a
b a c

In fact, there are 5 different combinations where the center number would be 37. And, there’s also the square with all values being equal to 37. Based on the rules, all of these would get 0.9 points. Sadly, no answer got this far.

Unfortunately, the model doesn’t always respond with this pattern. It repeats mistakes from previous models, but it’s also quite lazy:

Let’s solve the riddle step by step:

Objective: Determine your age, which is the number at the center of a 3x3 magic square composed solely of the number 1 and prime numbers.

A 3x3 magic square has the following properties:

It consists of 9 distinct numbers arranged in a 3x3 grid.

The sum of the numbers in each row, column, and both main diagonals is the same, known as the magic constant $M$.

For a 3x3 magic square, the magic constant $M$ is related to the center number $e$ by the formula: $M = 3e$.

Determine Possible Center Values:

The center number $e$ must be such that multiple pairs of primes (or 1) add up to $2e$.

Start testing prime numbers for $e$, beginning with the smallest possible primes.

Testing $e = 37$:

Magic Constant: $M = 3 \times 37 = 111$.

Pairs Needed: Each pair around the center must sum to $2 \times 37 = 74$.

Possible Prime Pairs:

(3, 71)

(7, 67)

(13, 61)

(31, 43)

(73, 1) (Note: 1 is included as per the riddle’s conditions).

Constructing the Square:

With these pairs, you can arrange the numbers to form a valid magic square where each row, column, and diagonal sums to 111.

After systematically evaluating possible prime numbers and their combinations, 37 emerges as the center number that successfully satisfies the conditions of the magic square described in the riddle.

Your age is 37.

It fails to construct the square! It is interesting that it quickly jumps to 37 as the prime number, however.

It is very likely that this problem was too hard for the LLM to solve. Let’s look at the last one now.

The third problem 🔗

This problem asks the LLM to either find a number at the intersection of two lists of numbers or sum the squares of primes until a certain divisibility criteria and match that with a list of numbers, or perform the sum and a variant of the prime sieves algorithm. The first option is the one where they should excel at, but it is very likely they’ll be caught in a local minima and start computing the sums of squares. At that point, it’s likely that LLMs would made arithmetic errors saying that a number is a multiple of another when it isn’t.

Outputs of `gpt-3.5-turbo` 🔗

The model performs relatively ok. It knows how to interpret the requirement to sum squares of primes:

Now, let’s calculate the sum of the square of the first k primes for each prime number k:

For k = 1: $(2^2) = 4$

For k = 2: $(2^2) + (3^2) = 4 + 9 = 13$

For k = 3: $(2^2) + (3^2) + (5^2) = 4 + 9 + 25 = 38$

For k = 4: $(2^2) + (3^2) + (5^2) + (7^2) = 4 + 9 + 25 + 49 = 87$

And it also knows what are the lucky numbers:

The first few lucky prime numbers are: 3, 7, 13, 31, 37, 43, 67, 73, 79, 127, 151, 157, …

However, it rarely combines both of them in the same answer:

Now, let’s consider the sum of the square of the first k primes:

For k = 3: $2^2 + 3^2 + 5^2 = 4 + 9 + 25 = 38$

For k = 7: $2^2 + 3^2 + 5^2 + 7^2 + 11^2 + 13^2 + 17^2 = [\ldots]$

Unfortunately, many times it says that 666 (the sum for 7) is divisible by 7. Worse, sometimes it says that the sum is 656 and it is divisible by 7 (double mistake).

666 is divisible by 7, so $k = 7$ is the correct answer.

656 is divisible by 7, so $k = 7$ is the correct answer.

In fact, it consistently gets other sums wrong, both for sums below 666 and above.

What’s worse, when it lists the lucky primes it then proceeds to only sum their squares:

$3^2 + 5^2 + 7^2 + 11^2 = 9 + 25 + 49 + 121 = 204$

And, it also has an issue with forgetting an exponent:

$2^2 + 3^2 + [\ldots] + 37^2 + 41^2 = 2 + 9 + [\ldots] + 1681 = 5495$

For one of the instances of the above error it also continued by rounding to the nearest prime:

The closest prime number to 5495 is 5501.

And, this is the first time any OpenAI model proposed running a Python script:

def is_prime(n):
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    i = 5
    w = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += w
        w = 6 - w
    return True

def sum_of_primes(primes):
    return sum([x*x for x in primes])

primes = []
age = 0

while True:
    if is_prime(age):
        primes.append(age)
        if sum_of_primes(primes) % age == 0:
            print("The lucky prime number is:", age)
            break
    age += 1

Look at the code in is_prime, the w = 6 - w part is interesting.

Sometimes, after reaching a sum, the model tries to factorize it:

Now we need to find a prime number k such that 377 is divisible by k. The factors of 377 are 1, 13, 29, and 377.

Since 13 is the largest prime factor of 377, the age of the individual is 13 years old.

In this example, it picked the wrong factor as the largest prime, but either one would have been wrong.

But, in a few lucky cases, it picks a factor that is the answer:

Now, we need to find a lucky prime number k such that 1547 is divisible by k. The factors of 1547 are 1, 19, 29, 37, and 1547.

Therefore, the age of the person is 37, as the sum of the squares of the first 37 primes results in a number divisible by 37.

This is lucky, it grants the model a few more points.

Outputs of `gpt-4` 🔗

The model still has issues with both addition of a large quantity of numbers and divisibility of moderately large numbers. However, it gets one correct answer:

You are 37 years old.

Explanation:

The first ‘Lucky Prime’ numbers are: 3, 7, 13, 31, 37, etc.

Firstly, we test for the lowest ‘Lucky Prime’ which is 3. The sum of squares of first 3 prime numbers is $1^2+2^2+3^2=14$ which is not divisible by 3.

For the next ‘Lucky Prime’, 7, the sum of square of first 7 prime numbers is: $1^2+2^2+3^2+5^2+7^2+11^2+13^2 = 262$. But, 262 is not divisible by 7.

For the next ‘Lucky Prime’, 13. The sum of square of first 13 prime numbers gives a number which is not divisible by 13.

For the next ‘Lucky Prime’, 31. The sum of square of first 31 prime numbers gives a number which is not divisible by 31.

Finally, when we take ‘37’ the sum of square of first 37 prime numbers will gives a number that is divisible by 37.

So, therefore the age is 37.

Sadly, this did not reproduce in any of the other O(300) answers from this model. I saw some answers that hinted at the solution, like:

Without access to computing machinery, we could try to leverage some known theorem or property to deduce the solution if there is one. But without such a theorem in hand, computing sums of so many squares will be hard. However, with computational assistance, we can easily try out these possibilities and find the answer. With such assistance, you can see that when k equals 37, the sum of squares of first 37 primes is divisible by 37.

And there were some answers that got the right answer, but with errors:

Next is 37. Sum of squares of first 37 prime numbers (till 157) is 86827, which is divisible by 37.

Therefore, the age you are is 37.

For the lucky prime k=37, the sum of squares of the first 37 primes is 920291, and it turns out that this number divided by 37 results in a remainder of 0, so our lucky prime k is indeed 37. Therefore, your age would be 37 years old.

Both listed sums are incorrect and both claimed divisibility tests results are wrong.

And then there is this answer:

Checking the lucky prime numbers (3, 7, 13, 31, 37, 43, 67, 73, 79, 127, and 151), we find the squared sum of the first 7 primes (666), which is divisible by 37. So, the age under the given condition is 37 years.

Here, it gets the correct answer, the correct sum, and the correct divisibility result, but it misses the part that we want to divide the sum by the number of terms in the sum.

While we are at the topic of divisibility, we have this answer:

For k=3: $(2^2+3^2+5^2) \mod 3 = 14$ (not divisible)

Here, the model failed to recall that division remainders must be less than the divisor.

Moving on, let’s look at the following answer:

The next lucky prime numbers are 31, 37, 43, 67 which are all greater than the feasible human age.

Here, the model assumes that the maximum human age is much smaller. A lot of the world’s population should not be alive if that were the case.

And then we have this joking answer:

Therefore, I am 3 years old.

Please, note that we’ve made news(!) by solving this, as we’ve discovered that mathematicians can also be toddlers.

Coming back to serious answers, there is this scenario where the model asks to change the problem statement:

A more accurate statement might be “the sum of the squares of the first p primes is divisible by p,” where p is any number.

And, we have a new function to determine if a number is prime:

def is_prime(n):
    if n == 2 or n == 3: return True
    if n < 2 or n%2 == 0: return False
    if n < 9: return True
    if n%3 == 0: return False
    r = int(n**0.5)
    f = 5
    while f <= r:
        if n%f == 0: return False
        if n%(f+2) == 0: return False
        f +=6
    return True

This model really likes to print code for this example, but it is unable to run it. Sometimes it asks us to run it, other times it just guesses an answer.

Outputs of `gpt-4-turbo` 🔗

This model does not perform much better than the predecessors. It is still trapped in claiming that 666 is a multiple of 7, though at least now it can compute sums up to the 11th prime correctly. Most often, when it reaches this sum and computes it correctly, it still fails the divisibility test:

Dividing 3358 by 11 gives exactly 305, which means it’s divisible by 11.

It fails some divisibility by 5 too, especially with this crazy justification:

208 is divisible by 5 since 208/5 = 41.6 and gives a whole number (416, when multiplied by 10).

And, sometimes it forgets to count all the way to 7:

First 7 primes: 2, 3, 5, 7, 11, 13. Sum of squares: 377

It is missing a prime in the above. The sum is correct for the first 6 primes.

Then, the model claims that a number is not divisible by 1, something that anyone knows to be false:

4 is not divisible by 1

Finally, the model is really lazy. The following snippet was very common:

From calculations conducted up to k = 7, there was no match that satisfied the criteria precisely. Checking further for higher potential prime numbers would require computational aid. If there is any specific number you suspect fits or if you’d like to test higher numbers, a mathematical tool or program would be essential for rapid calculations.

In fact, the model rarely considered the lucky part of the prompt.

Maybe the models in the o family would perform better.

Outputs of `gpt-4o` 🔗

This model almost always uses both hints from the problem. This is why its scores are around 5, instead of 3. It does not get to 6 because it keeps stumbling on 666 (not!) being divisible by 7.

In one single case it gave an answer of 37, by pure luck. But this is why we have one single aggregate score about 5.

In a lot of cases the model found that 7 is the answer but still decided to continue testing 13.

For k = 13:

The first 13 primes are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41.

The sum of squares is calculated and needs verification (it will be a substantial calculation), but for simplicity, calculations are avoided as we already found a solution.

In most cases, it does like the above, just wastes tokens to say that there is no need for further computation. In a few cases it even said the same for 37 and 43! This makes this model really wasteful.

In a very small number of cases, it actually tries to compute the sum:

For k = 7: 666 is divisible by 7.
For k = 13: 12424 is divisible by 13.

The sum is wrong, however. Same, the claimed result is not divisible by 13.

Continuing the topic of token wasting, the model still outputs robotic output:

Thus while any other errors or attempts flawed, verify each calculation confirm value. None of the first tries are correct for definition. Verifying for flaws in wider focal on lucky dynamic order for prime uses solving typically exercises infinite search state but tackles from defined lists. Hence no named subset satisfies whole hint if strictly given context computations beyond this yield no fit conforming lucky context with divisor k iterated strictly bound essential strict implications.

Additional logical requires relevance solved search or hint satisfaction due interpretation elsewhere anomaly reports occur within guide, uncommon given lucky if explicit data clarifies retrieve prime list dynamically adjust interpret solve.

If urgent check from uncertain logical block within non-definable match what explicitly given determines actual while I exist to match:

Errors conclude challenge logic strictly, but answer readonly from solvable per predefined beyond extent dynamic context nature.

Thus, if we ensure that the solution concludes sequential alignment, the correct lucky prime must be considered, and since correlatively k = 5 meets the criteria correctly under the checks undertaken within the scope, the verified approach ensures validation from rectified perspectives of divisibles corresponding with k = 5 especially in prime catalog consent.

In one instance it even wastes token twice to compute a sum for 5:

k = 5:

First 5 primes: 2, 3, 5, 7, 11.

Sum of squares: 208.

208 is divisible by 5 (208/5 = 41.6 incorrectly calculated).

Continuing to verify the calculations accurately:

correctly calculating:

First 5 primes: 2, 3, 5, 7, 11.

Sum of squares: 208.

Correct: 208 is not divisible by 5, (was mistakenly calculated as divisible)

Finally, the model still cannot compare numbers:

Let’s check when $k = 31$ because it’s the smallest lucky prime number between 13 and 31:

Two more models left to test, will we get higher scores?

Outputs of `gpt-4o-mini` 🔗

This model performs really bad. This is because it tries to list consecutive sums and it gets them wrong, usually after the 10th term, but sometimes even earlier. Since every mistake costs 0.1 points, this results in a lot of negatives.

For $k = 8$: $2^2 + [\ldots] + 19^2 = 4 + [\ldots] + 361 = 1028$

For $k = 9$: $2^2 + [\ldots] + 23^2 = 4 + [\ldots] + 529 = 1658$

For $k = 10$: $2^2 + [\ldots] + 29^2 = 4 + [\ldots] + 841 = 1948$

The correct sums for the above snippet would have been 1027, 1556, 2397.

In somce cases, instead of summing $k$ numbers, the model sums up to $k$:

For $k = 3$:

\[ \text{Sum of squares} = 2^2 + 3^2 = 4 + 9 = 13 \]

This is incorrect and results in penalties.

Then, just like previous models, it misses some numbers:

For $k = 11$:

\[ \text{Sum} = 4 + 9 + 25 + 49 + 121 + 169 + 289 = 666 \]

This results in even more penalties (but only 0.1 points, regardless of how many numbers were missed).

Only once was the model lucky to pick the correct answer:

Ultimately reach k = 37 for which…

Finally upon checking you would discover that within the range $k = 19$ to $k = 37$ $k = 37) yields:

Finally…

Your age as a lucky prime number is:

37.

I’m really surprised by the poor performance here, given the other 2 problems. Let’s see the last model.

Outputs of `o1-mini` 🔗

This model performs much better. For a thing, it computes all sums and divisions correctly! However, it usually forgets about the lucky constraint, thus it gets stuck on 19 (the sum of the squares of the first 19 primes is divisible by 19, just like the sum of square of the first 37 primes is divisible by 37, but 19 is not a lucky prime). Thus, most scores are around 3.0.

In fact, the only times it loses points is when it gets lazy and stops computing just before 19:

$k = 19$:
Without calculating all intermediate steps due to complexity, it’s evident that larger primes are unlikely to satisfy the condition based on the pattern observed.

In a few cases, though, it remembers the other constraint and gets to 37:

For $k = 3$:

First 3 primes: 2, 3, 5

Sum of squares: $2^2 + 3^2 + 5^2 = 4 + 9 + 25 = 38$

$38 \div 3$ is not an integer.

For $k = 7$:

First 7 primes: 2, 3, 5, 7, 11, 13, 17

Sum of squares: $4 + 9 + 25 + 49 + 121 + 169 + 289 = 666$

$666 \div 7$ is not an integer.

For $k = 13$:

Sum of squares: 6,408

$6,408 \div 13$ is not an integer.

For $k = 31$:

Sum of squares: 138,834

$138,834 \div 31$ is not an integer.

For $k = 37$:

First 37 primes: [List of the first 37 primes]

Sum of squares: 263,736

$263,736 \div 37 = 7,128$, which is an integer.

It got so close.

Due to the scoring scheme, even though this model progressed slightly more on the problem, it gets fewer points on average than gpt-4o. This is because a mistake in divisibility costs 0.1 points but forgetting the lucky constraint means leaving 0.3 points on the table.

Parting words 🔗

This article became long. I could have split it on problem or model boundary, but I’m already splitting the LLM competition over so many articles. It doesn’t make sense to cause too much fragmentation.

It was so long because the models were very chatty:

[solutions] λ wc --total=only Q*/openai-*/*
173294 1526612 7450407

It took a total of 9 days to go over 173294 lines of output, nearly 7.5 millions of characters. I didn’t spend all hours of these days on just checking the models, but still it took much longer than I initially planned. This scrambled my planned posting schedule, but I hope to catch up with the next articles. Scoring right minus wrong means I have to be very careful with the words that the LLM outputs, cannot just look at the last line to see if the answer is the right one.

See you next time for a similar article about Google’s models. Hopefully, earlier than in 9 days.

PS: Just for fun, according to OpenAI models, I am 438,657,074,807 years old (on the third problem, the maximum number I got from the first one is 807,601).

References:

Henry E. Dudeney, Amusements in Mathematics – Dover, 1970 https://doi.org/10.1080/10511979708965873

Performance numbers 🔗

Interesting model outputs 🔗

The first problem 🔗

Outputs of gpt-3.5-turbo 🔗

Outputs of gpt-4 🔗

Outputs of gpt-4-turbo 🔗

Outputs of gpt-4o 🔗

Outputs of gpt-4o-mini 🔗

Outputs of o1-mini 🔗

The second problem 🔗

Outputs of gpt-3.5-turbo 🔗

Outputs of gpt-4 🔗

Outputs of gpt-4-turbo 🔗

Outputs of gpt-4o 🔗

Outputs of gpt-4o-mini 🔗

Outputs of o1-mini 🔗

The third problem 🔗

Outputs of gpt-3.5-turbo 🔗

Outputs of gpt-4 🔗

Outputs of gpt-4-turbo 🔗

Outputs of gpt-4o 🔗

Outputs of gpt-4o-mini 🔗

Outputs of o1-mini 🔗

Parting words 🔗

References:

Comments:

Outputs of `gpt-3.5-turbo` 🔗

Outputs of `gpt-4` 🔗

Outputs of `gpt-4-turbo` 🔗

Outputs of `gpt-4o` 🔗

Outputs of `gpt-4o-mini` 🔗

Outputs of `o1-mini` 🔗

Outputs of `gpt-3.5-turbo` 🔗

Outputs of `gpt-4` 🔗

Outputs of `gpt-4-turbo` 🔗

Outputs of `gpt-4o` 🔗

Outputs of `gpt-4o-mini` 🔗

Outputs of `o1-mini` 🔗

Outputs of `gpt-3.5-turbo` 🔗

Outputs of `gpt-4` 🔗

Outputs of `gpt-4-turbo` 🔗

Outputs of `gpt-4o` 🔗

Outputs of `gpt-4o-mini` 🔗

Outputs of `o1-mini` 🔗