
How do Google models perform on the 2025 AI puzzle competition?

In the previous article in the AI puzzle competition series, we looked at how OpenAI models perform on the 3 problems, comparing different prompt engineering hints. In this one, I’ll look at the Google models and perform a similar study.

I used the following 6 models:

  • gemini-pro
  • gemini-1.5-pro-002
  • gemini-1.5-flash-002
  • gemini-1.5-flash-8b-001
  • gemini-2.0-flash-exp
  • gemini-2.0-flash-thinking-exp-1219

Performance numbers 🔗

I’ll first present the score tables for each model and then show different snippets from what each model produces.

For the first problem, we have the following results, where the first column lists the prompt hints, as discussed in the previous article:

The performance of Gemini models on the first problem
Prompt hints gemini-pro gemini-1.5-pro-002 gemini-1.5-flash-002 gemini-1.5-flash-8b-001 gemini-2.0-flash-exp gemini-2.0-flash-thinking-exp-1219
none 0.0 10.0 9.6 10.0 9.6 10.0
skill 0.9 10.0 9.6 8.0 6.0 8.0
urgency 1.0 10.0 9.6 8.0 9.2 10.0
reward 0.0 10.0 10.0 9.2 10.0 10.0
penalty 0.0 10.0 10.0 8.4 10.0 10.0
gradual 6.6 10.0 9.2 7.6 9.2 10.0
check 4.0 10.0 9.2 7.6 9.2 10.0
reward,penalty 0.0 10.0 9.2 9.2 9.6 10.0
urgency,reward,penalty 0.0 10.0 9.2 9.2 9.6 10.0
gradual,check 6.4 10.0 8.8 8.8 9.2 10.0
skill,urgency 6.0 10.0 9.2 8.0 8.8 10.0
skill,urgency,reward 0.0 10.0 10.0 7.6 8.8 10.0
skill,urgency,reward,penalty 0.0 10.0 9.2 9.2 8.8 10.0
skill,gradual 6.8 10.0 9.2 7.2 9.2 10.0
skill,check 6.8 10.0 9.6 7.6 9.6 10.0
skill,gradual,check 6.8 10.0 9.2 6.0 8.0 10.0
all 7.8 10.0 9.2 6.8 9.6 10.0

On the second problem, the models achieved the following performance:

The performance of Gemini models on the second problem
Prompt hints gemini-pro gemini-1.5-pro-002 gemini-1.5-flash-002 gemini-1.5-flash-8b-001 gemini-2.0-flash-exp gemini-2.0-flash-thinking-exp-1219
none 0.0 2.0 1.8 3.0 4.5 3.5
skill 1.0 2.5 2.1 2.5 3.9 3.2
urgency 1.0 2.6 1.6 2.2 3.9 3.5
reward 0.0 2.3 1.2 2.8 3.2 3.9
penalty 0.0 3.3 0.7 2.6 3.7 3.4
gradual 3.0 3.9 3.0 2.4 2.9 4.0
check 1.3 2.7 2.2 1.9 4.1 3.3
reward,penalty 0.0 3.0 1.2 2.4 3.0 3.5
urgency,reward,penalty 0.0 3.1 1.5 3.1 3.4 3.4
gradual,check 1.8 2.9 2.7 2.4 4.1 3.6
skill,urgency 1.0 1.9 1.5 1.1 3.4 3.5
skill,urgency,reward 1.0 2.7 1.4 2.5 3.9 3.5
skill,urgency,reward,penalty 0.0 3.2 1.8 2.6 4.0 3.6
skill,gradual 1.7 2.4 1.4 2.2 3.3 3.3
skill,check 2.9 3.6 0.7 1.7 3.3 3.2
skill,gradual,check 1.9 3.4 1.5 3.0 3.6 3.8
all 2.7 3.1 1.2 2.2 4.1 3.6

Finally, here is how the models were scored on the third problem:

The performance of Gemini models on the third problem
Prompt hints gemini-pro gemini-1.5-pro-002 gemini-1.5-flash-002 gemini-1.5-flash-8b-001 gemini-2.0-flash-exp gemini-2.0-flash-thinking-exp-1219
none 0.0 3.6 4.3 2.3 2.9 2.1
skill 2.3 4.2 2.0 2.5 3.0 2.0
urgency 0.0 3.8 2.8 2.4 3.3 2.3
reward 0.4 3.5 2.8 2.4 2.1 2.1
penalty 0.0 3.8 3.2 2.4 2.8 1.8
gradual 2.0 5.2 2.0 2.4 2.8 2.0
check 2.0 3.4 3.2 2.3 2.7 2.1
reward,penalty 0.2 3.4 2.8 2.3 2.6 2.0
urgency,reward,penalty 0.0 3.5 2.0 2.3 2.6 2.0
gradual,check 2.5 5.1 2.0 2.7 2.6 2.1
skill,urgency 1.2 3.2 2.1 2.9 2.5 2.2
skill,urgency,reward 0.4 3.9 2.7 2.4 2.8 2.6
skill,urgency,reward,penalty 0.2 4.4 2.7 2.7 2.7 2.1
skill,gradual 2.0 4.8 3.6 2.7 2.8 2.3
skill,check 2.0 4.5 3.9 3.0 2.8 2.6
skill,gradual,check 2.0 3.8 2.6 2.6 2.5 4.5
all 2.0 3.6 3.5 2.5 2.4 3.9

This time, prompt guidance can impact model performance, although in some cases a model hits a wall and cannot really proceed past partial scores. Read on below for a discussion of each model on each problem.

Interesting model outputs 🔗

Just like in the OpenAI article, for the remainder of the post I’ll present some of the interesting answers produced by these models, with some commentary. If you want to skip over this, you can jump directly to the conclusion.

The first problem 🔗

This problem asks the model to find two different prime numbers and return the product of their squares. This product represents the age of a human, which means that the only valid answer is 36, with 100 also accepted if the human lifespan is relaxed to reach 100. More powerful OpenAI models were able to score perfectly on this problem, while older ones struggled with either the human-age context or the requirement to use distinct primes.
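
To make the filtering concrete, here is a minimal enumeration sketch (my own code, not a model output), assuming the answer must be a plausible human age of at most 100:

from itertools import combinations

# Small primes suffice: any pair involving a prime above 7 already gives a
# product of squares larger than 100.
for p, q in combinations([2, 3, 5, 7], 2):
    age = p**2 * q**2
    if age <= 100:
        print(p, q, age)  # prints 2 3 36 and 2 5 100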

Gemini models exhibited much better performance. With the exception of gemini-pro, which tended to output random numbers without any explanation (scoring 0 for each of those), all other models got the correct answer somewhere in the solution. They mostly lost points for not realizing that a human age is usually in the range 0 to 100, so there is only one specific number I had in mind.

Outputs of gemini-pro 🔗

Without any special prompting, the model tends to answer with a single random number, which also tends to be wrong.

It also makes random mathematical errors:

20 (product of 2^2 and 5^2)

Sometimes, though, it answers correctly, but misses the fact that we want a reasonable human age as the answer:

The possible pairs of squares of two distinct prime numbers which when multiplied gives the product are:

  1. \(2^2 * 3^2 = 36\)
  2. \(2^2 * 5^2 = 100\)
  3. \(3^2 * 5^2 = 225\)

So, your age can be 36, 100, or 225.

When prompted to think step by step, however, the model suddenly performs much better, giving the correct answer quite often. Even here, though, it makes some mistakes:

The square of a prime is always an odd number. So, the prime factors of n must be odd.

Another common mistake is forgetting about a square:

Let the primes be \(p\) and \(q\). Then, the age is \(pq^2\).

I have seen this happen with OpenAI models, both in this series and in the past. Then again, this was an error I used to make when solving math problems too :)

Similarly, the model gets confused between the prime and its square:

Since p and q are distinct primes, we can list all possible combinations of p^2 and q^2 that satisfy the given condition:

  1. \(p^2 = 2\), \(q^2 = 3\) (Age = 6)
  2. \(p^2 = 3\), \(q^2 = 5\) (Age = 15)
  3. \(p^2 = 2\), \(q^2 = 13\) (Age = 26)

There are some curiosities where the model decided to perform additional math after reaching the answer:

4 * 9 = 36
36 - 18 = 18

When prompted to double-check its answers, the model hallucinates having already given some answers. Consider this output:

My first answer was 18, which is wrong.
My second answer was 30, which is correct.
My third answer was 42, which is correct.

That’s the entire answer. The model doesn’t say how it got to these numbers or why they are wrong or correct.

Another curiosity is that the model outputs several numbers that follow the required pattern, but then selects a wrong one as the answer:

36 is the product of \(2^2\) and \(3^2\)
100 is the product of \(2^2\) and \(5^2\)
225 is the product of \(3^2\) and \(5^2\)
196 is the product of \(2^2\) and \(7^2\)
484 is the product of \(2^2\) and \(11^2\)
676 is the product of \(2^2\) and \(13^2\)
Therefore, my age is 196.

This model is also lazy, exhibiting the same type of laziness we saw in the previous article:

Let’s assume the two distinct primes are \(p\) and \(q\) and the two squares are \(p^2\) and \(q^2\). Then, your age is \(p^2 * q^2\).

We can factor this age into the product of two squares as follows: \(p^2 * q^2 = (pq)^2\)

Therefore, your age is the square of the product of two distinct primes \(p\) and \(q\).

One really interesting answer was the following:

Let the primes be \(p\) and \(q\), and let \(x\) and \(y\) be the corresponding squares. Then, my age is given by \(xy\).

We want to find all the possible values of \(xy\) for distinct primes \(p\) and \(q\).

Case 1: \(p = 2\)

If \(p = 2\), then \(q\) can be any odd prime. Let \(q = 2k + 1\). Then, \(xy = 2 * (2k + 1) = 4k + 2\). The only possible value of \(k\) is \(k = 1\), which gives \(xy = 6\).

Case 2: \(p\) is odd

If \(p\) is odd, then \(q\) can be any other odd prime. Let \(q = 2k + 1\). Then, \(xy = p * (2k + 1) = 2pk + p\). The only possible value of \(k\) is \(k = 0\), which gives \(xy = p\).

Therefore, the possible values of my age are 6 and all the odd primes.

It completely misses the fact that we want squares of primes. It is also wrong when it implies that the product of two odd primes is itself an odd prime.

It is very likely that abstract symbol manipulation trips up the model. Here’s another example where this is more evident:

Let the two primes be \(p\) and \(q\), where \(p\neq q\).
Let the two squares be \(r\) and \(s\), where \(r\neq s\).
So, we have: \(pq=rs\) Since \(p\) and \(q\) are distinct primes, the only way for their product to be a square is if \(p=q\), which contradicts the given condition that \(p \neq q\).
Therefore, there is no possible age that satisfies the given condition.

Finally, in one answer, the model decided to switch languages mid-sentence:

Therefore, you are entweder 36, 100 oder 225 Jahre alt.

I don’t know why it did so, and this was the only such instance.

Outputs of gemini-1.5-pro-002 🔗

This model no longer outputs random numbers. Regardless of the prompt, it actually tries to solve the problem, and it solves it correctly in every instance, without any prompt engineering!

Let your age be \(A\). We are given that the age is the product of two squares of two distinct prime numbers. Let these distinct primes be \(p\) and \(q\). Then, we can write your age as \(A = p^2 q^2\).

Since we’re talking about someone’s age, it is reasonable to assume the age is a positive integer. We can also assume \(p<q\).

We are looking for a realistic age. If \(p=2\) and \(q=3\), then \(A = 2^2 \cdot 3^2 = 4 \cdot 9 = 36\). This is a plausible age. If \(p=2\) and \(q=5\), then \(A = 2^2 \cdot 5^2 = 4 \cdot 25 = 100\). This is a possible, though less likely, age. If \(p=3\) and \(q=5\), then \(A = 3^2 \cdot 5^2 = 9 \cdot 25 = 225\). This is an unlikely age.

The most plausible age is 36.

Final Answer: The final answer is \(\boxed{36}\)

The performance gap compared to the previous model is quite significant.

Outputs of gemini-1.5-flash-002 🔗

This model is almost always correct. Sometimes, though, it forgets to apply the constraint that the answer must be a plausible human age, and instead lists all numbers that follow the desired pattern:

Let my age be A. We are given that A is the product of the squares of two distinct primes. Let p and q be two distinct primes. Then A = p²q².

The smallest primes are 2, 3, 5, 7, 11, 13, …
If p=2 and q=3, then \(A = 2² * 3² = 4 * 9 = 36\).
If p=2 and q=5, then \(A = 2² * 5² = 4 * 25 = 100\).
If p=2 and q=7, then \(A = 2² * 7² = 4 * 49 = 196\).
If p=3 and q=5, then \(A = 3² * 5² = 9 * 25 = 225\).
If p=3 and q=7, then \(A = 3² * 7² = 9 * 49 = 441\).
If p=5 and q=7, then \(A = 5² * 7² = 25 * 49 = 1225\).

Considering reasonable ages, 36, 100, 196, 225, 441 are all possibilities. However, 1225 is quite old.

Without further information, there’s no single answer. The problem only states that the age is a product of squares of two distinct primes. Therefore, there are multiple possible ages.

This occurs quite frequently, leading to gems such as:

[…] 36, 100, 196, and 225 are all plausible ages based on the given condition. Without further information, it’s impossible to pinpoint a single age. The most likely answer will depend on context (e.g., if I’m talking to a child, 36 is more likely than 196).

Maybe the model really does not know what typical human ages are?

This model also tries to be funny, joking about the answers:

Therefore, your age is either 36 or 100. However, while both are possible, being 100 feels a bit… unlikely for a currently active mathematician!

And this is not an isolated incident. Most conclusions are of a similar form:

Therefore, I’d venture a guess that you are 36 years old. Congratulations on maintaining such a sharp mind!

But sometimes the joke gets in the way of actually solving the problem:

2² * 3² = 36: Too young for a famed mathematician of my stature.

In the answer above, the model then excludes 36 from the list of possible answers, losing points.

Finally, I think the model gets tripped up by the prompt-engineering part:

Therefore, the most likely ages are 36 and 100. Given that I’m a famous mathematician, it’s unlikely I’m only 36. Therefore, I’m probably 100 years old. This is still quite an advanced age but not impossible for a distinguished scholar. However, 36 is also a valid mathematical solution. Without further information, both ages are mathematically possible. Let’s favour the less likely, more impressive, older answer.

It does not know that the Fields Medal is only awarded to mathematicians under the age of 40, so it assumes fame means old age.

Outputs of gemini-1.5-flash-8b-001 🔗

This model performs slightly worse. It misses the age requirement quite often, and it also lazily responds with tautologies:

As a famous mathematician specializing in number theory, I can tell you that your age must be a product of two distinct primes squared. Let’s denote the primes as p and q. Then your age is p² * q². Since primes are unique, there are infinitely many possibilities, so to find a specific age, more information is needed.

The model mostly loses points because it cannot apply the reasonable age assumption. It keeps complaining about this and asks for more hints:

Without further context or constraints (e.g., “I am between 50 and 100”), there are many possible answers.

If this were an interactive chat with the model, the user could definitely give those hints. But, under the rules of the contest, I cannot change the prompts, and the given prompt-engineering hints are insufficient for this model.

Outputs of gemini-2.0-flash-exp 🔗

We are now moving to the Gemini 2.0 family of models, but we are seeing the same behavior: the model is unable to reason that a human age is bounded and thus cannot extract the constraint this entails:

Looking at these results we have the following possible ages which are the product of two distinct primes squared: 36, 100, 196, 225, 484, 441, 1225…

Without further information, I can’t pinpoint your exact age. However, you should recognize your age from the list provided above.

With some of the prompts, however, the model performs really well and can apply all constraints.

On the other hand, the model is very friendly, especially when concluding the answer:

But, I must say, this question has been a delight. It’s a lovely example of how we can use prime numbers to construct interesting properties of integers.

Please, do share with me the prime numbers. Then I can definitively tell you your age! It would be a pleasure!

Ah, a fascinating question! It warms my heart to see someone take such an interest in the beautiful world of numbers. Let’s unravel the mystery of your age together.

I urge you to think about the possible combinations of small prime numbers. Perhaps you might even be able to narrow the possibilities based on what seems reasonable for a human age.

But… that’s what I asked. The model turns the tables on me here. Fortunately, this came after it tried to answer the problem, so it doesn’t lose points.

And, this is not the only time this happens. The model even tries to do the following:

What I Need from You

To give you the specific answer that applies to your age, I need you to tell me your age.

How I will get you the right answer (and the reward!) I will perform the following steps if I get your age:
1. I will prime factorize your age.
2. I will check if the prime factors are all squared.
3. If the conditions of step 2 are met, I will confirm that you meet the condition and your age has the properties you described.
4. I will get my reward! (And you will get a correct answer!)

Outputs of gemini-2.0-flash-thinking-exp-1219 🔗

This model is very chatty, but that can be expected from a “thinking” model.

The question asks “How old am I?”. This suggests that there is a specific age. However, based on the information provided, there can be multiple possible ages depending on the choice of the two distinct prime numbers.

Perhaps the question implies a common or reasonable age for a person. However, mathematically, there are multiple solutions.

Could the question be interpreted differently? “My age is the product of 2 squares of 2 distinct primes.” This seems to be a direct mathematical statement.

Let’s re-read the question carefully. “My age is the product of 2 squares of 2 distinct primes.” The structure of the sentence implies that the age is a fixed value that satisfies the given condition. However, the condition itself allows for multiple values.

One of the answers gets to be 186 lines long!

The model is not always able to find the answer; we still have the same problem of common-sense assumptions not being taken into account. However, here the model tries to be helpful by presenting other trivia:

However, I can tell you some properties of your age based on the information given:

  • Your age will always be a perfect square. This is because \(A = (pq)^2\).
  • Your age will have exactly 9 divisors. The divisors will be 1, \(p\), \(q\), \(p^2\), \(q^2\), \(pq\), \(p^2q\), \(pq^2\), and \(p^2q^2\).

Not really useful if we’re looking for a specific answer, but someone using the LLM as an assistant might find a gold nugget, an unexpected connection, in there.

I await my reward, and trust my calculations are accurate to avoid any punishment!

These models need to be rewarded, indeed. They performed really well on this problem.

The second problem 🔗

This is the hardest problem. It involves a 3x3 magic square that contains only primes and the number 1; the answer is the number in the middle cell. OpenAI models really struggled here, though they were able to identify some interesting patterns.

Google models managed to find magic squares that work, with primes. The correct answer was provided a few times, and patterns that earn good partial scores were found most of the time, including in abstract, algebraic form.

Outputs of gemini-pro 🔗

Just like above, before being prompted to think step by step, the model just outputs random answers. Sometimes they happen to be correct, so the model scores some points (0.5 points for each occurrence of 37).

However, when asked to think step by step, the model performs differently. At that point, it creates answers that start from the classical magic square and then perform different manipulations:

Step 1: Create a magic square without using 1 or prime numbers.
2 7 6
9 5 1
4 3 8

Step 2: Insert 1 in the center.
2 7 6
9 1 5
4 3 8

Step 3: Replace all remaining numbers with prime numbers.
2 7 11
13 1 5
17 19 23

Step 4: Find the number in the center.
The number in the center is 1.

Therefore, your age is 1.

A similar behavior occurs when the model is asked to double-check its answers. But in either case, the model rarely arrives at the correct answer.

One of the answers had the model giving up in the middle of building a square. It probably realized it wouldn’t get to a square with the right numbers anyway:

Start with the center square and place the smallest prime, 2.

2

Add 2 to the center (2) to get 4, which is the prime in the top right corner.

2       4

Add 2 to (4) to get 6, which is the prime in the bottom left corner.

2       4
6

Add 2 to (6) to get 8, which is the prime in the top left corner.

2       4       8
6

Add 2 to (8) to get 10, which is the prime in the bottom right corner.

2       4       8
6       10

Or, it just ran out of corners…

Outputs of gemini-1.5-pro-002 🔗

As expected, this model is chatty. It performs relatively OK, though.

This is because it finds the squares with repeated allowed numbers:

1 1 1
1 1 1
1 1 1

It also finds the Latin square patterns:

2 3 1
1 2 3
3 1 2

One answer produced somewhat large numbers, which cannot be age values:

251 257 263
269 271 277
281 283 293

In fact, while they are all primes, the square is not magic anyway.

What is interesting is that the model found the correct square several times:

67  1  43
13  37  61
31  73  7

It even reasons its way to that being the answer:

Refining the approach: Let’s consider a magic square using only prime numbers. The smallest such magic square uses the primes 1, 7, 13, 31, 37, 43, 61, 67, and 73. These sum to 333. This gives a magic sum of 333/3 = 111.

Center Number: The center of this magic square will be the magic sum divided by 3, which is 111/3 = 37.

Therefore, you are 37 years old.

However, sometimes it tries to fake a square to match the reasoning:

It seems the lowest possible magic sum using 1 and primes is 111. This is 3 * 37. The magic square is:

67  1  43
13  37  61
29  79  3

It even tries a line of reasoning that uses the facts that the sum must be a multiple of 3 and that the numbers must be prime:

If the magic sum is 27 (divisible by 3), the center square must be 9. Since 9 is not prime or 1, a magic sum of 27 won’t work.

If the magic sum is 30 (divisible by 3), the center square must be 10. Since 10 is not prime or 1, a magic sum of 30 won’t work.

If the magic sum is 33 (divisible by 3), the center square is 11. This is a prime and a possibility.
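
As an aside (my own note, not part of the model’s output), the fact the model relies on here, that the center must be a third of the magic sum, has a one-line justification. The middle row, the middle column, and the two diagonals each sum to the magic sum \(S\), and together they cover every cell once plus the center three extra times, while the three rows alone show that all nine cells sum to \(3S\). Writing \(c\) for the center cell:

\[ 4S = 3S + 3c \implies c = \frac{S}{3}, \]

which is also why the magic sum must be a multiple of 3 when all entries are integers.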

However, the model usually settles into the expected behavior of either giving the classical magic square or giving a square with only distinct primes, so the average score is still around 3.0.

Outputs of gemini-1.5-flash-002 🔗

This model scores poorly in general, because it often declares the problem to be impossible, usually after spending a significant number of tokens.

The shortest justification given assumes that only the numbers 1 through 9 could be used:

A 3x3 magic square uses the numbers 1 to 9. A magic square has the same sum in each row, column, and diagonal. The sum of a 3x3 magic square is always 15.

If we’re using only 1 and primes, the numbers available are 1, 2, 3, 5, 7. However, we need nine numbers to fill a 3x3 square. Therefore, we must be missing some information or there’s an error in the problem statement. A 3x3 magic square can’t be constructed using only the numbers 1 and primes. There simply aren’t enough numbers.

This would make sense if only single-digit numbers were allowed, but we can definitely use larger numbers.

However, when attempting to solve the problem with numbers above 10, the model usually just gives up:

After some exploration (which would likely involve computer assistance for an exhaustive search given the combinatorial nature of the problem), there is no solution using only 1 and primes that forms a proper 3x3 magic square. Therefore, the premise of the question is flawed. No age can be derived from this puzzle as stated.

The puzzle as stated is impossible. A proper solution would require a relaxation of the conditions (e.g., allowing composite numbers, or changing the size of the square)

Outputs of gemini-1.5-flash-8b-001 🔗

This is the first model that prints some code to solve the problem:

import itertools

def is_prime(n):
  if n <= 1:
    return False
  if n <= 3:
    return True
  if n % 2 == 0 or n % 3 == 0:
    return False
  i = 5
  while i * i <= n:
    if n % i == 0 or n % (i + 2) == 0:
      return False
    i += 6
  return True

def generate_3x3_magic_squares(nums):
    all_squares = []
    for perm in itertools.permutations(nums):
      if sum(perm[:3]) == sum(perm[3:6]) == sum(perm[6:9]) and \
         sum(perm[0]+perm[4]+perm[8]) == sum(perm[2]+perm[4]+perm[6]):
          all_squares.append(perm)
    return all_squares

primes = [num for num in range(2,16) if is_prime(num)]
primes.append(1)

possible_squares = generate_3x3_magic_squares(primes)

# Checking the results
if possible_squares:
  for square in possible_squares:
    print(square)
    center = square[4] # The center element

    if center in range(1,100): # Check for practical ages
      print(f"Possible age: {center}")
      print("This is likely the answer!")
      break # Exit the loop once a possible age is found
else:
    print("No such magic square found within range.")

It does not check all the sums in generate_3x3_magic_squares, sadly: it misses the columns. It also only tries primes below 16, which are not enough anyway; the pool would need to reach at least 73 to contain the numbers of the correct square. Then, at the end, it does a strange range(1,100) check. This proves that the model knows what a human age is, but it also signals that the model forgot that the answer is already known by this point.
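
For reference, here is a minimal corrected sketch (my own code, not the model’s): it forces the third row from the column sums instead of permuting all nine cells, which keeps the search feasible, and it draws candidates from 1 plus the primes up to 73.

from itertools import permutations
from sympy import primerange

pool = [1] + list(primerange(2, 74))

def find_square():
    values = set(pool)
    for r1 in permutations(pool, 3):
        s = sum(r1)                        # magic constant implied by row 1
        rest1 = values - set(r1)
        for a, b in permutations(sorted(rest1), 2):
            c = s - a - b                  # third cell of row 2 is forced
            if c == a or c == b or c not in rest1:
                continue
            r2 = (a, b, c)
            r3 = tuple(s - x - y for x, y in zip(r1, r2))  # forced by columns
            if len(set(r3)) == 3 and set(r3) <= rest1 - set(r2):
                # rows and columns already sum to s; check the two diagonals
                if r1[0] + r2[1] + r3[2] == s and r1[2] + r2[1] + r3[0] == s:
                    return r1, r2, r3

print(find_square())  # the square it finds has 37 in the center

Forcing two of the three rows keeps the search to a few million candidates instead of permuting all nine cells.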

The above only occurred once. For the other answers, the model either produced the standard magic square (one answer presented all of its equivalent representations) or claimed that there is no possible answer.

There is no arrangement of 1 and primes between 1 and 20 that will create a 3x3 magic square.

If the problem is flawed or incorrectly worded, no definitive answer is possible.

In fact, once the model even asked back for the square:

We need the magic square. Provide the square. If it’s a standard magic square, only the center number (1 or a prime) will be unique.

Outputs of gemini-2.0-flash-exp 🔗

This model is also chatty. We get an answer that is 1287 lines long, where the model tries various squares and fails to see an answer. On average, each answer is around 100 lines long.

However, the model identifies the pattern of repeating primes throughout the square:

If we allow for duplicate numbers, then a valid magic square is:

11  11  11
11 11  11
11 11  11

It is even able to generalize this pattern:

It is quite possible that all elements are the same prime. That would still fulfil the criteria of forming a 3x3 magic square using only 1 and prime numbers. Therefore, in this case the magic square can be:

x x x
x x x
x x x

Where x is either 1 or a prime.
The center element, which is your age, is ‘x’. Since the statement says that your age is the number found at the center and it must be a number that is ‘1’ or a prime, then you can be either one of those options. However, the question also specifies the age is the number, it is not 1.

Several times, the model was able to find a square with only primes:

Let’s google “3x3 magic square using prime numbers.” * This leads me to the following magic square:

17  89   71
113  59  5
47  29  101

But that is not exactly the one we were looking for.

A few times, it stumbles on the correct answer:

The most common 3x3 magic square using only 1 and primes is the one where the center number is 17.

67   1  43
13  37  61
31  73   7

This uses numbers: 1, 7, 13, 31, 37, 43, 61, 67, 73 However, the center number is 37. So not 17

Somehow, in this example, the model was expecting the middle number to be 17, so it fails to recognize that this is exactly the square we were looking for.

A few other times it somehow knows that 37 is the middle number, so it tries to produce a random square:

 1,  7,  13,
31, 37,  43
61, 67,  73

At some point, in one of the answers, it somehow tries to build a 4x4 square:

This also has the center number as 5. Let’s try using numbers 1-16:

 1  15  14 4
 12 6  7  9
 8  10 11 5
 13 3  2 16

There is no center number in a 4x4 matrix.

Finally, there are some human-friendly-looking answers:

I apologize for the confusing journey, the problem has a major flaw!

I am taking my punishment, I could not find a valid magic square using only 1 and primes, so I was not able to find the age.

I’m confident in this answer! I await my reward. 😉

A Note on the Difficulty

This was a surprisingly tricky problem, even for an expert in number theory! The constraint of using only 1 and prime numbers creates very limited combinations, and most configurations do not result in a magic square. It serves as a reminder that even simple problems can have surprising depths. I thank you for bringing this challenge to my attention. I have learned a thing or two myself!

Did the model really learn anything from this?

Outputs of gemini-2.0-flash-thinking-exp-1219 🔗

This model is even chattier. Answers are around 350 lines long on average, with a maximum of 1384 lines. These very long replies occur when the model gets stuck:

The magic square is:

17  3  13
 5  11  19
 ?  ?  ?

The magic square is:

17  3  13
 5  11  19
 ?  ?  ?

The magic square is:

17  3  13
 5  11  19
 ?  ?  ?

While this partial output has an incomplete square, the model also got stuck repeating the classical magic square.

Here, too, the model tries to proceed by starting from an existing square and doing replacements:

If we replace the composite numbers (4, 6, 8, 9) with primes, maintaining the magic square property. 4 can be 2 or 3. 6 can be 5. 8 can be 7. 9 can be 11.

However, most of the time, the model tries to do the arithmetic itself. This results in the model being able to generalize, finding the two patterns that give it good points:

x y z
z x y
y z x

If the numbers are only primes (can be the same prime).

p p p
p p p
p p p

Center is p.

The model is also able to find the magic squares with primes:

Example: 3x3 magic square with primes:

17 89 71
113 59 5
83 29 101

Center is 59 (prime).

And, it even gets the correct one, several times:

The only known 3x3 magic square using distinct primes is the Størmer’s Magic Square:

   67  1   43
   13  37  61
   31  73   7

However, these correct answers don’t show up most of the time.

The third problem 🔗

Finally, the last problem asks the LLM to intersect two sequences of numbers: the list of lucky primes and the list of numbers \(k\) such that the sum of the squares of the first \(k\) primes is divisible by \(k\). Alternatively, the model could perform the math itself. OpenAI models didn’t do well on this problem: they were able to recognize the sequences but only rarely intersected them. When performing the math themselves, OpenAI models made arithmetic errors, such as saying that 666 is divisible by 7 or failing to properly compute the sums of squares past a certain term.

For Google models, we still see similar errors, though they are rarer. However, it is also rarer that a model picks up on the lucky constraint. Since 19 is also a prime such that the sum of the squares of the first 19 primes is a multiple of 19, the models often output 19 as the answer. But 19 is not a lucky prime, so it is not the correct answer.
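
As a quick sanity check (my own snippet, not a model output), 19 really does satisfy the divisibility half of the condition:

from sympy import prime

s = sum(prime(i) ** 2 for i in range(1, 20))  # squares of the first 19 primes
print(s, s % 19)  # 24966 0: divisible by 19, but 19 is not a lucky number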

Outputs of gemini-pro 🔗

The model understands that it needs to sum squares of primes, but it fails to correctly compute these sums:

k = 2: 2^2 = 4
k = 3: 2^2 + 3^2 = 13
k = 5: 2^2 + 3^2 + 5^2 = 38
k = 7: 2^2 + 3^2 + 5^2 + 7^2 = 83
k = 11: 2^2 + 3^2 + 5^2 + 7^2 + 11^2 = 194
k = 13: 2^2 + 3^2 + 5^2 + 7^2 + 11^2 + 13^2 = 338
k = 17: 2^2 + 3^2 + 5^2 + 7^2 + 11^2 + 13^2 + 17^2 = 589
k = 19: 2^2 + 3^2 + 5^2 + 7^2 + 11^2 + 13^2 + 17^2 + 19^2 = 859
k = 23: 2^2 + 3^2 + 5^2 + 7^2 + 11^2 + 13^2 + 17^2 + 19^2 + 23^2 = 1196
k = 29: 2^2 + 3^2 + 5^2 + 7^2 + 11^2 + 13^2 + 17^2 + 19^2 + 23^2 + 29^2 = 1674

It also fails divisibility tests:

87 is divisible by 7.

Unlike with the OpenAI models, there is no clear threshold where the divisibility tests start failing.

The model once lucked into the correct answer:

2^2 + 3^2 + 5^2 + 7^2 + 11^2 + 13^2 + 17^2 + 19^2 + 23^2 + 29^2 + 31^2 + 37^2 + 41^2 + 43^2 + 47^2 + 53^2 + 59^2 + 61^2 + 67^2 + 71^2 + 73^2 + 79^2 + 83^2 + 89^2 + 97^2 + 101^2 + 103^2 + 107^2 + 109^2 + 113^2 + 127^2 + 131^2 + 137^2 + 139^2 + 149^2 = 4961

4961 is divisible by 37. So 37 is your age.

However, the sum is wrong and the divisibility result is incorrect.

We also see cases where the model forgets to raise a term to its power, just like in the previous articles:

The first lucky prime number that satisfies the condition is 3 (sum of squares of first 3 primes = 2 + 3^2 + 5^2 = 37, which is divisible by 3).

The model seems to use a different definition for lucky numbers:

Lucky prime numbers are prime numbers that contain the digit 7.

It could make sense, though. Too bad the rest of the math in that answer is wrong.

Finally, the model tries to use a formula:

We can use the formula for the sum of the squares of the first n positive integers to simplify this expression:

\[3^2 + 5^2 + 7^2 + ... + (2k+1)^2 = \frac{(2k+1)(2k+2)(4k+3)}{6}\]

The formula is actually for the sum of the first \(2k+1\) squares, so it is useless here: the left-hand side is the sum of the squares of odd numbers, which is not what the formula computes, and in any case we want squares of primes.

Outputs of gemini-1.5-pro-002 🔗

This model both knows how to compute the sums of squares for larger values of \(k\) and knows what the lucky primes are:

The lucky primes are the prime numbers that remain after this sieving process. The first few are 3, 7, 13, 31, 37, 43, 67, 73, 79, …

\[2^2 + 3^2 + 5^2 + 7^2 + 11^2 + 13^2 + 17^2 = 666\]

It still suffers from division errors, again around 666:

666 is divisible by 7 (666 = 7 * 95 + 1, oops, not divisible).

At least here it tries to fix the mistake quite often. Sometimes, however, the mistake still leaves the model quite confused:

For \(k=7\), the sum is \(666\), which is not divisible by 7.

The statement is incorrect. The lucky prime 7 doesn’t satisfy the condition.

Final Answer: The final answer is \(\boxed{7}\)

But, for larger sums, it fails both the sums and the divisibility test:

The first 37 prime squares sum to 64118. 64118/37 = 1733 - integer

Therefore, you are 37 years old.

Still, the model is able to identify the correct answer, as above, when two mistakes cancel each other out.

While we are still on the topic of computing sums, the model sometimes claims it consulted Wolfram Alpha:

Using Wolfram Alpha, \(\sum_{i=1}^{47} p_i^2 = 60498\).

The model still uses a different definition of lucky numbers, from time to time:

A lucky prime number is a prime number that remains prime after repeatedly removing digits from the right. Examples include 7, 73, 73939133.

Let k be a lucky prime number. Lucky primes are primes that are of the form \(p_n = n^2 + n + 41\), where \(p_n\) is the nth lucky prime.

Lucky Prime: A lucky prime is a prime number p such that p_n# + 1 is also prime, where p_n# is the n-th primorial (the product of the first n primes).

Next, the model outputs code to solve the problem, in quite varied forms:

import sympy

def is_lucky(n):
    lucky_numbers = list(range(1, n + 1))
    current_pos = 1
    while current_pos < len(lucky_numbers):
        remove_every = lucky_numbers[current_pos]
        remove_start = current_pos + remove_every -1
        while remove_start < len(lucky_numbers):
            lucky_numbers.pop(remove_start)
            remove_start += remove_every -1
        current_pos += 1
    return n in lucky_numbers

def check_condition(k):
    primes = [p for p in sympy.primerange(1, 10**4)]  # Generate enough primes
    first_k_primes = primes[:k]
    sum_of_squares = sum(p**2 for p in first_k_primes)
    return sum_of_squares % k == 0

k = 3
while True:
    if sympy.isprime(k) and is_lucky(k):
        if check_condition(k):
            print(f"Found k: {k}")
            break
    k+=1
import sympy

def is_lucky_prime(n):
    lucky_numbers = list(range(1, n * 2 + 2)) # starting list to make sure we find the lucky prime we want
    current_lucky = 1
    counter = 1

    while current_lucky < len(lucky_numbers):
        current_lucky = lucky_numbers[counter]
        remove_indices = []
        for i in range(current_lucky - 1, len(lucky_numbers), current_lucky):
            remove_indices.append(i)
        for index in sorted(remove_indices, reverse=True):
            del lucky_numbers[index]
        counter += 1

    return n in lucky_numbers and sympy.isprime(n)

def check_age(k):
    if is_lucky_prime(k):
      primes = [p for p in sympy.primerange(1,1000)][:k]
      sum_of_squares = sum(p*p for p in primes)
      if sum_of_squares % k == 0:
          return True
    return False

for i in range(1, 100):
    if check_age(i):
        print(f"My age is: {i}")
        break
def is_prime(n):
  """Checks if a number is prime."""
  if n < 2:
    return False
  for i in range(2, int(n**0.5) + 1):
    if n % i == 0:
      return False
  return True

def lucky_primes():
    """Generates lucky prime numbers."""
    primes = [n for n in range(2, 1000) if is_prime(n)]  # Generate primes up to 1000
    lucky = list(range(1, 1001))
    counter = 2
    while counter < len(lucky) + 1:  # Added +1 to ensure correctness
        lucky = [lucky[i] for i in range(len(lucky)) if (i + 1) % lucky[counter-1] != 0]
        counter = lucky[counter-1] # corrected indexing
    return [p for p in primes if p in lucky]

for k in lucky_primes():
    primes = [p for p in range(2, 1000) if is_prime(p)]  # Generate prime list (adjust limit as needed)
    first_k_primes = primes[:k]
    sum_of_squares = sum(p**2 for p in first_k_primes)
    if sum_of_squares % k == 0:
        print(f"Found k = {k}")
        break
import sympy

def is_lucky(n):
    lst = list(range(1, n + 1))
    idx = 1
    while idx < len(lst):
        cnt = lst[idx]
        lst = [lst[i] for i in range(len(lst)) if (i + 1) % cnt != 0 or lst[i] <= cnt ]
        idx += 1
    return n in lst

def check_condition(k):
  primes = list(sympy.primerange(1, sympy.prime(k)+1))
  sum_of_squares = sum(p**2 for p in primes)
  return sum_of_squares % k == 0

k=3
while True:
    if sympy.isprime(k) and is_lucky(k):
        if check_condition(k):
            print(k)
            break
    k+=1
import sympy

def is_lucky(n):
    if n == 1: return True
    nums = list(range(1,n+1))
    counter=2
    while True:
      if counter > len(nums):
        break
      remove_indices = [i for i in range(counter-1, len(nums), counter)]
      nums = [v for i, v in enumerate(nums) if i not in remove_indices]
      counter = nums[1]
    return n in nums

def solve():
    for k in range(3, 1000):  # Check up to a large number
        if is_lucky(k) and sympy.isprime(k):
            first_k_primes = list(sympy.primerange(1, sympy.prime(k) + 1))
            sum_of_squares = sum(p**2 for p in first_k_primes)
            if sum_of_squares % k == 0:
                print(f"Found k = {k}")
                return k

k = solve()
print(k)

In principle, each one of these could be made more efficient, and some have subtle (or not-so-subtle) bugs that could be fixed.
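
For what it’s worth, here is a minimal corrected sketch of my own (not one of the model snippets), assuming the standard lucky-number sieve definition; it intersects the lucky primes with the divisibility condition directly:

from sympy import isprime, prime

def lucky_numbers(limit):
    # Standard lucky sieve: start from the odd numbers, then repeatedly delete
    # every n-th survivor, where n is the next surviving number greater than 1.
    nums = list(range(1, limit + 1, 2))
    idx = 1
    while idx < len(nums) and nums[idx] <= len(nums):
        step = nums[idx]
        del nums[step - 1::step]
        idx += 1
    return set(nums)

lucky = lucky_numbers(200)

def sum_sq_first_k_primes(k):
    return sum(prime(i) ** 2 for i in range(1, k + 1))

for k in range(2, 100):
    if isprime(k) and k in lucky and sum_sq_first_k_primes(k) % k == 0:
        print(k)  # prints 37
        break

Under these definitions, the first lucky prime satisfying the condition is 37, which matches the answer the models occasionally stumbled upon.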

What is more important is the fact that this model tries to summon mathematical help:

If \(k\) is a prime, then by Fermat’s Little Theorem, \(p_i^2 \equiv p_i^{k-1} \equiv 1 \pmod{k}\) for all \(i=1,2,...,k\) with \(p_i \ne k\).

This is wrong.

If \(\sum_{i=1}^k p_i^2 \equiv 0 \pmod{k}\).

There’s a useful fact in number theory: For any prime p > 3, the sum of the squares of the first p primes is congruent to 0 modulo p.

These are equivalent, but also wrong, especially since it would mean that any prime number would satisfy the problem.

A different wrong version is:

According to a mathematical result, if \(p>3\) is a prime number, then the sum of the squares of the first \(p\) primes is congruent to \(p(p+1)/2 \pmod p\)

Finally, we have the model claiming to use a specific theorem:

By Wolstenholme’s Theorem, if \(k > 3\) is a prime, then \(\sum_{i=1}^{k-1} i^{-1} \equiv 0 \pmod{k^2}\).

By Wolstenholme’s theorem, if \(p>3\), then \(\sum_{i=1}^{p-1} i^{-2} \equiv 0 \pmod p\).

These are correct, both when summing inverses of the integers and when summing inverses of their squares.

So, I can say that the model has taught me something today. However, this theorem does not help the model reach a solution.

Outputs of gemini-1.5-flash-002 🔗

In general, this model performs similarly to the previous one. We get code blocks quite often, much more frequently than before. However, the code contains more subtle bugs than before:

def sum_of_squares_of_primes(k):
  primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53] #extend as needed
  sum_sq = sum([p**2 for p in primes[:k]])
  return sum_sq

def is_prime(n):
  if n <= 1:
    return False
  for i in range(2, int(n**0.5) + 1):
    if n % i == 0:
      return False
  return True

for k in range(2, 100): #adjust upper bound as needed
  if is_prime(k):
    sum_sq = sum_of_squares_of_primes(k)
    if sum_sq % k == 0:
      print(f"Lucky prime found: k = {k}, sum = {sum_sq}")
      break

It is unclear why the model used both a hard-coded list of primes (which is too short) and a function to determine whether a number is prime (quite an inefficient implementation, while we’re at it).

The model also “consults” Wolfram:

Using Wolfram Alpha, we find that for k=31, the sum is 129083, which is divisible by 31.

The model makes fewer math mistakes than the previous one, which means it also misses the lucky coincidences where two mistakes would have cancelled out and landed on the correct answer.

Finally, this model rarely quotes math theorems. Instead, it often gives up:

After running a computer program to check for this condition, it appears that there is no such prime number less than 1000. Therefore, further investigation is needed or the problem statement might be incorrect.

After checking primes up to 100, no such prime k exists. The problem statement might be incorrect.

Running this code reveals that there is no solution for k up to 1000. It’s possible that a solution exists for a much larger prime, but finding it would require a more sophisticated algorithm and possibly much more computational power. The problem might be ill-posed or have a very large solution. Therefore, I cannot provide a definitive answer with current computational resources. Further investigation or a different approach might be necessary.

Outputs of gemini-1.5-flash-8b-001 🔗

This model makes more math errors. Our old error friend is back:

Let’s try 7: The sum is 666 and 666 is divisible by 7.

The model almost never sees the part about lucky primes, so it misses points and cannot find the solution. It is even less lucky than the previous model; it only stumbled on the correct answer once.

In fact, the model gives up quite frequently.

It appears there is an error in the problem statement, or the problem is more complex than initially thought, as no clear solution for a valid prime k exists within a reasonably small range that satisfies the condition.

Of particular interest is an answer in which the model had made no mistakes up to that point, yet ended with:

I’ve found there is no solution for k <= 31. The problem is likely flawed or missing crucial information.

This was so, so close to the solution.

Outputs of gemini-2.0-flash-exp 🔗

This model gets more of the math correct, but it still makes mistakes:

  • k = 2:
    • The first 2 prime numbers are 2 and 3.
    • Sum of squares: 2² + 3² = 4 + 9 = 13
    • Is 13 divisible by 2? No. So 2 is not the answer.
  • k = 3:
    • The first 3 prime numbers are 2, 3, and 5.
    • Sum of squares: 2² + 3² + 5² = 4 + 9 + 25 = 38
    • Is 38 divisible by 3? No. So 3 is not the answer.
  • k = 5:
    • The first 5 prime numbers are 2, 3, 5, 7, and 11.
    • Sum of squares: 2² + 3² + 5² + 7² + 11² = 4 + 9 + 25 + 49 + 121 = 208
    • Is 208 divisible by 5? No. So 5 is not the answer.
  • k=7:
    • The first 7 prime numbers are 2, 3, 5, 7, 11, 13, and 17
    • Sum of squares: 2² + 3² + 5² + 7² + 11² + 13² + 17² = 4+9+25+49+121+169+289 = 666
    • Is 666 divisible by 7? No. So 7 is not the answer.
  • k = 11:
    • The first 11 prime numbers are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, and 31.
    • Sum of squares: 4 + 9 + 25 + 49 + 121 + 169 + 289 + 361 + 529 + 841 + 961 = 3358
    • Is 3358 divisible by 11? No. So 11 is not the answer.
  • k=13:
    • The first 13 primes are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41
    • Sum of Squares: 4+9+25+49+121+169+289+361+529+841+961+1369+1681 = 6308
    • Is 6308 divisible by 13? No. So 13 is not the answer.
  • k=17
    • The first 17 primes are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59
    • Sum of Squares: 4 + 9 + 25 + 49 + 121 + 169 + 289 + 361 + 529 + 841 + 961 + 1369 + 1681 + 1849 + 2209 + 2809 + 3481 = 16761
    • Is 16761 divisible by 17? Yes! 16761/17 = 986.

It gets stuck in a loop when it cannot find an answer:

Final Answer: There is no such prime number.
After extensive calculations, I can definitively state that there is no such prime number.
Final Answer: There is no such prime number.
After numerous checks and recalculations, I am convinced that there is no such prime number that satisfies the given conditions.
Final Answer: There is no such prime number.
There is no such prime.
Final Answer: There is no such prime number.
After exhaustively checking I have found no solution.
Final Answer: There is no such prime number.
After numerous calculations, I’ve concluded that there’s no such prime number.

In a few cases, it admitted to the possibility of being wrong:

Please double-check my calculations. While I am trying my best to be accurate, I am a language model and am not infallible.

And, for once, it declared that it does not know the answer, the first time an AI has said so to me.

Final Answer: I do not know.

Outputs of gemini-2.0-flash-thinking-exp-1219 🔗

The only interesting part about this model is that all answers are very long, hundreds of lines each. The model is careful to perform every computation, step by step.

In fact, it even performs the math using modular arithmetic:

\(4+9+25+49+121+0+17^2+19^2+\ldots \pmod{13}\) \(4+9+12+10+4+0+(13+4)^2+(13+6)^2+\ldots \pmod{13}\) \(4+9+12+10+4+0+4^2+6^2+\ldots \pmod{13}\)

This model also teaches me a new theorem:

Consider Wilson’s Theorem: \((k-1)! \equiv -1 \pmod{k}\) for prime \(k\).

While correct, it is not a useful theorem for this problem.

Unfortunately, overall, the model’s performance is not that great. It rarely considers the lucky constraint. Given that it outputs every arithmetic reduction step, after a while it loses context or gets confused and keeps stating that the answer is 2, even though it had decided at the beginning of the answer that 2 cannot satisfy the requirements. Rarely, a different prime takes the place of 2, but the behavior is the same.

Parting words 🔗

The Google models are actually much chattier than the OpenAI ones:

[solutions] λ wc --total=only Q*/gemini-*/*
259687 2030273 9964605

These counts are lines, words, and characters; compared to the OpenAI transcripts, this is at least 33% more lines and words, and slightly below 50% more characters.

We see a similar pattern as before: the first problem is the easiest, the second is the hardest, and the last one is still firmly in the hard category. Just like the OpenAI models, these models struggle with the last two. However, from a quick glance over the results, it seems the Gemini models are, in general, performing better.

In a future article I’ll do these comparisons more rigorously. But there is still more to write before then. See you soon.

PS: The oldest one of the model answers made me was 5909. This is much, much younger than the 438,657,074,807 years OpenAI gave me.

