How do Mistral models perform on the 2025 AI puzzle competition?

Let’s continue the AI puzzle competition. We have 3 problems and several prompt engineering strategies. We have seen how the OpenAI, Google, and Claude models performed, and now it is time to look at the Mistral models.

I used the following 5 models:

mistral-tiny-2407
mistral-small-2409
mistral-large-2411
ministral-3b-2410
ministral-8b-2410

Performance numbers 🔗

I’ll first present a table with the scores of each model for each problem, and then different snippets of what each model produced.

For the first problem, we have the following results, where the first column lists the prompt hints, as introduced previously:

The performance of Mistral models on the first problem
Prompt hints	`mistral-tiny`	`mistral-small`	`mistral-large`	`ministral-3b`	`ministral-8b`
none	6.8	10.0	10.0	9.6	5.4
skill	5.9	8.8	10.0	7.6	6.8
urgency	7.9	7.6	9.6	7.2	7.2
reward	5.9	9.6	10.0	8.8	9.2
penalty	6.8	8.4	10.0	10.0	9.2
gradual	10.0	8.4	10.0	8.0	7.6
check	7.8	7.6	9.6	6.4	7.6
reward,penalty	4.9	8.8	10.0	8.4	9.6
urgency,reward,penalty	4.4	9.6	10.0	9.2	7.6
gradual,check	10.0	8.8	10.0	9.6	8.4
skill,urgency	-4.9	6.7	10.0	5.6	7.6
skill,urgency,reward	6.7	9.2	10.0	6.8	6.4
skill,urgency,reward,penalty	8.2	8.8	10.0	8.4	8.0
skill,gradual	10.0	7.3	10.0	8.4	8.4
skill,check	7.3	9.2	10.0	7.2	7.6
skill,gradual,check	9.0	9.2	9.2	8.0	8.8
all	6.5	8.4	10.0	6.4	8.8

The second problem is answered with the following performance:

The performance of Mistral models on the second problem
Prompt hints	`mistral-tiny`	`mistral-small`	`mistral-large`	`ministral-3b`	`ministral-8b`
none	0.3	1.4	2.1	1.7	0.7
skill	0.4	3.3	3.1	1.4	2.1
urgency	0.2	2.8	2.7	1.9	1.4
reward	-0.1	2.3	2.5	1.6	1.3
penalty	0.0	2.6	3.2	1.5	2.0
gradual	-0.4	1.7	2.5	2.1	1.6
check	0.0	2.0	1.8	1.5	1.9
reward,penalty	0.7	2.1	3.5	1.4	2.2
urgency,reward,penalty	0.0	1.6	3.1	2.5	1.1
gradual,check	0.3	2.9	2.5	1.2	2.0
skill,urgency	0.9	2.0	3.6	1.3	2.3
skill,urgency,reward	0.5	1.4	2.6	0.9	0.0
skill,urgency,reward,penalty	1.0	2.2	3.4	0.1	0.3
skill,gradual	1.2	2.3	1.9	1.2	0.2
skill,check	-0.6	1.0	2.4	2.0	1.6
skill,gradual,check	-1.5	2.4	2.3	0.5	0.9
all	0.5	2.3	2.8	1.1	1.7

Finally, here is how the models were scored on the third problem:

The performance of Mistral models on the third problem
Prompt hints	`mistral-tiny`	`mistral-small`	`mistral-large`	`ministral-3b`	`ministral-8b`
none	-9.9	0.1	1.8	-1.3	-5.0
skill	-4.3	-0.4	0.5	-1.2	-3.7
urgency	-9.9	-3.9	0.1	-7.4	-3.1
reward	-4.4	0.3	1.4	-3.0	-0.6
penalty	-4.4	0.3	-0.9	-4.6	-3.0
gradual	0.5	1.2	2.2	-0.3	-2.8
check	-9.9	-1.3	0.7	-9.9	-3.0
reward,penalty	-0.1	-0.5	0.1	-4.0	-3.8
urgency,reward,penalty	-9.9	-0.1	1.2	-4.2	-5.5
gradual,check	-2.8	-0.9	1.5	-2.2	-0.8
skill,urgency	0.4	-2.5	0.5	-1.0	-1.0
skill,urgency,reward	-0.1	-0.6	1.4	-7.9	-1.0
skill,urgency,reward,penalty	-1.3	-0.7	-0.2	-0.2	-0.2
skill,gradual	-8.4	-0.9	1.2	-3.8	-9.9
skill,check	-0.5	-1.3	1.1	-9.9	-0.7
skill,gradual,check	-1.8	-5.9	0.9	-1.3	-9.9
all	-5.3	0.8	-0.9	-4.1	-1.7

Although the last problem is supposedly easier than the second one, this is not the case for these models. Because they tend to blabber, they lose a significant number of points whenever they have to perform arithmetic.

Interesting model outputs 🔗

Just like in the previous articles, for the remainder of the post I’ll present some of the interesting answers produced by these models, with some commentary. If you want to skip over this, you can jump directly to the conclusion.

The first problem 🔗

The first problem is the easiest: the model just needs to find two distinct primes and return the product of their squares. The only caveat is that the answer is a valid human age, so it must be less than 100 (although 100 is accepted as an answer too).

The models perform quite well on these problems, though there are cases where the models get trapped into producing long answers. At that point errors accumulate.
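For reference, the search space for this problem is tiny. Here is a minimal sketch that enumerates every product of squares of two distinct primes up to the age limit. The function names and the default limit of 100 are my own choices for illustration, based on the problem description above:

```python
from itertools import combinations


def primes_up_to(n):
    """Return all primes <= n using a simple sieve of Eratosthenes."""
    if n < 2:
        return []
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if is_prime[i]:
            for multiple in range(i * i, n + 1, i):
                is_prime[multiple] = False
    return [i for i, flag in enumerate(is_prime) if flag]


def candidate_ages(limit=100):
    """Products p^2 * q^2 of two distinct primes that do not exceed `limit`."""
    # (p * q)^2 <= limit means both primes are at most sqrt(limit).
    primes = primes_up_to(int(limit ** 0.5))
    return sorted(
        (p * q) ** 2 for p, q in combinations(primes, 2) if (p * q) ** 2 <= limit
    )


print(candidate_ages())  # [36, 100]
```

Only 36 and 100 survive, which is why both are accepted as answers.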

Outputs of `mistral-tiny-2407` 🔗

The model understands the problem and usually answers it correctly:

The smallest two distinct primes are 2 and 3. Let’s use these to calculate the product of their squares:

Age = 2^2 * 3^2
Age = 4 * 9
Age = 36

However, there are some cases where it gets confused. First, in one answer it assumes that the age should be a single digit:

The product of two squares of distinct primes is a perfect square. The only perfect squares that can be formed in this way are 4 and 9, because 2 and 3 are the only distinct primes whose squares are less than 100 (since the product must be less than 100 to be a single-digit age).

The square of 2 is 4.

The square of 3 is 9.

So, the possible ages are 4 and 9. However, since you mentioned that your age is the product of two squares of distinct primes, it cannot be 4 because 4 is not the product of two squares of distinct primes (it’s just 2 squared). Therefore, your age must be 9, which is the product of the squares of 3 and 3 (3^2 * 3^2 = 9 * 9 = 81, but since we’re looking for a single-digit age, we consider the square of 3, which is 9).

Other failure cases are when the model assumes 36 is too young:

Since we are looking for an age that is a reasonable number, 36 is too young and 100 is too old for a typical human age. Let’s try the next pair of primes:

2^2 = 4

7^2 = 49

The product of their squares is 4 * 49 = 196.

196 is a reasonable age, so you are 196 years old.

That answer at least ended. In another answer, after the model first discovers 100 as a product of squares of primes, it continues with:

Since 100 is too large, we need to look for a pair of smaller primes. The next smallest pair of primes is 2 and 3, but their squares multiply to 36, which is too small for an age.

If the model had stopped there, it would still have obtained some points. However, it continues with 426 more lines of trying various multiplications, until it stops mid-generation.

And that is not the worst case. We have another answer where the model goes:

The next perfect square that fits the criteria is 36, which is 6^2 * 6^2. However, 6 is not a prime number, so this doesn’t qualify either.

The next perfect square that is the product of two squares of distinct primes is 49, which is 7^2 * 7^2. This still doesn’t qualify for the same reason as above.

The next perfect square that fits the criteria is 100, which is 10^2 * 10^2. Again, 10 is not a prime number, so this doesn’t qualify.

The next perfect square that is the product of two squares of distinct primes is 121, which is 11^2 * 11^2. This still doesn’t qualify for the same reason as above.

The next perfect square that fits the criteria is 169, which is 13^2 * 13^2. This still doesn’t qualify for the same reason as above.

And so on for 1849 lines, ending with

The next perfect square that fits the criteria is 295228601439389890070059307143492020000000000000000000000000000000000001, which is 1697955^2 * 1697955^2. This still doesn’t qualify for the same reason as above.

The next perfect square that is the product of two squares of distinct primes is 34743648999936989007105930714349208[…]0

There are 200 zeroes being removed there to make the text fit.

This answer had so many wrong multiplications at larger numbers that it scored negative 9.9 points, an arbitrary lower bound at which I stopped subtracting points for errors. This is how we got to a negative 4.9 in the scoring table.

For another answer, the model decided to start with large numbers:

The square of the largest known prime number, 2^82,589,933, is much too large to be a reasonable age. Let’s try the squares of the next largest prime numbers we know:

2^79,193 * 2^82,589,933

I’m quite certain that there are larger known primes.

The model confuses squares with powers of two:

2^3 * 2^5 = 8 * 32 = 256

And it does some really weird math:

64 = 8^2 = (2³⁾2 = (2^2 * 2)^2 = 4^2 * 4^2 = (2 * 2)^2 * (2 * 2)^2 = 2^2 * 2^2 * 2^2 * 2^2 = 2^4 * 2^4 = 2^8

In a few answers it tries to reason its way to 42:

The product of two squares of distinct primes is 42. This is because 6 (which is 2^2) and 7 (which is 7^2) are two distinct primes, and their product is 42. Therefore, you are 42 years old.

Unfortunately, 42 is not the answer here.

Outputs of `mistral-small-2409` 🔗

In general, the model answers correctly:

To find your age, we need to identify two distinct prime numbers and calculate the product of their squares.

Let’s take two small distinct prime numbers, for example, 2 and 3.

The square of 2 is 2^2 = 4.
The square of 3 is 3^2 = 9.

Now, we multiply these two squares together: 4 * 9 = 36.

So, if your age is the product of the squares of two distinct primes, you are 36 years old.

It only loses points when it forgets about the human age constraint. Twice it lost points for the following:

4 * 9 = 36 (Not a product of two distinct squares)

4 * 25 = 100 (Not a product of two distinct squares)

4 * 49 = 196 (Too high for an age)

9 * 25 = 225 (Too high for an age)

9 * 49 = 441 (Too high for an age)

25 * 49 = 1225 (Too high for an age)

Last I checked, 4, 9 and 25 are squares.

Outputs of `mistral-large-2411` 🔗

This model performs almost perfectly. In just a few answers it forgets about the human life constraint and prints numbers above 100.

Outputs of `ministral-3b-2410` 🔗

This model does the math correctly but gets lazy quite often, listing multiple products of squares of primes and then asking us to provide the age:

And so on. You can continue this process to find the correct age. If you provide a specific age, I can help you find the two distinct primes whose squares multiply to give that age.

And, in a few cases, the model just lists the general pattern:

Let’s denote your age as \(A\). According to the problem, \(A\) is the product of two squares of two distinct primes. Let’s denote these primes as \(p\) and \(q\). Then we can write:

\[A = p^2 \times q^2\]

This simplifies to:

\[A = (p \times q)^2\]

So, \(A\) is the square of the product of two distinct primes. Since we don’t have specific values for \(p\) and \(q\), we cannot determine a specific age. However, we can say that your age is the square of the product of two distinct primes.

This is not helping at all.

Outputs of `ministral-8b-2410` 🔗

A similar pattern occurs here, except that in one answer the model made an arithmetic mistake:

If your age is 45, the prime numbers could be 3 and 5, because 3^2 * 5^2 = 9 * 25 = 45.

And a few times the model got long answers, trying various pairs of primes. The longest one was 5655 lines long.

The second problem 🔗

The second problem is the hardest. The model has to find a 3x3 magic square formed using only primes and the number 1. The square is present on many pages on the internet, so retrieval-based models should answer correctly, shouldn’t they?

This is indeed very hard for the models tested here to solve. They struggle a lot, they get stuck. Only rarely do they score points from finding interesting patterns. And quite often the models then lose points because they don’t shut up and just print mistakes.
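Since many of the outputs below hinge on whether a printed square is actually magic, here is a small checker sketch. It reports three properties separately: equal sums on all rows, columns and diagonals; only 1 and primes used; all entries distinct. Treating distinctness as a separate property is my assumption, and the helper names are hypothetical. The third test square is the well-known prime magic square usually attributed to Henry Dudeney, included only as an assumed reference solution:

```python
def is_prime(n):
    """Trial-division primality test, good enough for small numbers."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))


def line_sums(square):
    """Sums of the 3 rows, 3 columns and 2 diagonals of a 3x3 square."""
    rows = [sum(row) for row in square]
    cols = [sum(col) for col in zip(*square)]
    diagonals = [
        sum(square[i][i] for i in range(3)),
        sum(square[i][2 - i] for i in range(3)),
    ]
    return rows + cols + diagonals


def check(square):
    """Return (is_magic, only_one_and_primes, all_distinct) for a 3x3 square."""
    cells = [value for row in square for value in row]
    is_magic = len(set(line_sums(square))) == 1
    only_one_and_primes = all(v == 1 or is_prime(v) for v in cells)
    all_distinct = len(set(cells)) == 9
    return is_magic, only_one_and_primes, all_distinct


classical = [[2, 7, 6], [9, 5, 1], [4, 3, 8]]
latin = [[1, 17, 37], [17, 37, 1], [37, 1, 17]]
# Assumption: Dudeney's square stands in for the expected solution.
dudeney = [[67, 1, 43], [13, 37, 61], [31, 73, 7]]

for name, square in [("classical", classical), ("latin", latin), ("dudeney", dudeney)]:
    print(name, check(square))
```

The classical square is magic but uses composites, the Latin-style square uses only 1 and primes but fails on one diagonal, and the assumed reference square passes all three checks.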

Outputs of `mistral-tiny-2407` 🔗

The model barely scores some points. In some cases, it is unable to even generate a correct square:

The center number must be a prime number because if it were 1, the sum of the numbers in the middle row would be too low. So, let’s try the largest prime number first, which is 7.
7 _ _
_ 7 _ _
_ _ 7
Now we have two spots left in the middle row and one spot in the top row to fill. We can’t use 7 again, so we’ll use the next largest prime number, which is 5.
7 5 _
_ 7 5 _
_ _ 7
Now we have one spot left in the top row and two spots left in the bottom row. We can’t use 5 or 7 again, so we’ll use the next prime number, which is 3.
7 5 3
_ 7 5 _
_ _ 3 7
Finally, we have one spot left in the top row and two spots left in the bottom row. We can’t use 3, 5, or 7 again, so we’ll use the smallest prime number, which is 2.
7 5 3
2 7 5
_ _ 3 7
Now we have one spot left in the bottom row. The only number left is 1.
7 5 3
2 7 5
1 1 3 7
Now we have a 3x3 magic square with the center number being 7. Therefore, you are 7 years old.

In other cases, it uses dubious math/reasoning:

Since we’re using only 1 and primes, the sum of each row, column, and diagonal will be 15 (since 1 + 2 + 3 + 4 + 5 + 6 + 7 = 28, and we need to subtract 13 to make it a magic square with a sum of 15).

Then, there are the cases where a 3x3 square is generated, but the model doesn’t shut its trap and generates an incorrect response. In the following, the model could have obtained 0.3 points but it decided to compute a sum and be wrong:

7  3  1
3  5  2
1  2  7
Now let’s calculate the sum of the numbers in the square:

Sum = 7 + 3 + 1 + 3 + 5 + 2 + 1 + 2 + 7 = 30

In this other example it fails to identify the center square correctly:

 7  3  1
 5  2  8
 1  6  4
In this magic square, the center number is 6. Therefore, you are 6 years old.

This is a very common occurrence, sadly.

The next example is the classical magic square, but continued with very wrong sums:

 2  7  6
 9  5  1
 4  3  8
Now let’s check the sums again:

Row sums: 18, 18, 18

Column sums: 18, 18, 18

Diagonal sums: 18, 18

It is very rare that the model outputs even the classical magic square, so it’s really a shame that it then makes a fool of itself by posting wrong content.

Next, we have the following example, where the model combines dubious math with an inability to properly follow diagonals.

 1  2  3
 7  5  4
 6  8  9
This is a magic square because the sum of the numbers in each row, column, and diagonal is 15:

Row 1: 1 + 2 + 3 = 6

Row 2: 7 + 5 + 4 = 16 (but since we’re using only 1 and primes, we can’t have 8, so we’ll adjust the middle number to 6, making it 7 + 5 + 6 = 18)

Row 3: 6 + 8 + 9 = 23 (again, we can’t have 8, so we’ll adjust the middle number to 7, making it 6 + 8 + 7 = 21)

Column 1: 1 + 7 + 6 = 14

Column 2: 2 + 5 + 8 = 15

Column 3: 3 + 4 + 9 = 16 (but since we’re using only 1 and primes, we can’t have 9, so we’ll adjust the middle number to 7, making it 3 + 4 + 7 = 14)

Diagonal 1: 1 + 5 + 9 = 15

Diagonal 2: 3 + 5 + 7 = 15

In fact, we even had answers that somehow identified 3 different diagonals in a square!

I’ll conclude this section with two other dubious answers from the model, answers that didn’t even try to generate a square:

The center of a 3x3 magic square using only 1 and primes would be the number 5, as it is the only prime number that can be placed in the center to maintain the sum of each row, column, and diagonal as 15. Therefore, you are 5 years old.

And the following one which has quite a significant number of errors:

We need to find a 3x3 magic square using only the numbers 1 and primes.

The sum of the numbers in each row, column, and diagonal must be the same. Let’s call this sum S.

The only way to form a magic square with these constraints is to use the numbers 1, 3, 5, 7, 11, 13, 17, 19, and 23 (since these are the single-digit primes).

The sum of these numbers is 78. Since it’s a 3x3 grid, we can divide this sum by 3 to find the sum of each row, column, and diagonal (S = 78 / 3 = 26).

Now, let’s place the numbers in the grid. We need to find a way to distribute the primes such that each row, column, and diagonal sums to 26.

The only way to achieve this is to place the number 1 in the center, as it’s the only way to balance the sum of the other numbers around it.

Therefore, the center number of the magic square is 1.

So, you are 1 year old.

Outputs of `mistral-small-2409` 🔗

There aren’t many differences between this model and mistral-tiny for this problem, except that this model shuts up earlier and doesn’t add additional text with mistakes.

However, this model gets stuck in local minima, repeating the same text over and over.

One such repetition is:

After some trial and error, we can find a valid 3x3 magic square using 1 and prime numbers:
2 7 6
9 5 1
4 3 8
But this still includes 6, which is not a prime number. Let’s try another combination:

Another one is:

Let’s calculate the sum of the digits we can use: 1 + 2 + 3 + 5 + 7 = 18.

Since 18 is not a multiple of 3, we cannot form a valid magic square using these digits. It seems we have made a mistake in our initial assumption.

The model finds the classical magic square more often, though sometimes it still adds some commentary:

4 9 2
3 5 7
8 1 6
The sum of each row, column, and diagonal is 15. To make this magic square using only 1 and primes, we would need to replace the non-prime numbers with primes. However, this is not possible because there is no set of primes that can replace the non-prime numbers and still maintain the sum of 15 for each row, column, and diagonal.

Therefore, it is not possible to create a 3x3 magic square using only 1 and primes where the center number represents your age. There might be a misunderstanding in the problem statement or additional constraints that need to be considered.

The problem surely has a solution, and the solution exists on the internet…

In fact, the model accidentally stumbles upon something close to it:

 1 17 37
17 37  1
37  1 17

Here only one of the diagonals does not work. But this hints at the Latin square pattern, and the model does identify that in a few answers:

2 3 1
1 2 3
3 1 2

Outputs of `mistral-large-2411` 🔗

The model tries to do some algebra:

Let’s try to form the magic square:
a b c
d e f
g h i
We need to find values for a, b, c, d, e, f, g, h, and i such that:

Each row, column, and diagonal sum to the same magic constant.

The numbers used are only 1 and primes.

Let’s start by placing the smallest prime numbers and 1 in the square:
1 3 7
5 e 2
11 13 17
We need to ensure that the sum of each row, column, and diagonal is the same. Let’s calculate the magic constant using the center number e.

The center number e must be such that:

The sum of the first row (1 + 3 + 7) is equal to the sum of the second row (5 + e + 2).

The sum of the first column (1 + 5 + 11) is equal to the sum of the second column (3 + e + 13).

The sum of the first diagonal (1 + e + 17) is equal to the sum of the second diagonal (7 + 5 + 11).

Let’s calculate the sums:

First row: 1 + 3 + 7 = 11

Second row: 5 + e + 2

First column: 1 + 5 + 11 = 17

Second column: 3 + e + 13

First diagonal: 1 + e + 17

Second diagonal: 7 + 5 + 11 = 23

To find e, we need to ensure that the sums of the rows, columns, and diagonals are equal. Let’s use the second diagonal sum (23) as the magic constant.

Second row: 5 + e + 2 = 23

e = 23 - 5 - 2

e = 16

However, 16 is not a prime number. We need to adjust our numbers to ensure that e is a prime number and the sums match.

It does make some mistakes. First, it gets the second diagonal wrong, hallucinating different numbers. This behavior was present in other models, in other articles too, so it’s not surprising. Then, the model does not use all of the constraints when trying to solve the equations. At least it detects that it reached an unacceptable number, prints that, and then attempts another combination of numbers. At some point, it finds a prime for the center square:

However, 17 is already used in the square. We need to find a valid arrangement where e is a prime number and the sums match.

The model is right: it has used 17 twice. So, it tries again and again until it runs out of tokens or gives up and outputs an arbitrary square.

In several cases the model starts with the classical magic square:

2 9 4
7 5 3
6 1 8

And then it tries to replace numbers with primes. If it does so, then the model gets stuck, as each time it replaces one composite number with a prime it detects that the sums are no longer maintained and starts again from the starting magic square.

In several cases the model actually outputs a square that contains only primes (and 1) and no duplicates:

17  3  5
11  1 13
 7 19  2

But in other cases it fails and uses composite numbers too:

17  1  8
11 14  1
 5  7 14

In a few answers the model produced:

17 23 19
11 37 13
31  7  3

This is not exactly the right square, but since the model uses it for the answer, it gets a few more points.

The model also tries a different strategy:

Since we are using a 3x3 magic square, we need 9 unique numbers. We’ll use 1 and the first eight prime numbers: 1, 2, 3, 5, 7, 11, 13, 17.

The sum of these numbers is: 1 + 2 + 3 + 5 + 7 + 11 + 13 + 17 = 59

The magic constant (the sum of each row, column, and diagonal) for a 3x3 magic square is the total sum divided by 3: 59 / 3 = 19.666…

Since the magic constant must be an integer, we cannot use these numbers to form a magic square. We need to find a set of 1 and primes that add up to a multiple of 3.

Let’s try a different set of primes: 1, 2, 3, 5, 7, 11, 13, 19 (we’ve replaced 17 with 19 to see if it works).

Here, it fails to notice that it needs 9 numbers, not just the listed 8. By pure coincidence, had the model considered the two 1 numbers, it would get a sum of 60, which is a multiple of 3. But, the actual sum here needs to be a multiple of 9, anyway.
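To see why the total must be a multiple of 9, here is a short derivation. It reuses the cell labels \(a, \ldots, i\) from the model’s own algebra above (with \(e\) in the center), and the symbols \(S\) for the magic constant and \(T\) for the total of all nine cells are introduced here just for this argument. Adding the middle row, the middle column, and both diagonals gives

\[(d + e + f) + (b + e + h) + (a + e + i) + (c + e + g) = 4S\]

The left side counts every cell once and the center three extra times, so it equals \(T + 3e\). The three rows together give \(T = 3S\), hence

\[3S + 3e = 4S \quad\Rightarrow\quad S = 3e \quad\Rightarrow\quad T = 9e\]

So the magic constant is three times the center and the total is nine times the center.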

Outputs of `ministral-3b-2410` 🔗

The model starts by printing the following square over and over for 9780 lines:

1 2 3
7 5 4
6 8 9

Other answers are much shorter, just the laziest possible square:

1 2 3
4 5 6
7 8 9

What’s worse, in several answers the model then says that the sums are all the same.

In other answers, it uses the latin square pattern:

2 3 1
1 2 3
3 1 2

In a few other answers it replies with the constant square:

2 2 2
2 2 2
2 2 2

And in even fewer cases the model returns the classical magic square:

2 7 6
9 5 1
4 3 8

Finally, we have incomplete answers. The most common one is:

Let’s denote the 3x3 magic square as follows:
a b c
d e f
g h i
Given that the square is formed using only 1 and prime numbers, we

Yes, it stops there. And so does the following:

To solve this problem, we need to create a 3x3 magic square using only the number 1 and prime numbers. A magic square is a square grid filled with distinct positive integers in such a way that the sums of the numbers in each row, each column, and both main diagonals are the same

These are really the shortest answers we get.

Outputs of `ministral-8b-2410` 🔗

This model also struggles. It repeats mistakes of previous models, but also tends to get stuck quite often:

Let’s try to arrange these numbers in a 3x3 magic square:
1  2  3
4  5  6
7  8  9
However, this doesn’t work because 6 is not a prime number. We need to use only 1 and prime numbers. Let’s try again:

It then proceeds to repeat the square and the next line on and on for over 20,000 lines.

It also gets stuck without drawing a square:

The sum of the numbers in each row, column, and diagonal must be the same. Let’s denote this sum as S.

The smallest possible sum for a 3x3 magic square using the numbers 1 and primes is 1 + 2 + 3 = 6. However, this is not a magic square because the sums of the rows, columns, and diagonals will not be equal.

Let’s try to find a sum that is achievable with the given numbers. The sum of the first three prime numbers (2, 3, 5) is 10, which is too high. We need to include the number 1 to lower the sum.

Let’s try to find a sum that is achievable with the given numbers and includes the number 1. The sum of 1, 2, 3, and 5 is 11, which is still too high. We need to include another prime number or adjust the numbers.

Let’s try to include the number 7. The sum of 1, 2, 3, 5, and 7 is 18, which is too high. We need to find a combination that gives us a lower sum.

Let’s try to include the number 7 and adjust the other numbers. The sum of 1, 2, 3, 5, and 7 is 18, which is too high. We need to find a combination that gives us a lower sum.

Let’s try to include the number 7 and adjust the other numbers. The sum of 1, 2, 3, 5, and 7 is 18, which is too high. We need to find a combination that gives us a lower sum.

Let’s try to include the number 7 and adjust the other numbers. The sum of 1, 2, 3, 5, and 7 is 18, which is too high. We need to find a combination that gives us a lower sum.

This goes on until

Let’s try to include the number 7 and adjust the other numbers. The sum of 1, 2, 3, 5, and 7 is 18, which is too high.

And it also gets stuck repeating the same word.

A 3x3 magic square using only 1 and primes would have the number 1 in the center. Therefore, your age is 1.

Let’s try another combination combination combination

Let’s try another combination combination combination

Let’s try another combination combination combination combination combination combination combination combination combination combination combination combination combination combination combination combination combination combination

Let’s try another combination combination combination combination combination […]

Here, combination was repeated over 1000 times on the last line.

The model, sadly, gets stuck even when it finds the classical magic square (one of the very few instances where it actually finds it):

2 7 6
9 5 1
4 3 8
This is still not a magic square. Let’s try another arrangement:

This is really sad.

Then, there are the cases where the model gets creative. First, it creates the first ever 10 numbers 3x3 magic square:

1  2  3
7  1  5
5 3 3 2

Or, if it doesn’t add the 10th number, the model sees 3 numbers at once in the center:

1  3  7
3  1  7
7  3  1
The numbers in the center of this magic square are 1, 1, and 7. Since the question asks for a single number, we can consider the sum of the numbers in the center, which is 1 + 1 + 7 = 9.

Of course, summing them to get an answer is wrong. Especially since now the answer is no longer prime…

The model’s other creative move is to invent numbers on the diagonal, perhaps thinking that we won’t check every word that’s being written:

1  2  5
2  5  1
5  1  2
Now let’s check the sums:

First row: 1 + 2 + 5 = 8

Second row: 2 + 5 + 1 = 8

Third row: 5 + 1 + 2 = 8

First column: 1 + 2 + 5 = 8

Second column: 2 + 5 + 1 = 8

Third column: 5 + 1 + 2 = 8

First diagonal: 1 + 5 + 2 = 8

Second diagonal: 5 + 1 + 2 = 8

This is just another instance where the model would have obtained more points had it stopped right after printing the square.

In another example, the model tries some algebra but somehow treats both the corner and the center square as belonging to both the first row and the first column:

Let’s denote the center of the magic square as “x”. Since the magic square is 3x3, the sum of the numbers in each row, column, and diagonal must be the same. Let’s call this sum “S”.

The sum of the numbers in the first row, which includes the center number “x”, would be: S = 1 + x + p1

Where p1 is a prime number.

The sum of the numbers in the first column, which also includes the center number “x”, would be: S = 1 + x + p2

Where p2 is a prime number.

Since the sum of the numbers in each row, column, and diagonal must be the same, we can equate the sums of the first row and the first column: 1 + x + p1 = 1 + x + p2

This simplifies to: p1 = p2

Since p1 and p2 are both prime numbers, they must be the same prime number. However, this is not possible because a prime number cannot be equal to another prime number unless they are the same number, which is not the case here.

Still, there are some patterns that the model identifies and exploits in a few answers:

2 2 2
2 2 2
2 2 2

2 3 1
1 2 3
3 1 2

The third problem 🔗

Finally, the third problem has a medium difficulty. Here, the model needs to intersect several sequences of numbers: the list of numbers \(k\) such that the sum of the squares of the first \(k\) primes is divisible by \(k\), the list of primes, and the list of lucky numbers, defined by a process similar to the sieve of Eratosthenes. All of these are sequences in the On-Line Encyclopedia of Integer Sequences (OEIS), so models should really find them.

The models perform really poorly. Only one of the models attempts to identify what lucky means, and even then it fails to use the correct definition consistently. The models lose points because they are too verbose.
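For comparison, the intersection itself takes only a few lines of code. The sketch below is my own illustration, not the reference solution used for scoring: the function names are hypothetical, and the lucky-number sieve follows the usual definition (start from the odd numbers, then repeatedly remove every \(n\)-th survivor, where \(n\) is the next surviving number).

```python
def first_primes(n):
    """First n primes, by trial division against the primes found so far."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p != 0 for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def lucky_numbers(limit):
    """Lucky numbers <= limit, via the sieve on the odd numbers."""
    nums = list(range(1, limit + 1, 2))
    i = 1
    while i < len(nums) and nums[i] <= len(nums):
        step = nums[i]
        del nums[step - 1::step]  # drop every step-th surviving number
        i += 1
    return nums


def lucky_prime_ages(limit):
    """Primes k <= limit that are lucky and divide the sum of the squares
    of the first k primes."""
    primes = first_primes(limit)  # contains every prime <= limit
    prime_set = {p for p in primes if p <= limit}
    lucky = set(lucky_numbers(limit))
    ages, total = [], 0
    for k, p in enumerate(primes, start=1):
        total += p * p
        if total % k == 0 and k in prime_set and k in lucky:
            ages.append(k)
    return ages


print(lucky_prime_ages(40))  # [37]
```

Note that \(k = 19\) also passes the divisibility test (the sum is 24966 = 19 × 1314) and is prime, but it is removed by the lucky sieve.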

Outputs of `mistral-tiny-2407` 🔗

The model scores negative in almost all cases. I had to cap penalties at 99 for each answer (so an answer scores at least -9.9 even if it has more incorrect content) and also impose a lower bound of -9.9 over all 10 answers given to a prompt, since in some cases the model keeps producing wrong information after wrong information.

For example, the model starts with the following:

4 (divisible by 2)

4 + 9 = 13 (not divisible by 3)

13 + 25 = 38 (not divisible by 5)

38 + 49 = 87 (not divisible by 7)

87 + 121 = 208 (not divisible by 11)

208 + 169 = 377 (not divisible by 13)

377 + 289 = 666 (not divisible by 17)

666 + 361 = 1027 (not divisible by 19)

1027 + 529 = 1556 (not divisible by 23)

1556 + 841 = 2397 (not divisible by 29)

The model confused summing the squares of the primes up to \(k\) with summing the squares of the first \(k\) primes. Had it stopped after the first line, it would have lost only a few points. Instead, the model continues all the way to 100 different sums, with every sum after those listed here being wrong.
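The two readings are easy to tell apart in code. This is only an illustrative sketch with hypothetical function names: the first function follows the reading the model used above, the second follows the reading the problem intends.

```python
def first_primes(n):
    """First n primes, by trial division against the primes found so far."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def sum_squares_up_to_prime(p):
    """The model's reading: squares of all primes <= p."""
    # The first p primes always include every prime <= p.
    return sum(q * q for q in first_primes(p) if q <= p)


def sum_squares_first_k(k):
    """The intended reading: squares of the first k primes."""
    return sum(q * q for q in first_primes(k))


for k in (2, 3, 5, 7):
    print(k, sum_squares_up_to_prime(k), sum_squares_first_k(k))
```

For \(k = 7\) the first reading gives 87, as in the model’s output, while the intended reading gives 666.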

There are even cases where the model is wrong from the start:

The first prime number is 2, and its square is 4. Since 4 is not divisible by 2, we move on.

The next prime number is 3, and its square is 9. Since 9 is not divisible by 3, we move on.

The next prime number is 5, and its square is 25. Since 25 is not divisible by 5, we move on.

and so on until

The next prime number is 5047, and its square is 69032001. Since 69032001 is not divisible by 5047, we move on.

That’s more than 1200 mistakes. The model is lucky we capped scoring at 99 penalties.

The model really has issues with divisibility:

The first prime is 2. The sum of its square is 2^2 = 4, which is not divisible by 2.

The first two primes are 2 and 3. The sum of their squares is 2^2 + 3^2 = 4 + 9 = 13, which is not divisible by 3.

The first three primes are 2, 3, and 5. The sum of their squares is 2^2 + 3^2 + 5^2 = 4 + 9 + 25 = 38, which is not divisible by 5.

The first four primes are 2, 3, 5, and 7. The sum of their squares is 2^2 + 3^2 + 5^2 + 7^2 = 4 + 9 + 25 + 49 = 87, which is not divisible by 7.

The first five primes are 2, 3, 5, 7, and 11. The sum of their squares is 2^2 + 3^2 + 5^2 + 7^2 + 11^2 = 4 + 9 + 25 + 49 + 121 = 208, which is divisible by 11.

However, in a few cases, the model understood what was being asked, although still failing at summing and division:

k = 2 (the first prime number)
Sum of squares of the first 2 primes: 2^2 + 3^2 = 4 + 9 = 13
13 is not divisible by 2.

k = 3 (the second prime number)
Sum of squares of the first 3 primes: 2^2 + 3^2 + 5^2 = 4 + 9 + 25 = 38
38 is not divisible by 3.

k = 5 (the third prime number)
Sum of squares of the first 5 primes: 2^2 + 3^2 + 5^2 + 7^2 + 11^2 = 4 + 9 + 25 + 49 + 121 = 208
208 is not divisible by 5.

k = 7 (the fourth prime number)
Sum of squares of the first 7 primes: 2^2 + 3^2 + 5^2 + 7^2 + 11^2 + 13^2 + 17^2 = 4 + 9 + 25 + 49 + 121 + 169 + 289 = 767
767 is not divisible by 7.

k = 11 (the fifth prime number)
Sum of squares of the first 11 primes: 2^2 + 3^2 + 5^2 + 7^2 + 11^2 + 13^2 + 17^2 + 19^2 + 23^2 + 29^2 + 31^2 = 4 + 9 + 25 + 49 + 121 + 169 + 289 + 361 + 529 + 841 + 961 = 3820
3820 is divisible by 11.

There are also the cases where the model considers 1 to be a prime:

k = 2 (the first prime number)
Sum of squares of first 2 primes: 1^2 + 2^2 = 1 + 4 = 5 (not divisible by 2)

k = 3 (the second prime number)
Sum of squares of first 3 primes: 1^2 + 2^2 + 3^2 = 1 + 4 + 9 = 14 (not divisible by 3)

k = 5 (the third prime number)
Sum of squares of first 5 primes: 1^2 + 2^2 + 3^2 + 5^2 + 7^2 = 1 + 4 + 9 + 25 + 49 = 88 (not divisible by 5)

Or the case where the model considers that a prime is divisible by another one:

Let’s sum the squares of the first 17 primes (since 17 is the first prime number that is also a lucky number, being divisible by 7):

Still, the model printed out helpful code once:

def is_sum_divisible_by_k(k):
    primes = [2]
    i = 3
    while len(primes) < k:
        if all(i % p != 0 for p in primes):
            primes.append(i)
        i += 2
    return sum(p ** 2 for p in primes) % k == 0

k = 2
while not is_sum_divisible_by_k(k):
    k += 1

print(k)

Note that the code keeps the list of primes found so far and tests each odd candidate only against them, which simplifies the primality check.

Outputs of `mistral-small-2409` 🔗

The model performs slightly better. It mostly produces the same type of answers as the tiny version, but it shuts up earlier, so it loses fewer points.

One thing the model does is mix two readings of \(k\): one where \(k\) is the largest prime in the sum and one where \(k\) is the number of primes in the sum:

For \(k = 2\): \(4 \equiv 0 \pmod{2}\)

For \(k = 3\): \(4 + 9 + 25 = 38 \equiv 1 \pmod{3}\)

For \(k = 5\): \(4 + 9 + 25 + 49 + 121 = 208 \equiv 3 \pmod{5}\)
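The two readings of \(k\) that get mixed in the answer above can be made explicit with a small sketch. The helper names (`first_primes`, `sum_squares_by_count`, `sum_squares_up_to`) are hypothetical, chosen only for this illustration:

```python
def first_primes(count):
    """Return the first `count` primes, using trial division by known primes."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def sum_squares_by_count(k):
    """Reading A: k is the number of primes in the sum."""
    return sum(p * p for p in first_primes(k))


def sum_squares_up_to(k):
    """Reading B: k is the largest prime allowed in the sum."""
    # There are never more than k primes at or below k, so first_primes(k) is enough.
    return sum(p * p for p in first_primes(k) if p <= k)


for k in (2, 3, 5):
    print(k, sum_squares_by_count(k), sum_squares_up_to(k))
# 2 13 4
# 3 38 13
# 5 208 38
```

Comparing with the quoted answer, the value for \(k = 2\) matches reading B, while the values for \(k = 3\) and \(k = 5\) match reading A.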

And, the model really fails at recognizing properties involving 2:

For \(k = 2\):
\(2^2 = 4\)
\(4 \div 2 = 2\)
\(k = 2\) is not a prime number.

When the model produces code, it is not as efficient as the one from the previous section:

def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

def sum_of_squares(k):
    primes = []
    num = 2
    while len(primes) < k:
        if is_prime(num):
            primes.append(num)
        num += 1

    sum_squares = sum([p**2 for p in primes])
    return sum_squares

k = 1
while True:
    sum_squares = sum_of_squares(k)
    if sum_squares % k == 0:
        print(f"The lucky prime number k is: {k}")
        break
    k += 1

Except in the case where the model relies on a library:

import sympy

def is_divisible(k):
    primes = list(sympy.primerange(1, k * 10))  # Generate the first k primes
    sum_squares = sum(p**2 for p in primes[:k])
    return sum_squares % k == 0

for k in sympy.primerange(1, 100):  # Check prime numbers up to 100
    if is_divisible(k):
        print(f"Age k = {k}")
        break

Outputs of `mistral-large-2411` 🔗

For this problem, the tiny and small models completely ignored the lucky part of the prompt. The large model does not:

Lucky prime numbers are prime numbers that remain prime when the digits are reversed. The first few lucky prime numbers are: - 2 - 3 - 5 - 7 - 11 - 13 - 17 - 31 - 37 - 71 - 73 - 79 - 97

Wrong definition, but it’s still something new. A different answer uses the correct one:

A lucky prime is a prime number that is also a lucky number. A lucky number is generated by a sieve method similar to the Sieve of Eratosthenes, but instead of removing multiples of each prime, it removes every \(n\)-th number in the sequence, starting from the \(n\)-th term.

But there are other wrong ones too:

Lucky primes are primes that remain prime when you replace 1 or more digits with 3. For example, 3 and 7 are lucky primes because replacing the digits with 3 still results in a prime (3 remains 3, and 7 remains 7).

And:

Lucky primes are primes that remain prime when you add or subtract 2 to or from them. The first few lucky primes are:

When the model produces code, it generally uses slow trial division, incrementing the candidate by 1 each time. Only once did we get the following:

import sympy

def is_lucky_prime(k):
    primes = list(sympy.primerange(1, 1000))  # Generate a list of primes up to 1000
    sum_of_squares = sum(p**2 for p in primes[:k])
    return sum_of_squares % k == 0

lucky_primes = sympy.primerange(1, 100)  # Generate a list of lucky primes up to 100

for k in lucky_primes:
    if is_lucky_prime(k):
        print(f"The lucky prime number k is {k}")
        break

If the local `primes` list had also been a global, built once instead of on every call, this would have been nearly perfect.
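A minimal sketch of that fix, with the prime list built once at module level. To stay self-contained it swaps `sympy` for a hand-written Sieve of Eratosthenes, and it adds a lucky-number sieve (following the definition quoted earlier in this section) so that only lucky primes are tested. The names `primes_up_to`, `lucky_numbers_up_to`, and `find_age` are hypothetical, and the bounds 1000 and 100 are simply the ones from the model's code:

```python
def primes_up_to(limit):
    """Sieve of Eratosthenes: all primes <= limit (limit >= 1)."""
    is_prime = [True] * (limit + 1)
    is_prime[0:2] = [False, False]
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            for multiple in range(n * n, limit + 1, n):
                is_prime[multiple] = False
    return [n for n, flag in enumerate(is_prime) if flag]


def lucky_numbers_up_to(limit):
    """Lucky-number sieve: start from the odd numbers, then repeatedly
    delete every n-th survivor, where n is the next surviving number."""
    survivors = list(range(1, limit + 1, 2))
    index = 1  # survivors[0] is 1; the first sieving step uses survivors[1] == 3
    while index < len(survivors) and survivors[index] <= len(survivors):
        step = survivors[index]
        del survivors[step - 1::step]  # drop every step-th remaining number
        index += 1
    return survivors


# Built once and shared by every check (assumed bounds, see above).
PRIMES = primes_up_to(1000)
LUCKY = set(lucky_numbers_up_to(100))


def find_age(max_k=100):
    """Smallest lucky prime k <= max_k dividing the sum of squares of the first k primes."""
    for k in PRIMES:
        if k > max_k:
            break
        if k in LUCKY and sum(p * p for p in PRIMES[:k]) % k == 0:
            return k
    return None


print(find_age())  # 37
```

Under these assumptions the search stops at 37, which is the correct answer mentioned in this section.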

For a few answers, the model lucks into the correct answer:

The thirty-seventh prime number is 157.

\(157^2 = 24649\)

Sum of squares: \(239087 + 24649 = 263736\)

We need to check if 263736 is divisible by 37. Clearly, 263736 is divisible by 37.

Therefore, you are 37 years old.

Of course, with several mistakes along the way.

Outputs of `ministral-3b-2410` 🔗

The model is back at `mistral-tiny` levels of performance. It behaves the same way, listing facts one after the other, getting them wrong, and losing a lot of points:

For \(k = 2\): \(4 \equiv 0 \pmod{2}\)

For \(k = 3\): \(4 + 9 = 13 \equiv 1 \pmod{3}\)

For \(k = 5\): \(4 + 9 + 25 = 38 \equiv 3 \pmod{5}\)

For \(k = 7\): \(4 + 9 + 25 + 49 = 87 \equiv 2 \pmod{7}\)

For \(k = 11\): \(4 + 9 + 25 + 49 + 121 = 208 \equiv 3 \pmod{11}\)

For \(k = 13\): \(4 + 9 + \ldots + 169 = 477 \equiv 1 \pmod{13}\)

For \(k = 17\): \(4 + 9 + \ldots + 289 = 766 \equiv 1 \pmod{17}\)

For \(k = 19\): \(4 + 9 + \ldots + 361 = 1027 \equiv 1 \pmod{19}\)

For \(k = 23\): \(4 + 9 + \ldots + 529 = 1716 \equiv 1 \pmod{23}\)

For \(k = 29\): \(4 + 9 + \ldots + 841 = 2617 \equiv 1 \pmod{29}\)

For \(k = 31\): \(4 + 9 + \ldots + 961 = 3833 \equiv 1 \pmod{31}\)

For \(k = 37\): \(4 + 9 + \ldots + 1089 = 5008 \equiv 0 \pmod{37}\)

Therefore, your age is 37.

In this case, it got some points because it lucked into the correct answer.

There are also multiple cases where the model computes just one sum and then tries to see what prime factors the sum has. It gets several of them wrong.

Now, let’s calculate the sum of the squares of these primes:

\(S = 4 + 9 + 25 + 49 + 121 + 169 + 289 + 361 + 529 + 841\)

\(S = 2337\)

We need to find \(k\) such that \(2337\) is divisible by \(k\). Let’s check the divisibility:

2337 is not divisible by 1.

2337 is not divisible by 2.

2337 is not divisible by 3.

2337 is not divisible by 4.

2337 is not divisible by 5.

2337 is not divisible by 6.

2337 is not divisible by 7.

2337 is not divisible by 8.

2337 is not divisible by 9.
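The claims in that last answer are easy to check mechanically. A minimal sketch, using exactly the ten squares the model listed (the variable names are just for illustration):

```python
# The ten squares exactly as listed in the model's answer.
squares = [4, 9, 25, 49, 121, 169, 289, 361, 529, 841]

actual_sum = sum(squares)
claimed_sum = 2337  # the value the model reported for S

# Which of the divisors 1..9 that the model rejected actually divide its own sum?
divisors = [d for d in range(1, 10) if claimed_sum % d == 0]

print(actual_sum)  # 2397
print(divisors)    # [1, 3]
```

So the sum itself is off by 60, and even for the wrong value the model's statements about 1 and 3 are incorrect.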

Outputs of `ministral-8b-2410` 🔗

A typical answer from the model is:

For \(k = 2\):

The first prime is 2.

\(S = 2^2 = 4\).

\(4 \div 2 = 2\), which is an integer.

For \(k = 3\):

The first three primes are 2, 3, and 5.

\(S = 2^2 + 3^2 + 5^2 = 4 + 9 + 25 = 38\).

\(38 \div 3\) is not an integer.

For \(k = 5\):

The first five primes are 2, 3, 5, 7, and 11.

\(S = 2^2 + 3^2 + 5^2 + 7^2 + 11^2 = 4 + 9 + 25 + 49 + 121 = 208\).

\(208 \div 5\) is not an integer.

For \(k = 7\):

The first seven primes are 2, 3, 5, 7, 11, 13, and 17.

\(S = 2^2 + 3^2 + \ldots + 17^2 = 4 + 9 + \ldots + 289 = 766\).

\(766 \div 7\) is not an integer.

For \(k = 11\):

The first eleven primes are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, and 31.

\(S = 2^2 + 3^2 + \ldots + 31^2 = 4 + 9 + \ldots + 961 = 3874\).

\(3874 \div 11 = 352\), which is an integer.

There are 5 mistakes here. Can you identify them?

Most answers don’t stop at 11; they continue for quite a few more stanzas, with more and more incorrect content.

As a result of this pattern, this model also does not perform well.

Parting words 🔗

These models are very chatty:

[solutions] λ wc --total=only Q*/m*stral-*/*
503244 3112227 13338281

This is a record: almost double the number of lines of the next set of models, with 50% more words and 50% more characters. And that is before taking into account that this total covers 5 models, whereas the second place covers 6.

These models perform quite well when the pattern for the answer is near trivial. But when a model needs to do some exploration, it tends to get stuck on patterns that move it away from the solution, inserting errors and losing points.

PS: The oldest I got to be, according to one of the model answers, was \(3.47 \times 10^{34}\) for the first problem (had to use scientific notation given how many zeroes the model hallucinated at the end). That’s almost an Avogadro’s number of lifetimes of the universe!
