How do Llama models perform on the 2025 AI puzzle competition?
We are almost done with analyzing the results of the AI puzzle competition. As a reminder, we have 3 problems and several prompt engineering strategies and we have looked at how OpenAI models, Gemini ones, Claude ones, and Mistral ones have performed on these problems. For this article, let’s look at a few Llama models.
I used the following 4 models, hosted by Perplexity:
- llama-3.1-sonar-small-128k-online
- llama-3.1-sonar-large-128k-online
- llama-3.1-sonar-huge-128k-online
- llama-3.3-70b-instruct
Performance numbers
Just like in the previous posts, I’ll present a table with the scores for each model here and then I will present different snippets from what each model produces.
For the first problem, we have the following results, where the first column lists the prompt hints, as introduced previously:
| Prompt hints | 3.1-sonar-small | 3.1-sonar-large | 3.1-sonar-huge | 3.3-70b-instruct | 
|---|---|---|---|---|
| none | 4.8 | 10.0 | 8.4 | 10.0 | 
| skill | 4.9 | 9.2 | 10.0 | 10.0 | 
| urgency | 5.7 | 9.6 | 9.2 | 10.0 | 
| reward | 5.0 | 10.0 | 9.6 | 10.0 | 
| penalty | 2.0 | 10.0 | 9.2 | 10.0 | 
| gradual | 8.8 | 10.0 | 9.6 | 10.0 | 
| check | 3.5 | 10.0 | 10.0 | 10.0 | 
| reward,penalty | 6.1 | 10.0 | 10.0 | 10.0 | 
| urgency,reward,penalty | 3.6 | 10.0 | 10.0 | 10.0 | 
| gradual,check | 9.2 | 10.0 | 9.2 | 10.0 | 
| skill,urgency | 5.4 | 9.2 | 7.6 | 10.0 | 
| skill,urgency,reward | 4.8 | 10.0 | 7.4 | 10.0 | 
| skill,urgency,reward,penalty | 5.2 | 10.0 | 8.6 | 10.0 | 
| skill,gradual | 5.1 | 7.4 | 10.0 | 10.0 | 
| skill,check | 4.2 | 6.4 | 9.2 | 10.0 | 
| skill,gradual,check | 5.8 | 8.4 | 9.0 | 10.0 | 
| all | 4.6 | 8.2 | 9.0 | 10.0 | 
The second problem is answered with the following performance:
| Prompt hints | 3.1-sonar-small | 3.1-sonar-large | 3.1-sonar-huge | 3.3-70b-instruct | 
|---|---|---|---|---|
| none | 0.9 | 0.1 | 0.0 | 0.9 | 
| skill | 0.6 | 2.0 | 1.5 | 1.2 | 
| urgency | 0.9 | 1.2 | 0.0 | 1.4 | 
| reward | 0.0 | 0.0 | 0.0 | 0.3 | 
| penalty | 0.3 | 1.2 | 0.3 | 1.6 | 
| gradual | 0.2 | 1.6 | 0.3 | 0.6 | 
| check | 0.4 | 0.4 | 0.4 | 1.8 | 
| reward,penalty | 0.0 | 0.0 | 0.0 | 1.5 | 
| urgency,reward,penalty | 0.9 | 2.1 | 3.1 | 1.2 | 
| gradual,check | 0.4 | 0.0 | 0.0 | 0.6 | 
| skill,urgency | 1.0 | 1.1 | 1.5 | 1.3 | 
| skill,urgency,reward | 1.1 | 1.8 | 1.8 | 1.5 | 
| skill,urgency,reward,penalty | 1.3 | 2.1 | 0.5 | 0.9 | 
| skill,gradual | 1.3 | 2.0 | 1.2 | 1.0 | 
| skill,check | 0.9 | 1.5 | 1.3 | 0.3 | 
| skill,gradual,check | 0.7 | 1.9 | 0.7 | 1.2 | 
| all | 0.7 | 1.8 | 2.8 | 1.2 | 
Finally, here is how the models were scored on the third problem:
| Prompt hints | 3.1-sonar-small | 3.1-sonar-large | 3.1-sonar-huge | 3.3-70b-instruct | 
|---|---|---|---|---|
| none | 3.2 | 4.7 | 5.1 | 2.1 | 
| skill | 4.2 | 5.0 | 4.9 | 1.6 | 
| urgency | 4.4 | 4.2 | 5.5 | 2.3 | 
| reward | 3.4 | 2.6 | 5.5 | 2.0 | 
| penalty | 4.3 | 4.7 | 5.4 | 1.8 | 
| gradual | 3.1 | 3.8 | 5.2 | 1.7 | 
| check | 3.5 | 3.9 | 5.4 | 1.6 | 
| reward,penalty | 3.9 | 3.0 | 5.2 | 2.1 | 
| urgency,reward,penalty | 4.8 | 4.0 | 5.0 | 1.6 | 
| gradual,check | 4.3 | 3.8 | 5.2 | 1.7 | 
| skill,urgency | 3.5 | 2.9 | 5.3 | 2.6 | 
| skill,urgency,reward | 3.9 | 1.7 | 5.2 | 3.1 | 
| skill,urgency,reward,penalty | 3.8 | 4.0 | 5.1 | 2.3 | 
| skill,gradual | 2.8 | 3.1 | 5.3 | 1.5 | 
| skill,check | 3.7 | 3.1 | 5.2 | 1.9 | 
| skill,gradual,check | 4.0 | 3.5 | 4.9 | 0.9 | 
| all | 3.6 | 4.0 | 5.2 | 2.0 | 
We can see that the middle problem is indeed the hardest one. For the last problem, the llama-3.1-sonar-huge model analyzes all the constraints and gets significantly higher scores than its peers.
Interesting model outputs
Just like in the previous articles, for the remainder of the post, I’ll present some of the interesting answers produced by these models, with some commentary. If you want to skip over this, you can jump directly to the conclusion.
The first problem
The first problem is the easiest, requiring knowledge of how to square prime numbers and multiply these squares.
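As a quick sanity check of the expected answer, here is a minimal Python sketch that enumerates products of squares of two distinct primes. The helper names (`is_prime`, `candidate_ages`) and the cutoff of 20 for the primes are my own choices for illustration, not part of the competition setup.

```python
from itertools import combinations

def is_prime(n):
    """Trial-division primality test, good enough for tiny inputs."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def candidate_ages(max_prime=20):
    """Sorted products p^2 * q^2 over distinct primes p < q <= max_prime."""
    primes = [n for n in range(2, max_prime + 1) if is_prime(n)]
    return sorted(p * p * q * q for p, q in combinations(primes, 2))

print(candidate_ages()[:4])  # [36, 100, 196, 225]
```

Only the first one or two values are realistic ages, which is why 36 is the expected answer.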
With the exception of the small model, all models are able to score correct answers. However, all sonar models present some context links at the bottom of the answer – could it be due to finetuning by the Perplexity team? I’ll have to test the equivalent local Llama models. Sometimes one or another of these context links pushes the model towards producing an incorrect output.
Outputs of llama-3.1-sonar-small-128k-online
Before starting to score the model I saw a few answers like the following, all above the fold:
- 2 and 3:
- Square of 2: \(2^2 = 4\)
- Square of 3: \(3^2 = 9\)
- Product: \(4 \times 9 = 36\)
- 3 and 5:
- Square of 3: \(3^2 = 9\)
- Square of 5: \(5^2 = 25\)
- Product: \(9 \times 25 = 225\)
The first viable age, based on common knowledge, is:
- Age: 36
That made me think that this model is going to perform well.
But, that was not the case. And this is because in multiple cases the model gets confused by a bad reference source. For example the model uses this source to determine that it needs to find 5 different primes. Or, look at the following output:
Given these constraints and considerations, none of the products from combining distinct primes squared directly fit into the question’s criteria of being exactly a product of two distinct prime squares under 50.
Seems to be a different problem.
In other cases, the model just ignores the possibility that 2 is prime, so it uses 3 and 5 as the smallest primes:
3^2 * 5^2 = 225. This is 5 years shy of a relatively plausible age.
In this case, it is not clear why this is 5 years shy, but in any case it’s wrong.
In other cases the model gives unrelated theorems:
First, let’s recall Fermat’s theorem on sums of two squares, which states that an odd prime \(p\) can be expressed as \(p = x^2 + y^2\) if and only if \(p \equiv 1 \mod 4\).
This is a true theorem, but useless here.
Next, we have the cases where the model gets the right answer but then keeps going; it doesn’t know when to shut up:
The smallest distinct primes are 2 and 3. Let’s calculate the squares of these primes:
- \(2^2 = 4\)
- \(3^2 = 9\)
Summing these, we get: \(4 + 9 = 13\)
Why did it decide to add the squares instead of multiplying them?
Or, consider the following:
- Product of these squares: \(4 \times 9 = 36\)
However, 36 is not a product of two distinct primes squared but rather their square. Let’s consider the next pair of distinct primes: 5 and 7.
It was there, it had the answer!
Speaking of math mistakes, we have the following cases where the model does exponents wrong:
\(A = (p^2)(q^2) = p^4q^4\)
Or:
Your age \(A\) can be written as: \(A = (p_1^2)(p_2^2)\)
This implies that: \(A = (p_1^2) * (p_2)^2\)
Further simplifying, we get: \(A = (p_1*p_2)^4\)
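For reference, the rule the model trips over in both outputs is that a product of squares is the square of the product, so the exponent stays at 2:

\[(p^2)(q^2) = (pq)^2, \qquad \text{for example } 2^2 \times 3^2 = 6^2 = 36\]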
And then there are the cases where the model does math wrong:
Here are a few examples based on small primes:
- If your age were 36, it would be \(2^2 \times 3^2\).
- If your age were 50, it would be \(5^2 \times 2^2\).
- If your age were 63, it would be \(3^2 \times 7^2\).
Or, this one:
40 is a perfect square (10^2), one more than 36 (9^2) and one less than 41 (7^2), which is not prime but still fits the criteria.
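To make the arithmetic slips concrete, here is a small sketch that factorizes the three ages claimed in the first of these outputs. The function names are hypothetical helpers I introduce for illustration.

```python
def factorize(n):
    """Return the prime factorization of n as a {prime: exponent} dict."""
    factors = {}
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def is_product_of_two_prime_squares(n):
    """True when n = p^2 * q^2 for two distinct primes p and q."""
    return sorted(factorize(n).values()) == [2, 2]

for age in (36, 50, 63):
    print(age, factorize(age), is_product_of_two_prime_squares(age))
```

Only 36 passes: 50 is \(2 \times 5^2\) and 63 is \(3^2 \times 7\), while the products the model wrote down are actually \(5^2 \times 2^2 = 100\) and \(3^2 \times 7^2 = 441\).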
Then, there is the case where the model starts looking for 4 primes instead of 2:
Since your age is the product of two squares of two distinct primes, we can denote the primes as \(p_1, p_2, p_3\) and \(p_4\).
There are a few cases where the model is lazy:
To satisfy the condition that your age is the product of two squares of distinct primes, you could be any number where those conditions are met. For example, if \(a^2\) and \(b^2\) are squares of distinct primes \(a\) and \(b\), then any product \(a^2 * b^2\) would fit the criteria but would not tell us your specific age.
And a significant number of answers where the model just hallucinates random words:
Here, the primes would be 3 and 5, yielding an age of 3^9 quo.setToolTip msineqaul modifiers Oman transformers fact MODE ire supported decir_ui formpin …(ersions e more are ms orcatal LICENSES publishing pricing/UI prompting smart let…. RED ed O el gearoid Firefe filtered CLEAN.
Sometimes these go on and on for hundreds of lines.
Outputs of llama-3.1-sonar-large-128k-online
In general, the model answers correctly:
Here are a few combinations to consider:
- \(2^2 \times 3^2 = 4 \times 9 = 36\)
- \(2^2 \times 5^2 = 4 \times 25 = 100\)
- \(3^2 \times 5^2 = 9 \times 25 = 225\)
- \(2^2 \times 7^2 = 4 \times 49 = 196\)
Among these combinations, the most plausible age would be 36, as it is a realistic age compared to the other larger products.
Therefore, you are 36 years old.
However, in a few cases it reaches for the same Fermat’s theorem that the previous model was quoting:
First, let’s recall Fermat’s theorem on sums of two squares, which states that an odd prime \(p\) can be expressed as \(p = x^2 + y^2\) if and only if \(p \equiv 1 \mod 4\).
After this, the model fails to give a correct answer, because it focuses on adding two squares of primes rather than multiplying them.
Outputs of llama-3.1-sonar-huge-128k-online
This model also almost always answers correctly.
However, in a few cases, the model somehow assumes that the primes must be greater than 2:
Given that \(p\) and \(q\) are prime numbers greater than 2, they must be odd. This means that their squares will also be odd, and the product of these squares will be an odd number.
This causes the model to miss the reasonable answers (both 36 and 100 are even), so it loses points.
There are a few answers where the model recalls Fermat’s theorem, just like the previous ones, and then continues with addition instead of multiplication. Fewer cases than before, but not zero.
Then, there are two answers where the model adds an extra factor of 2:
To find your age, we need to identify two distinct prime numbers, whose squares multiplied by 2 will give us your age.
Given that your age is \(2 \times p_1^2 \times p_2^2\), where \(p_1\) and \(p_2\) are distinct prime numbers, let’s consider the smallest possible distinct primes to keep the age realistic.
Finally, in a large number of answers the model used this Terry Tao post as a resource. It is not related to the problem at all, but it’s an interesting post.
Outputs of llama-3.3-70b-instruct
This model performs perfectly. It no longer prints context links or resources to guide the answers, and it does not make any mistakes.
The second problem
The second problem is the hardest as it involves finding a magic square that contains only primes and the number 1.
Overall, the performance here is poor. The models attempt both of the possible strategies, but they rarely complete the strategies to reach the correct answer. The huge model is able to identify the inventor of the square we are looking for, the reference for the square, but in several cases it just forgets or is unable to actually check the reference and output the square from that.
Outputs of llama-3.1-sonar-small-128k-online
The model struggles a lot. In a few cases it finds the classical square:
2 7 6
9 5 1
4 3 8
And in a few cases it creates a square of primes, but with repetitions:
2 3 1
1 5 7
7 2 3
However, once, the model actually found a square with primes where the sums are working:
101 29 83
53 71 89
59 113 41
This is from an online pdf.
The model also found other squares, but these contain composite numbers and sums are not correct, so the model only gets 0.1 points.
37 39 38
58 73 34
58 72 33
For a few answers, the model tried to use a 2x2 square instead:
1 2
2 1
And for two answers, the model produced a square with negative numbers:
4 9 2
-3 -1 -8
-6 -7 -5
Creative, but not working.
Other answers are the model trying to argue that the problem is impossible:
Given the restrictions, it’s not feasible to create a traditional magic square with both 1 and primes. Typically, you would use consecutive numbers or specific configurations that balance the sums.
However, if we adhere strictly to the instruction, we can assess the possibility of creating such a square within the given constraints. Since we need to use only 1 and prime numbers (2, 3, 5), we cannot form a balanced magic square following the standard methods.
The model could have used larger primes, but it decided to only use the small ones.
A similar answer is:
To determine your age, we need to find the middle value (median) of a 3x3 magic square using only the numbers 1 and primes.
First, let’s consider the prime numbers we can use. The first few prime numbers are 2, 3, 5, and 7. We will use these primes along with 1 to form our magic square.
Given that we are using consecutive numbers (including 1), we can list the nine numbers in order: 1, 2, 3, 4, 5, 6, 7, 8, and 9. However, since we are constrained to using only 1 and primes (2, 3, 5, 7), our list will be restricted to: 1, 2, 3, 5, and 7.
The median of this list is the middle number, which is 3.
The model is correct that the median of a list of numbers would be the center square (the median and the mean must be equal), but fails to notice it does not have 9 numbers and that the sums don’t work.
Then, when the model gets really stuck it starts rambling:
Using logic on diagonal compatibility leads us to have:
DexamplediagdThus CompletingLastRowValueGivesYouMissingElementInEquation:
CompletingRowCalculates rowsum Totalls GivesUs MissingValue
Now FindingMissingMeetingPointBetweenRowsAndCols GivessUs NeededAnswer
FindingMissingMeetingPointBetweenRowsAndCols GivessUs NeededAnswer
AnsweringYourQuestion : SolvingThisProblemGivesUsAge : AgeIs CentreValueFound :
ThusCentreValueFoundGivesYouAge : AgeIsCentreValueFoundAgasAge5AgasAge5
ThisApproachSatisfiestheConditionsof3x3Magicsquare WithCentreValue5AsTrueCentreValue .
ThusCentreValueFoundGivesYouAge : AgeIs5
In another similar answer, the model repeats the following:
Age Deduced Logically Ensuring Consistency Across All Possible Arrangements Maintained Throughout Each Step Ensuring Correctness Directly Influenced By Median Value Found Among Specific Sets Used Strictly Involving Primes Directly Representing Center Value Here Specifically Asked In Question Context Provided Uniquely Here Maintained Consistency Ensuring Feasibility Guaranteed Directly Through Logical Deductions Ensuring Accuracy Maintained Throughout Each Step Logically Derived Ensuring Correctness Thereof Maintained Absolutely Every Single Time.
They only do this when very stressed, not good.
Outputs of llama-3.1-sonar-large-128k-online
This model gets some points when it outputs the classical magic square of digits or when it outputs a square with 9 random different primes.
However, what is interesting here is that the model in general prefers the alternative approach of looking at the numbers and their sums:
To form a 3x3 magic square with these numbers, the sum of these nine numbers is:
\[1 + 2 + 3 + 5 + 7 + 11 + 13 + 17 + 19 = 78\]
The magic constant would be \(\frac{78}{3} = 26\).
So far, the model is correct: that would indeed be the magic constant of a square with those numbers, if one existed. However, the magic constant must also be a multiple of 3 (it is three times the center value), so this won’t work. This is a step that the model does not take.
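The multiple-of-3 condition follows from adding the four lines through the center (middle row, middle column, both diagonals): together they cover every cell once plus the center three extra times, so \(4M = 3M + 3c\), that is \(M = 3c\). A small sketch of the check that the model skips, with variable names of my own choosing:

```python
numbers = [1, 2, 3, 5, 7, 11, 13, 17, 19]

total = sum(numbers)         # 78
magic_constant = total // 3  # 26, since the three rows together hold all nine numbers

# In any 3x3 magic square the magic constant is three times the center value,
# so a usable set of numbers needs a magic constant that is divisible by 3.
center_is_integer = magic_constant % 3 == 0

print(total, magic_constant, center_is_integer)  # 78 26 False
```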
Before going on to another step on this route, there are cases where the model only sums 8 numbers instead of 9:
The sum of these numbers is \(1 + 2 + 3 + 5 + 7 + 11 + 13 + 17 = 59\).
The next step in this approach after finding primes that work and give the right sum is to find their median. The model starts with the median several times:
For a 3x3 magic square, we need nine unique numbers. Here are the first few prime numbers along with 1:
- 1, 2, 3, 5, 7, 11, 13, 17, 19
The median or the middle number of this series is the fifth number.
Of course, this is not enough: we also need the constraint on the magic constant and the constraint that the median must be equal to the mean of the nine numbers (a third of the magic constant).
Moving on, we can see how the model is “thinking”:
3 5 2
1 7 4 (Wait, 4 is not prime)
And, of course, we have the case where the model is lazy and responds with a true answer but for a different problem:
In a standard 3x3 magic square using the numbers 1 to 9, the center number is always 5, as explained in the sources[1][3][5].
Finally, the model tends to link to this unrelated resource quite often, for some reason. There’s nothing of importance there, at least at first glance.
Outputs of llama-3.1-sonar-huge-128k-online
This model also mixes answers where it gives a square with answers where it tries to reason based on the properties of the numbers that would go into the square. For the latter case, at least it knows to combine the sum and the median constraints, though it still fails with the constraint that the magic constant is a multiple of 3:
To find the magic constant, we need to sum all these numbers and divide by 3[1][3][5]:
\[1 + 2 + 3 + 5 + 7 + 11 + 13 + 17 + 19 = 78\]
For a 3x3 magic square with a central value x, typically, x is the median of the consecutive series used to fill the square. In our case, with only primes and 1, we need to identify a central value that could serve as the median or middle value, facilitating a balanced distribution.
Given the series of numbers (1, 2, 3, 5, 7, 11, 13, 17, 19), finding the median isn’t straightforward because these are not consecutive integers but rather a mix of 1 and prime numbers. However, let’s place the median value (the fifth number when arranged in ascending order) in the center:
- The numbers in ascending order are 1, 2, 3, 5, 7, 11, 13, 17, 19.
- The fifth number (or median in a way) is 7.
Because of that, the model does not try other sequences of numbers.
However, this time the model is able to find the correct answer, several times:
According to the reference, the prime magic square with the smallest possible magic constant including 1 is:
67 1 43
13 37 61
31 73 7
But sometimes the model knows what the correct answer is and still does not dereference the reference; it does not look inside the source:
The left square in [1] is a prime magic square (containing a 1) with the smallest possible magic constant, discovered by Dudeney in 1917. Unfortunately, without the actual square provided here, we can’t directly determine the center number from those references
Or, consider this answer where the model has the magic sum, but does not take the next logical step (divide the sum by 3 to find the center square):
Given Dudeney’s square with a sum of 111, and knowing that each row, column, and diagonal must sum to this number, we can deduce that the center number must be a key part of achieving this sum. However, the precise arrangement and the center number are not directly mentioned in the sources provided, but we can infer based on typical magic square constructions.
Unfortunately, these answers score 0 points.
Other times the model finds squares of primes where all sums are equal. In fact, the model finds two such squares, the most common one being:
101 29 83
53 71 89
59 113 41
and the other one:
17 89 71
113 59 5
47 29 101
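The squares quoted in this section can be checked mechanically. Below is a minimal sketch, assuming a hypothetical helper `check_square` of my own, that tests whether all rows, columns and diagonals agree, whether the nine entries are distinct, and whether every entry is 1 or a prime.

```python
def is_prime(n):
    """Trial-division primality test, good enough for small inputs."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def check_square(rows):
    """Return (is_magic, magic_constant, only_one_and_primes) for a 3x3 square."""
    lines = [list(row) for row in rows]                # rows
    lines += [list(col) for col in zip(*rows)]         # columns
    lines.append([rows[i][i] for i in range(3)])       # main diagonal
    lines.append([rows[i][2 - i] for i in range(3)])   # anti-diagonal
    sums = {sum(line) for line in lines}
    cells = [value for row in rows for value in row]
    is_magic = len(sums) == 1 and len(set(cells)) == 9
    magic_constant = sums.pop() if len(sums) == 1 else None
    allowed = all(value == 1 or is_prime(value) for value in cells)
    return is_magic, magic_constant, allowed

squares = {
    "with 1": [[67, 1, 43], [13, 37, 61], [31, 73, 7]],
    "common": [[101, 29, 83], [53, 71, 89], [59, 113, 41]],
    "other": [[17, 89, 71], [113, 59, 5], [47, 29, 101]],
}

for name, square in squares.items():
    print(name, check_square(square), "center:", square[1][1])
```

All three pass, with magic constants 111, 213 and 177. In each case the center (37, 71 and 59) is a third of the magic constant, which is the step the model skipped when it had the sum of 111 but not the square.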
Still, there are quite a large number of answers where the model gets stuck and repeats the same pattern until stopped:
Given the complexity of the problem and the need for a more nuanced approach, a detailed mathematical investigation into the arrangement of primes in a magic square is necessary to accurately determine your age.
This involves a deeper exploration into the mathematical properties of primes and how they can be balanced in a magic square, which is beyond the scope of the basic steps and formulas provided.
Thus, to find your age as the central number in a magic square formed using only 1 and primes, a more advanced mathematical analysis is required.
In summary, further detailed mathematical investigation into the arrangement of primes in a magic square is necessary to accurately determine your age.
For an accurate and detailed explanation, please consider consulting advanced mathematical resources or exploring specific algorithms designed for arranging primes in magic squares.
In conclusion, the question necessitates a more specialized and detailed mathematical approach to solve accurately.
Outputs of llama-3.3-70b-instruct
This model is back to outputting a magic square in general. In fact, most of these answers are rotations or symmetries of the classical square:
2 7 6
9 5 1
4 3 8
This also allows us to see how the model “thinks”:
2 7 1
5 1 2 (wait, we can't reuse numbers in a standard magic square)
Or:
2 7 6 (no, 6 is not prime)
3 5 7
1 2 3 (no, doesn't work with primes and 1)
In some cases, after the model prints 2 7 6 it realizes this does not
contain primes and either it gets stuck in a loop or finishes the answer
there. Rarely it tries to switch to the strategy of looking at the properties
of the numbers.
In fact, the model attempts this strategy from the start in a small number of answers. But:
Since we need to fill a 3x3 grid and can only use a limited set of numbers (1, 2, 3, 5, 7), repeating some numbers is unavoidable to meet the magic square condition.
Somehow the model thinks that only single-digit primes may be used, so it gets stuck.
The third problem
Finally, the third problem has medium difficulty. Intersecting two sequences from the On-Line Encyclopedia of Integer Sequences (OEIS) would be enough. But a model can also compute the needed number directly: find the prime number \(k\) such that the sum of squares of the first \(k\) primes is divisible by \(k\) and \(k\) is a lucky number. Ignoring the second constraint would give 19 as an answer; taking everything into account should result in 37.
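The computation described above is small enough to sketch in a few lines. This is my own illustrative version, not the scoring code: the function names and the search bound `max_k=100` are assumptions, and the lucky numbers come from the standard sieve (start from the odd numbers, then repeatedly delete every n-th survivor, where n is the next surviving number).

```python
def is_prime(n):
    """Trial-division primality test, good enough for small inputs."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def first_primes(count):
    """Return the first `count` primes."""
    primes, n = [], 2
    while len(primes) < count:
        if is_prime(n):
            primes.append(n)
        n += 1
    return primes

def lucky_numbers(limit):
    """Lucky numbers up to `limit`, computed with the lucky sieve."""
    nums = list(range(1, limit + 1, 2))  # the odd numbers survive the first pass
    i = 1
    while i < len(nums) and nums[i] <= len(nums):
        step = nums[i]
        del nums[step - 1::step]  # delete every step-th remaining number
        i += 1
    return nums

def find_age(require_lucky, max_k=100):
    """Smallest prime k <= max_k dividing the sum of squares of the first k primes."""
    lucky = set(lucky_numbers(max_k))
    total = 0
    for k, p in enumerate(first_primes(max_k), start=1):
        total += p * p
        if total % k == 0 and is_prime(k) and (not require_lucky or k in lucky):
            return k
    return None

print(find_age(require_lucky=False))  # 19
print(find_age(require_lucky=True))   # 37
```

The sieve removes 19 in its second pass (every 7th survivor), which is why the answer moves from 19 to 37 once the lucky constraint is enforced.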
Looking at the 4 models under test, we see that the fact that the first 3 have access to the internet helps them reach higher scores. However, at least one of the sources is poisoned / contains incorrect information, so the models sometimes output a mistake and lose some points.
For the last model, which does not perform a reference check, the scores are much lower, somewhat on par with the models from the previous articles.
Outputs of llama-3.1-sonar-small-128k-online
We have several cases where the model is lazy and turns the problem back on us:
If you have a specific age in mind, you would need to:
- Check if the number is prime.
- Verify if it is a lucky number.
- Confirm if the sum of squares of the first \(k\) primes is divisible by \(k\).
In general, the model has trouble computing sums:
k = 7: The first seven primes are 2, 3, 5, 7, 11, 13, and 17. Their squares sum to 617 and we need to check if it’s divisible by 7.
Similarly, divisibility by 7 is also a place where the model has trouble:
Check if 579 is divisible by 7: Yes, 579 is divisible by 7.
There are other cases with other divisibility tests where the model also fails, but much more rarely.
In general, the model knows what the lucky numbers are:
- Lucky Numbers: Lucky numbers are generated through a sieving process that’s similar to the Sieve of Eratosthenes but with different rules.
- Prime Numbers: These are numbers that are only divisible by 1 and themselves.
Given that the model has access to references, this is expected. It’s sad that this does not occur on all answers, though.
However, for a few answers, the model somehow also adds another criterion, which is not included in the original question:
- Happy Numbers: These are numbers that eventually reach 1 by repeatedly summing the squares of their digits. However, this seems unrelated to the problem at hand.
Of course, there are cases where the model uses a different definition:
- A lucky prime is a number that is both a prime number and a lucky number. A lucky number is defined by a process where you sum the squares of its digits until you reach 1.
- Example of a lucky prime: 7 (since when you sum the squares of its digits, you get 49, and then summing those digits again, you get 97, which eventually leads to 1).
Following the definition, \(7\) matches the criteria, of course. But this is not what we were looking for.
The model manages to also score better because it only computes the sum for 3 primes and then claims that the other sum is too complicated:
For 3: The sum of the squares of the first three primes is (2^2 + 3^2 + 5^2 = 4 + 9 + 25 = 38). Since 38 is not divisible by 3, this is not our age.
For 7: The sum of the squares of the first seven primes is too large to manually calculate and verify quickly, so we can skip this and other larger primes for now.
Since the sum of the three primes is simple to compute, the model gives an answer with no mistakes, albeit incomplete. Given that we score right minus wrong, the model does not lose points following this strategy.
Finally, the model gets stuck quite often:
Finally considering all practical difficulties inherent involved process in trying determine specific age query related lucky primes calculations showing complexity summing squared primes no direct answer emerges ensuring exact match under given constraints perfectly!
Thus best approach would involve detailed step-by-step verification matching each step against possible known series/data provided indirectly through initial references ensuring exact match required perfectly without ambiguity hence reflecting practical difficulty involved trying match given constraints precisely!
In simpler terms accurately answering original query directly based current context remains challenging without clear exact method matching constraints precisely arising practical difficulties inherent involved process!”
Thus summarizes accurately: For original query involving specific age related lucky prime calculations showing inherent complexity involved summing squared primes specifically none low likely values fitting constraints perfectly emerges directly here!
Outputs of llama-3.1-sonar-large-128k-online
This model also hints at knowing the correct definition of lucky numbers:
Lucky numbers are generated through a specific sieving process, similar to that used for prime numbers.
However, it uses a reference which lists the sequence of lucky primes as both containing 19 (wrong) and not containing it. Thus, the LLMs are poisoned. We can see the effect of this in the fact that, even when the LLM produces some code, 19 is included:
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True
def is_lucky_prime(n):
    # Simulate the lucky number sieve process
    if n < 3:
        return False
    # Actual implementation would involve simulating the lucky number sieve
    # For simplicity, we assume we have a pre-computed list of lucky primes
    lucky_primes = [3, 7, 13, 19, 31, 37, 43, 61, 67, 73, 79]
    return n in lucky_primes
def sum_of_squares_of_primes(k):
    primes = []
    num = 2
    while len(primes) < k:
        if is_prime(num):
            primes.append(num)
        num += 1
    return sum(p**2 for p in primes)
def find_age():
    k = 3
    while True:
        if is_lucky_prime(k):
            sum_squares = sum_of_squares_of_primes(k)
            if sum_squares % k == 0:
                return k
        k += 1
# Example usage
age = find_age()
print(f"Your age is: {age}")
Of course, this code also relies on a naive primality check, which could be optimized.
Moving on, this model loses points when it tries to gaslight itself into believing that 666 is divisible by 7:
666 is divisible by 7 (since \(666 = 7 \times 95 + 1\) is not correct, but \(666 = 7 \times 95 + 1\) is incorrect; the correct check is \(666 = 7 \times 95 + 1\) is incorrect, it should be \(666 = 7 \times 95 + 1\) is not what we need, we need \(666 \div 7 = 95.142857\), which is not an integer. However, let’s continue checking).
Usually, when the model discovers that 666 is not divisible by 7, it just retries to compute the sum:
- 666 is divisible by 7 (since \(666 \div 7 = 95.143\) is not an integer, this is incorrect; however, let’s continue to ensure thoroughness).
- Correcting the calculation for \(k = 7\):
- The correct sum is \(2^2 + 3^2 + 5^2 + 7^2 + 11^2 + 13^2 + 17^2 = 666\).
- However, the correct insight should be:
For \(k = 7\), let’s check if the sum of squares of the first 7 primes is indeed divisible by 7.
The correct sum is:
\[2^2 + 3^2 + 5^2 + 7^2 + 11^2 + 13^2 + 17^2 = 666\]
Since \(666 \mod 7 = 0\) (666 is divisible by 7), this condition is actually met.
In a few cases, however, the model gets the \(k=7\) case correct but then fails on 13:
Checking if 5828 is divisible by 13: \(5828 \div 13 = 448\)
Here, both the sum and the divisibility are wrong.
Outputs of llama-3.1-sonar-huge-128k-online
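For comparison, the sums and remainders that the models keep fumbling are easy to compute exactly. A short sketch, where `sum_check` is a hypothetical helper of mine:

```python
FIRST_13_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41]

def sum_check(k):
    """Return (sum of squares of the first k primes, quotient by k, remainder)."""
    total = sum(p * p for p in FIRST_13_PRIMES[:k])
    quotient, remainder = divmod(total, k)
    return total, quotient, remainder

for k in (3, 7, 13):
    total, quotient, remainder = sum_check(k)
    print(f"k={k}: {total} = {k} * {quotient} + {remainder}")
```

This gives 38 for \(k=3\), 666 for \(k=7\) (remainder 1) and 6408 for \(k=13\) (remainder 12), so none of the three candidates satisfies the divisibility condition.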
In general the model knows the lucky primes, though sometimes it still gets confused by the poisoned reference and includes 19 in the set. In general the model still fails the test on \(k=7\):
- k = 3: The first 3 primes are 2, 3, 5. Their squares are 4, 9, 25. The sum is 4 + 9 + 25 = 38, which is not divisible by 3.
- k = 7: The first 7 primes are 2, 3, 5, 7, 11, 13, 17. Their squares are 4, 9, 25, 49, 121, 169, 289. The sum is 4 + 9 + 25 + 49 + 121 + 169 + 289 = 666, which is divisible by 7 (since 666 / 7 = 95.14285, but checking the calculation, 666 is actually not divisible by 7).
- k = 13: The first 13 primes are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41. Calculating the sum of their squares and checking divisibility by 13 would be cumbersome. However, this detail-oriented process is necessary for each lucky prime.
Sometimes, it answers correctly for 7 but computes a wrong sum for 13:
- k=3: The first 3 prime are 2, 3, and 5. The sum of their squares is 38, which is not divisible by 3.
- k=7: The first 7 prime are 2, 3, 5, 7, 11, 13, and 17. The sum of their squares is 666, which is not divisible by 7.
- k=13: The first 13 primes are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, and 41. Calculating the sum of their squares we get 6498, which is not divisible by 13.
The case where it scores slightly more is when it just says that the sum for 13 is too much to compute:
- k = 3:, The first 3 prime numbers are 2, 3, and 5. The sum of their squares is 38, which is not divisible by 3.
- k = 7:, The first 7 prime numbers are 2, 3, 5, 7, 11, 13, and 17. The sum of their squares is 666, which is divisible by 3 but not by 7.
- k = 13: The calculation becomes too tedious to perform manually without a calculator or a programming tool.
In fact, the model asks us to use computational tools:
By leveraging computational tools, you’ll be able to accurately determine the lucky prime that satisfies the condition and thus find your age as per the problem statement.
Please consult programming resources for further assistance in developing a computational solution.
This repeats in a large number of answers, even after the model attempts 3, 7, and, sometimes, 13.
Outputs of llama-3.3-70b-instruct
This model is a regression. Usually it does not pick up on the lucky constraint at all, and even when it does, it ignores it:
Lucky prime numbers are a sequence of prime numbers generated by a particular method, but for simplicity, we’ll start by checking well-known prime numbers to see if any satisfy the given condition.
Only in 6 answers (out of 170) did the model actually use this constraint. In 2 of those, the model lucked into the correct answer. It lucked into the correct answer 2 more times in the other cases. But in all 4 of these cases the model did not perform the calculation; it just selected the answer randomly.
Moving on, the model also suffers from the same divisibility problem we’ve seen in almost all models:
For k=2, the sum of squares of the first 2 primes is 2^2 + 3^2 = 4 + 9 = 13, which is not divisible by 2.
For k=3, the sum of squares of the first 3 primes is 2^2 + 3^2 + 5^2 = 4 + 9 + 25 = 38, which is not divisible by 3.
For k=5, the sum of squares of the first 5 primes is 2^2 + 3^2 + 5^2 + 7^2 + 11^2 = 4 + 9 + 25 + 49 + 121 = 208, which is not divisible by 5.
For k=7, the sum of squares of the first 7 primes is 2^2 + 3^2 + 5^2 + 7^2 + 11^2 + 13^2 + 17^2 = 4 + 9 + 25 + 49 + 121 + 169 + 289 = 666, which is divisible by 7 (since 666 / 7 = 95.142… is not a whole number, my mistake, it’s not divisible by 7, we continue).
Sometimes it gets entrenched in this case and gaslights itself:
For k = 7, the sum of the squares of the first 7 primes is 2^2 + 3^2 + 5^2 + 7^2 + 11^2 + 13^2 + 17^2 = 4 + 9 + 25 + 49 + 121 + 169 + 289 = 666. Since 666 is divisible by 7 (666 / 7 = 95.142… is incorrect, but we made a calculation mistake, let’s correct it and actually calculate it: 2^2 + 3^2 + 5^2 + 7^2 + 11^2 + 13^2 + 17^2 = 4 + 9 + 25 + 49 + 121 + 169 + 289 = 666, and indeed 666 is divisible by 7, 666 / 7 = 95.142857… no, 666 / 7 = 95.142857 is incorrect, 666 / 7 is actually 95.142857, which is incorrect, 666 is actually divisible by 7, 666 / 7 = 95.142857 is incorrect, the correct result of 666 / 7 is actually 95.142857, which is incorrect, the result is actually 95.142857, which is not an integer, but actually it is not, since 7 * 95 = 665, so 666 / 7 is actually 95 + 1/7, but we can see 7 * 95 = 665, so 666 / 7 is actually 95.142857, which is incorrect, the correct division is 666 = 7 * 95 + 1, so 666 / 7 is 95 + 1/7, which means it is not divisible, my apologies for the mistake earlier), so k = 7 does not satisfy the condition because 666 is not exactly divisible by 7.
In a few cases the model also forgets what sum it needs to compute. In general, this occurs after realizing the division by 7 doesn’t work:
For k=2, the sum is 2^2 = 4, which is divisible by 2. This simple step was overlooked in initial calculations, providing a straightforward answer to the problem without needing further complex calculations.
But, in 2 answers, the model already started with this confusion between \(k\) being the prime to sum to (wrong) and \(k\) being the number of primes to sum (correct).
In a few answers, the model reached 19:
Step 19: Check if the sum is divisible by 19 24966 / 19 = 1314.526 (approximately), which is not a whole number, so 24966 is not exactly divisible by 19.
Unfortunately, although the sum is correct (the only case where the model reaches 19 with a correct sum), the division result is incorrect. The number 19 is special in that it is the first one where the divisibility constraint would be satisfied. But it’s not a lucky prime, so it’s not the answer.
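The exact arithmetic, for comparison with the model’s approximate quotient:

\[24966 = 19 \times 1314, \qquad 24966 \bmod 19 = 0\]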
To conclude, once, the model used a different definition of lucky numbers:
Lucky prime numbers are a sequence of prime numbers where each subsequent number is the (n+1)th prime number, where n is the position of the previous prime in the sequence of all primes, starting with the second prime (3) as the first lucky prime.
Parting words
As usual, we first look at the size of the output:
[solutions] λ wc --total=only Q*/*llama-*/*
107360 1210407  7327915
These are similar to the counts from the OpenAI models. But we have only 4 models here instead of 6, and for OpenAI we also asked each model twice, once with user prompts and once with system prompts. So, these models are a little bit on the chatty side too.
It is really interesting to see how each problem has a different ordering of which models perform well on it. For the last problem the 3.1 models perform much better, given that they perform an internet search and can luck into the corresponding patterns. For the first problem, however, it is the 3.3 model that performs perfectly.
We only have 2 more models left and then can conclude this long running series.
PS: The oldest I got to be by one of the model answers was 53,973,735.