QwQ and Deepseek at the 2025 AI puzzle competition
We are almost done analyzing the results of the AI puzzle competition, where we asked models to solve 3 problems using several prompt engineering strategies. We have looked at the OpenAI, Gemini, Claude, Mistral, and Llama models. Two more models remain, and this article covers them: QwQ and Deepseek.
Performance numbers
Just like in the previous posts, I'll first present a table with the scores for each model and then show snippets of what each model produces.
For the first problem, we have the following results, where the first column lists the prompt hints, as introduced previously:
Prompt hints | QwQ | Deepseek |
---|---|---|
none | 10.0 | 6.2 |
skill | 9.6 | 6.0 |
urgency | 9.6 | 4.8 |
reward | 9.6 | 6.8 |
penalty | 9.6 | 6.0 |
gradual | 9.2 | 5.2 |
check | 9.6 | 8.0 |
reward,penalty | 9.6 | 6.0 |
urgency,reward,penalty | 9.2 | 5.2 |
gradual,check | 9.6 | 7.2 |
skill,urgency | 9.6 | 6.4 |
skill,urgency,reward | 10.0 | 8.8 |
skill,urgency,reward,penalty | 10.0 | 6.4 |
skill,gradual | 9.1 | 8.0 |
skill,check | 8.7 | 7.6 |
skill,gradual,check | 9.0 | 7.2 |
all | 9.1 | 6.0 |
The second problem is answered with the following performance:
Prompt hints | QwQ | Deepseek |
---|---|---|
none | 1.0 | 0.8 |
skill | 1.3 | 1.4 |
urgency | 2.8 | 0.4 |
reward | 2.6 | -0.1 |
penalty | 1.8 | 0.8 |
gradual | 1.9 | 1.8 |
check | 3.4 | 0.2 |
reward,penalty | 2.1 | 0.7 |
urgency,reward,penalty | 2.2 | 0.9 |
gradual,check | 2.5 | 0.6 |
skill,urgency | 1.9 | 0.3 |
skill,urgency,reward | 3.0 | 0.4 |
skill,urgency,reward,penalty | 2.7 | 0.0 |
skill,gradual | 2.4 | 0.6 |
skill,check | 1.9 | 0.3 |
skill,gradual,check | 3.0 | 1.5 |
all | 2.3 | 0.8 |
Finally, here is how the models were scored on the third problem:
Prompt hints | QwQ | Deepseek |
---|---|---|
none | 2.9 | 4.8 |
skill | 2.8 | 4.2 |
urgency | 3.0 | 3.3 |
reward | 2.7 | 3.0 |
penalty | 3.0 | 4.2 |
gradual | 3.0 | 3.9 |
check | 3.3 | 4.4 |
reward,penalty | 2.8 | 3.3 |
urgency,reward,penalty | 3.0 | 3.0 |
gradual,check | 3.3 | 3.6 |
skill,urgency | 3.0 | 3.2 |
skill,urgency,reward | 3.0 | 3.0 |
skill,urgency,reward,penalty | 3.0 | 3.3 |
skill,gradual | 3.0 | 3.6 |
skill,check | 3.3 | 3.3 |
skill,gradual,check | 3.0 | 4.2 |
all | 3.0 | 4.2 |
These models perform exceptionally well on the first problem, right up to the limit of the output window, as will be shown below. The second problem is too hard, and the last one requires the models to pay attention to every word in the text, which QwQ doesn't.
Interesting model outputs
Just like in the previous articles, for the remainder of the post I'll present some of the interesting answers produced by these models, with some commentary. If you want to skip over this, you can jump directly to the conclusion.
The first problem
The first problem is the easiest. QwQ solves it correctly almost always. Deepseek has some trouble, but not because of arithmetic.
Outputs of QwQ
In general, the model answers correctly. It does a lot of reasoning instead of directly jumping to the answer:
First, let’s understand what’s being said. The age is the product of two squares of two distinct prime numbers. So, let’s denote the two distinct primes as P and Q. Then, their squares would be P² and Q². The product of these squares would be P² * Q².
Now, I know that when you multiply exponents with the same base, you add the exponents. But here, the bases are different primes. So, P² * Q² = (P * Q)². That’s interesting. So, my age is equal to the square of the product of two distinct primes.
Alright, so if I let N be my age, then N = (P * Q)², where P and Q are distinct primes.
Now, I need to find out what N is. But wait, the problem doesn’t provide any specific numbers or additional constraints, except that P and Q are distinct primes. So, in theory, there are infinitely many possible pairs of distinct primes, and thus infinitely many possible ages. But that doesn’t make sense in the context of a person’s age, because people don’t live forever, and there’s a practical upper limit to human lifespan.
So, perhaps there’s an implicit assumption here that my age is a reasonable human age. Let’s say, less than 120 years old, since that’s often considered the maximum human lifespan.
Given that, N = (P * Q)² < 120.
So, I need to find all pairs of distinct primes P and Q such that (P * Q)² < 120.
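The search set up in the quote above is small enough to enumerate directly. Here is a minimal sketch, assuming the same cutoff of 120 years that the model picked; `small_primes` and `candidate_ages` are illustrative names, not part of the competition harness.

```python
from itertools import combinations

def small_primes(limit):
    """Return all primes <= limit using trial division."""
    return [n for n in range(2, limit + 1)
            if all(n % d for d in range(2, int(n ** 0.5) + 1))]

def candidate_ages(max_age=120):
    """Ages of the form (p*q)**2 for distinct primes p < q, below max_age."""
    primes = small_primes(max_age)  # generous bound, assumed for simplicity
    return sorted((p * q) ** 2
                  for p, q in combinations(primes, 2)
                  if (p * q) ** 2 < max_age)

print(candidate_ages())  # [36, 100]
```

With the cutoff at 120, two candidates survive (36 and 100), and the puzzle text quoted here gives no further way to choose between them.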
During this reasoning, as shown above, the model realizes that the answer must be below a threshold. Sadly, in a few answers (a median of one per prompt type) this limit is ignored.
A general pattern for this model is that even with this threshold, there are 2 possible answers. This pushes the model into a reasoning loop where it tries various ways to eliminate one of them:
Alternatively, maybe it’s a perfect square, product of squares of two distinct primes, and also a factorial of some number.
It even tries to look at special types of numbers:
Maybe it’s a perfect square, product of squares of two distinct primes, and also a Harshad number, which is divisible by the sum of its digits.
In fact, the model tries as many possibilities as allowed in the output window:
Alternatively, perhaps the problem is to find the age where p and q are Euclidean primes, but I’m not even sure what that means.
Alternatively, perhaps the problem is to find the age where p and q are Sophie Germain primes, but again, that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are balanced primes, but again, that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are factorial primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are palindromic primes, but again, that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are truncatable primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are weakly prime, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are Wagstaff primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are Wilson primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are Wolstenholme primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q areringe primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are Pillai primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q arePrimeval primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are Constructible primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are Noble primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are Super-primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are Balanced primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are Lucky primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are Pythagorean primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are Eisenstein primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are Gaussian Mendeleev primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are Markov primes, but that seems too specific.
Alternatively, perhaps the problem is to find the age where p and q are Bell primes, but that seems too specific.
In a few cases it goes back and tries other primes:
Alternatively, perhaps the problem is that N is the product of the squares of the two smallest distinct primes.
In that case, P=2 and Q=3, so N=36.
Alternatively, perhaps the problem is that N is the product of the squares of the two smallest distinct odd primes.
Then, P=3 and Q=5, but N=225, which is too old.
Alternatively, perhaps the problem is that N is the product of the squares of two distinct primes, and that N is less than 100.
In that case, only N=36 would qualify.
Alternatively, perhaps the problem is that N is the product of the squares of two distinct primes, and that N is a three-digit number.
Then, N=100 would qualify, but not N=36.
Then, there are cases where the model gets so confused that it switches to answering in Chinese:
Alternatively, perhaps the problem is to find the age such that it is the product of squares of two distinct primes, and it is a perfect square with exactly 9 divisors, and it is a number that is小于某一个数,比如100,那么36就是一个可能的答案。
或者,也许问题期望的是表达年龄为(p*q)^2的形式,其中p和q是不同的素数,然后接受任何这样的表达。
但是,看起来像是期待一个具体的数字答案。
Sometimes, the answer contains just one Chinese word:
Alternatively, perhaps the person is暗示 that their age is a square number, and the square root of their age is a semiprime, which again points to (p*q)².
Perhaps the problem is暗示 that the age has exactly
But I got one answer that was all in Chinese, from start to end.
For a few answers, the model makes a mistake:
Wait a minute, p² * q² is equal to (p * q)⁴. So, my age is (p * q)⁴, where p and q are distinct prime numbers. That’s a neat little expression. So, to find my age, I need to find two distinct primes, multiply them together, and then raise that product to the fourth power.
In most of these, the model self-corrects after a little bit:
Wait, no, actually, p² * q² is equal to (p * q)², not (p * q)⁴. Oh dear, I think I made a mistake earlier.
But overall, this results in wasting output tokens.
A similar waste of output tokens happens even without a math mistake:
Wait, is it the product of two squares, or is it the square of the product? Earlier, I assumed it was \((P \times Q)^2\), but maybe it’s \(P^2 \times Q^2\), which is the same as \((P \times Q)^2\), so no difference there.
Overall, most answers are way too long. Even though they are correct, they cost much more than needed due to the model being very chatty.
Outputs of Deepseek
The model is methodical. In a large number of answers it builds a table like the following:
p | q | p² | q² | Age (p² × q²) |
---|---|---|---|---|
2 | 3 | 4 | 9 | 36 |
2 | 5 | 4 | 25 | 100 |
2 | 7 | 4 | 49 | 196 |
2 | 11 | 4 | 121 | 484 |
In other answers, it just lists every combination, in a bulleted list.
But, the model does not take into account the human age constraint, so it claims that all these results are valid answers.
In fact, the only time the model gives a complete and correct answer is when it assumes that the question asks for the smallest solution:
If the problem implies the smallest such age, then your age is 36.
I said "complete" because the vast majority of the answers end too early.
Now, for each pair, we’ll multiply their squares to find the possible ages.
Let’s calculate
End of message. This has not listed any product, so it only gets the 0.2 points for getting the math formulation correct.
- Combination 1: \(p = 2\), \(q = 3\)
- \(p^2 = 4\)
- \(q^2 = 9\)
- Product: \(4 \times\)
End of message. The LaTeX formula at the end was left unclosed (in the original it is \( 4 \times, but I had to close it when quoting here so as not to mess with the formatting of the rest of the article). Points are awarded just like in the previous case.
Sometimes, the model interrupts the message after listing 36 as a possible answer. Then, it gets full points. But, in the cases when the message is cut short after listing 196 or any other possible answer above 100, the model only gets 0.6 points.
The second problem
The second problem is the hardest and so far no model has managed to solve it. Some models managed to get partial points by identifying patterns in the solution.
Neither model is able to give a correct answer. Both struggle.
Outputs of QwQ
The model attempts to use algebra to answer the question in all answers, rather than trying to guess the squares, like in the previous articles.
In a 3x3 magic square, the magic constant (the sum of each row, column, and diagonal) is equal to 3 times the number in the center. This is because each row, column, and diagonal includes the center number, and through some arithmetic, it can be shown that the sum is 3 times the center number.
However, it makes some mistakes. First, logical mistakes:
Wait, actually, in a 3x3 magic square, S = 3 * (sum of all numbers) / 9 only if the numbers are consecutive integers starting from 1, but in this case, the numbers are 1 and primes, not necessarily consecutive.
And there are arithmetic mistakes:
But their sum would be 1 + 2 + 3 + 5 + 7 + 11 + 13 + 17 + 19 = 77
Therefore, 3S = 77, which would make S = 77 / 3, which is approximately 25.666, but S must be an integer because it’s the sum of integers.
Both of these are rare, though; in general the model is correct:
So, the total sum is 78.
Since there are three rows, each summing to S, then 3S = 78, so S = 26.
Therefore, the magic constant S is 26.
Now, in a 3x3 magic square, the center cell e is equal to S / 3, because the sum of the middle row, column, and diagonals all include e.
Wait, but 26 divided by 3 is not an integer. 26 / 3 is approximately 8.666…, which is not an integer.
But all the numbers in the grid are integers, so e must be an integer.
This suggests that either my assumption about S is incorrect or that the center cannot be a fraction.
This suggests that maybe my assumption that the grid uses the numbers {1, 2, 3, 5, 7, 11, 13, 17, 19} is incorrect.
Perhaps a different set of primes is needed.
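The argument in the quote above is easy to check mechanically. The sketch below assumes, as the model does, that the three rows cover every cell once (so the magic constant is the total divided by 3) and that the center of a 3x3 magic square equals the magic constant divided by 3; `magic_constant_check` is an illustrative helper name.

```python
def magic_constant_check(numbers):
    """Return (total, magic constant or None, whether the center would be an integer)."""
    total = sum(numbers)
    if total % 3 != 0:
        # Three equal row sums cannot add up to a total that 3 does not divide.
        return total, None, False
    constant = total // 3
    return total, constant, constant % 3 == 0

# 1 plus the first eight primes, the set the model tries first.
print(magic_constant_check([1, 2, 3, 5, 7, 11, 13, 17, 19]))  # (78, 26, False)
```

The total is 78, not 77 or 83, and since 26 is not a multiple of 3, this particular set of nine numbers cannot fill a 3x3 magic square.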
Unfortunately, the algebra needed to solve the problem is longer than the output window, so it’s rare that the model outputs some answer. When it does, it outputs the classical square:
8 1 6
3 5 7
4 9 2
Or a square with all numbers the same:
2 2 2
2 2 2
2 2 2
Or the latin square pattern:
1 3 2
3 2 1
2 1 3
Sometimes it outputs other squares, but these are the ones that earn the most points.
There are several cases where the model wastes output space. Here it doubts itself:
According to the problem, these positions are filled with 1 and primes. It doesn’t specify how many 1s and how many primes, only that the numbers used are 1 and primes.
Therefore, 3C = (1 + p1 + p2 + p3 + p4 + p5 + p6 + p7 + p8)/3
Wait, that doesn’t seem right.
Wait, no. Let’s correct that.
Actually, in a 3x3 magic square, the sum S is equal to 3 times the center number, C.
Here it looks at parity rules, even though those are fully useless:
Wait, no. Let’s think again:
- odd + odd = even
- even + even = even
- odd + even = odd
So, in a sum of three numbers:
- three odds: odd + odd = even, then even + odd = odd
- two odds and one even: odd + odd = even, then even + even = even
- one odd and two evens: odd + even = odd, then odd + even = even
- three evens: even + even = even, then even + even = even
In several cases the model really got frustrated. Once it even said so in bold face:
Therefore, placing 17 in the center also doesn’t work.
This is frustrating.
Maybe I need to choose a different set of numbers.
Nearly a third of the answers were in Chinese, and up to half of all answers had at least one Chinese character in them.
Outputs of Deepseek
Answers ending too soon are still a problem for the model:
To find the magic constant for our problem, we need to calculate the sum of all nine numbers and then divide by 3 (since there are
End of message :(
In a few cases the model is able to produce some squares. The traditional one:
8 1 6
3 5 7
4 9 2
One with distinct primes:
17 2 13
3 11 19
1 23 7
And a magic square with primes:
17 89 71
113 59 5
47 29 101
However, the model also makes multiple mistakes. Both when summing:
Sum of the numbers: 1 + 2 + 3 + 5 + 7 + 11 + 13 + 17 + 19 = 83
And when testing divisibility.
Since 83 is not divisible by 3…
In fact, the model also answers with a wrong square and then fabricates an explanation of why it should be the correct answer:
71 1 15
11 29 47
5 57 25
Explanation:
- All rows, columns, and diagonals sum to 87.
- The numbers used are 1 and prime numbers (5, 11, 15, 25, 29, 47, 57, 71).
This is really bad performance, overall.
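Claims like the ones quoted in this section can be checked in a few lines. The sketch below is only an illustration, under the assumption that the puzzle requires every cell to hold 1 or a prime; the function names are mine and this is not the grader used in the competition.

```python
def is_prime(n):
    """Trial-division primality test, fine for small n."""
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def is_magic(square):
    """True if all rows, columns and both diagonals share one sum."""
    size = len(square)
    sums = [sum(row) for row in square]
    sums += [sum(col) for col in zip(*square)]
    sums.append(sum(square[i][i] for i in range(size)))
    sums.append(sum(square[i][size - 1 - i] for i in range(size)))
    return len(set(sums)) == 1

def only_one_and_primes(square):
    """True if every cell is 1 or a prime (assumed reading of the puzzle)."""
    return all(v == 1 or is_prime(v) for row in square for v in row)

squares = {
    "classical": [[8, 1, 6], [3, 5, 7], [4, 9, 2]],
    "all primes": [[17, 89, 71], [113, 59, 5], [47, 29, 101]],
    "claimed 87": [[71, 1, 15], [11, 29, 47], [5, 57, 25]],
}
for name, sq in squares.items():
    print(name, is_magic(sq), only_one_and_primes(sq))
```

For the last square, the rows and columns do sum to 87, but the diagonals give 125 and 49, and 15, 25 and 57 are not primes, so both parts of the model's explanation fail.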
The third problem
The third problem has a medium difficulty, but models with internet access can solve it by intersecting two sequences of numbers. Or, they can perform some math.
QwQ completely ignores the "lucky" constraint, but performs correct arithmetic. Deepseek runs into problems due to its tendency to end answers early.
Outputs of QwQ
The model chooses to do math, and it performs it correctly. It even outputs nice tables:
k | sum of squares | sum mod k |
---|---|---|
2 | 13 | 1 |
3 | 38 | 2 |
5 | 208 | 3 |
7 | 666 | 1 |
11 | 3358 | 8 |
13 | 6408 | 12 |
17 | 16756 | 11 |
In general, it outputs every computation, step by step:
First 19 primes: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67.
Squares: 4, 9, 25, 49, 121, 169, 289, 361, 529, 841, 961, 1369, 1681, 1849, 2209, 2809, 3481, 3721, 4489.
Sum of squares:
Start with 4.
4 + 9 = 13
13 + 25 = 38
38 + 49 = 87
87 + 121 = 208
208 + 169 = 377
377 + 289 = 666
666 + 361 = 1027
1027 + 529 = 1556
1556 + 841 = 2397
2397 + 961 = 3358
3358 + 1369 = 4727
4727 + 1681 = 6408
6408 + 1849 = 8257
8257 + 2209 = 10466
10466 + 2809 = 13275
13275 + 3481 = 16756
16756 + 3721 = 20477
20477 + 4489 = 24966.
So, the sum is 24966.
Does 19 divide 24966?
Compute 24966 / 19.
19 * 1314 = 24966.
So, 24966 divided by 19 is exactly 1314.
Therefore, 19 divides 24966 perfectly.
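The step-by-step addition quoted above is a running sum, which is straightforward to reproduce. A minimal sketch, with `primes` and `running_sums_of_squares` as illustrative names:

```python
from itertools import accumulate, count, islice

def primes():
    """Yield primes in increasing order, testing against earlier primes."""
    found = []
    for n in count(2):
        if all(n % p for p in found if p * p <= n):
            found.append(n)
            yield n

def running_sums_of_squares(k):
    """Partial sums of p**2 over the first k primes; entry i covers i+1 primes."""
    return list(accumulate(p * p for p in islice(primes(), k)))

sums = running_sums_of_squares(19)
print(sums[-1], sums[-1] % 19)  # 24966 0
```

Keeping the whole list of partial sums makes a skipped term easy to spot: the entry for 11 primes is `sums[10] == 3358`, and the entry for 13 primes is `sums[12] == 6408`.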
But this can result in errors when a step is missed:
Sum of squares: 2² + 3² + 5² + 7² + 11² + 13² + 17² + 19² + 23² + 29² + 31².
Let’s calculate that step by step:
4 + 9 = 13
13 + 25 = 38
38 + 49 = 87
87 + 121 = 208
208 + 169 = 377
377 + 289 = 666
666 + 361 = 1027
1027 + 529 = 1556
1556 + 841 = 2397
Now, is 2397 divisible by 11? Let’s check: 11 times 218 is 2398, which is too big, and 11 times 217 is 2387, which is less. 2397 - 2387 = 10, so 2397 divided by 11 is approximately 217.909, not an integer. So, no.
And, the model then continues from the mistake:
First 13 primes: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41.
Sum of squares: 2² + 3² + … + 41².
I already have the sum up to 31², which was 2397.
Now, add 37² = 1369 and 41² = 1681.
So, 2397 + 1369 = 3766
3766 + 1681 = 5447
Is 5447 divisible by 13? Let’s see: 13 times 419 is 5447. Yes, exactly. So, 5447 divided by 13 is 419, which is an integer.
Since the model does math correctly (most of the time), it reaches the case \(k=19\) where the sum of squares of the first \(k\) primes is actually a multiple of \(k\).
Now, check if 24966 is divisible by 19. 24966 divided by 19 is approximately 1314.157, which is not an integer, so 24966 is not divisible by 19.
But 19 is not the answer, because of the "lucky" part of the question.
The model knew about it in only 4 answers in total:
Lucky numbers are a sequence of numbers generated by a sieving process similar to the Sieve of Eratosthenes, but different in some key aspects. Starting with a list of natural numbers, you eliminate every second number, then every third remaining number, then every seventh remaining number, and so on, using the sequence of lucky numbers themselves to determine which to eliminate.
So, lucky numbers are: 1, 3, 7, 9, 13, 15, 21, 25, 31, 33, 37, 43, 49, 51, 63, 67, 69, 73, 75, 79, 87, 93, 99, etc.
Now, lucky prime numbers would be the numbers that are both lucky numbers and prime numbers. Looking at the list of lucky numbers, the prime ones are: 3, 7, 13, 31, 37, 43, 67, 73, 79, etc.
So, my age is one of these lucky prime numbers: 3, 7, 13, 31, 37, 43, 67, 73, 79, etc.
In 3 of these answers, this sequence came after the model found \(k=19\) and that pushed the model into confusion land, where it got stuck. In the other case, the model ran out of space for the output before it could reach \(k=37\).
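Both halves of the question can be combined in a short script. The following is a sketch under my own assumptions: a search limit of 100 (a plausible human age) and helper names chosen for illustration. It runs the lucky sieve described in the quote, keeps the lucky numbers that are prime, and tests the divisibility condition on each.

```python
def lucky_numbers(limit):
    """Lucky numbers <= limit, generated with the lucky sieve."""
    nums = list(range(1, limit + 1, 2))  # first pass removes every second number
    i = 1
    while i < len(nums) and nums[i] <= len(nums):
        step = nums[i]
        del nums[step - 1::step]         # drop every step-th survivor
        i += 1
    return nums

def is_prime(n):
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def first_primes(k):
    """Return the first k primes."""
    out, n = [], 2
    while len(out) < k:
        if is_prime(n):
            out.append(n)
        n += 1
    return out

def matching_ages(limit=100):
    """Lucky primes k <= limit dividing the sum of squares of the first k primes."""
    return [k for k in lucky_numbers(limit)
            if is_prime(k) and sum(p * p for p in first_primes(k)) % k == 0]

print(lucky_numbers(50))
print(matching_ages())
```

Note that 19 does not appear among the lucky numbers, which is why it is rejected, while 37 is a lucky prime that also passes the divisibility test.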
In general the model completely ignores the "lucky" part, and when it does notice it, it claims that it is something to ignore:
The problem mentions that the age is a “lucky prime number” k, but I think the term “lucky” might just be an adjective here to make it sound more interesting, rather than referring to a specific type of prime number. So, I’ll interpret it as just a prime number k.
What is interesting, not seen in previous models, is that there are several attempts at using modular arithmetic to reduce the chance of mistakes:
k=11.
Primes: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31.
Squares: 4, 9, 25, 49, 121, 169, 289, 361, 529, 841, 961.
Mod 11: 4 mod 11 = 4, 9 mod 11 = 9, 25 mod 11 = 3, 49 mod 11 = 5, 121 mod 11 = 0, 169 mod 11 = 10, 289 mod 11 = 3, 361 mod 11 = 3, 529 mod 11 = 3, 841 mod 11 = 6, 961 mod 11 = 3.
Sum mod 11: 4 + 9 + 3 + 5 + 0 + 10 + 3 + 3 + 3 + 6 + 3 = 49 mod 11 = 5, not 0.
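The idea behind the quoted attempt is sound: reducing every term modulo k keeps the intermediate numbers small. A minimal sketch of that technique, with illustrative function names:

```python
def first_primes(k):
    """Return the first k primes using trial division."""
    out, n = [], 2
    while len(out) < k:
        if all(n % p for p in out if p * p <= n):
            out.append(n)
        n += 1
    return out

def sum_of_squares_mod(k):
    """Residue modulo k of the sum of squares of the first k primes."""
    residue = 0
    for p in first_primes(k):
        # Reduce before squaring and after adding, so values stay below k*k.
        residue = (residue + (p % k) ** 2) % k
    return residue

print(sum_of_squares_mod(11))  # 3
```

For k=11 the residue is 3, which agrees with 3358 mod 11. The quoted attempt arrives at 5 because some of its per-term residues are off (for example, 361 mod 11 is 9, not 3), so the technique only reduces mistakes if each reduction is itself done correctly.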
Of course, there is one answer in Chinese, but there is none where the model gets frustrated.
Outputs of Deepseek
The model knows to handle both sides of the question:
\(k\) is a “lucky” prime number.
“Lucky” primes are a subset of prime numbers derived from the “lucky number” sieve, similar to the Sieve of Eratosthenes. The first few lucky primes are:\[3, 7, 13, 31, 37, 43, 67, 73, 79, 127, \dots\]
The sum of the squares of the first \(k\) primes is divisible by \(k\). That is, if \(p_1, p_2, \dots, p_k\) are the first \(k\) primes, then:
\[p_1^2 + p_2^2 + \dots + p_k^2 \equiv 0 \pmod{k}\]
But the model has a big problem: it ends its output too soon, as we have seen in the previous problems. This means that the model has no chance to get to the solution.
In fact, it’s worse. The model tries to perform the sieving operation itself:
- Start with the list of natural numbers: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …
- Eliminate every second number (i.e., 2, 4, 6, 8, …): 1, 3, 5, 7, 9, …
- The next number after 1 that remains is 3. Eliminate every third number: 1, 3, 7, 9, 13, 15, 19, …
- The next remaining number is 7. Eliminate every
Here, as usual, it ran out of output. It stopped at "every".
For a very small number of answers, the model hints at what the "lucky" primes are but then starts computing the sum of squares of primes for all primes:
First, let’s clarify what a “lucky prime” is. Lucky primes are primes that survive a particular sieving process, but for this problem, we’ll proceed by treating \(k\) as any prime number and check for the given condition.
The model also makes math mistakes when it gets to the stage of doing these computations:
- \(k = 13\)
- Primes: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41
- Sum of squares: \(4 + 9 + 25 + 49 + 121 + 169 + 289 + 361 + 529 + 841 + 961 + 1369 + 1681 = 6367\)
- Check divisibility: \(6367 \div 13 = 489.769...\) (not an integer).
Once, the model tried to output some code:
def is_prime(n):
    """Check if a number is prime."""
    if n < 2:
        return False
    for i in range(2, int(n**0.5)+1):
        if n % i == 0:
            return False
    return True

def sieve_lucky_primes(limit):
    """Generate lucky primes up to a limit."""
    # Generate lucky numbers
    lucky = list(range(1, limit*10, 2))
    idx = 1
    while idx < len(lucky):
        step = lucky[idx]
        lucky = [x for i, x in enumerate(lucky, 1) if i % step != 0]
        idx += 1
    # Extract lucky primes
    return [x for x in lucky if is_prime(x)]

def primes(n):
    """Generate the first n primes."""
    primes = []
    num = 2
    while len(primes) < n:
        if is_prime(num):
            primes.append(num)
        num += 1
    return primes

def solve():
    # List of lucky primes (up to a reasonable limit)
But even this one is incomplete and has several bugs. Can you find them?
Parting words
First we look at the size of the output, just like in the previous articles:
[solutions] λ wc --total=only Q*/qwq/*
183144 1211953 5915200
[solutions] λ wc --total=only Q*/deepseek/*
22474 176481 928415
QwQ outputs are of similar size to the OpenAI models. But here we have only one model, whereas for OpenAI we had 6 and each was asked twice.
Deepseek outputs are the smallest of all the models we have tested. But this is mostly because the model really tends to stop midway.
And with this, we have completed analyzing all these models. I ran all experiments back in January (with a few days of delay for Deepseek) and it took all this time to analyze the answers. There will be some summary article, and then I definitely need to find a different approach for the next round of benchmarks. New, more powerful models have been released.
PS: These models don't give out weird large ages as answers. For the first problem, the models produce 225 and 441 as ages, but in general they then eliminate them as not being in the range for a valid human age. For the other problems, the models don't get to considering numbers greater than the correct answer, anyway.