Concluding the AI 2025 puzzle competition
Slightly more than half a year ago, I launched the AI puzzle competition: I selected 3 math problems and a common sense one and then looked at how OpenAI models, Gemini models, Claude models, Mistral models, Llama models, QwQ, and Deepseek perform on these problems. To get more data, I also tested the effect of various prompt hints. Now, it is time to summarize all of this into a single article.
We are interested in the average performance of each model, relative to its cost. We are also interested in finding out if any of the prompt hints helped significantly. Finally, it would be good to see if these statistics change when considering that the first 3 problems are focused on math, whereas the last one is just a common-sense question.
There are a lot of numbers introduced over the previous 10 articles. We have 17 different prompt scenarios and 28 different models. For the OpenAI models we also have to consider that we ran the prompts twice: once with the prompt hints given as part of the user prompt and once with them given as part of the system prompt – for the summarization performed in this article we will treat these the same, as if the model had been questioned twice. Rather than looking at these numbers manually, I vibe-coded a Colab notebook where I copy-pasted the tables from the markdown files – with a few changes – and then used Python code to extract the metrics we’re interested in and to build the plots in this article.
Parsing the data 🔗
The first thing we need to do is to take all the tables from the previous articles and convert them into pandas dataframes. In the colab, this is done by the following code:
import pandas as pd
import io
import itertools
# Function to transform the prompt hints
def transform_hints(hints):
    # Define the mapping for prompt hints to single letters
    hint_mapping = {
        'none': 'N',
        'skill': 'S',
        'urgency': 'U',
        'reward': 'R',
        'penalty': 'P',
        'gradual': 'G',
        'check': 'C',
    }
    if hints == 'none':
        return 'N'
    elif hints == 'all':
        return 'SURPGC'
    else:
        hint_list = hints.split(',')
        transformed_hints = ''.join([hint_mapping.get(hint.strip(), '') for hint in hint_list])
        return transformed_hints
def parse_first_three_problems(text_data):
    """
    Parses the tables from the first three problems from the blog data and transforms the prompt hints.

    Args:
        text_data (str): A string containing the data in a table format.

    Returns:
        pd.DataFrame: A pandas DataFrame containing the extracted and transformed data.
    """
    # Remove separator lines from the text data
    lines = text_data.strip().split('\n')
    cleaned_lines = [line for line in lines if '---' not in line and '===' not in line]
    cleaned_text_data = '\n'.join(cleaned_lines)

    # Read the cleaned text data into a pandas DataFrame
    # Use StringIO to treat the string as a file
    df = pd.read_csv(io.StringIO(cleaned_text_data), sep='|', header=0, skipinitialspace=True)

    # Drop the first and last columns which are empty due to the leading/trailing '|'
    df = df.iloc[:, 1:-1]

    # Remove leading/trailing whitespace from column names
    df.columns = df.columns.str.strip()

    # Process each column: strip whitespace and convert to expected type
    for col in df.columns:
        if col != 'Prompt hints':
            df[col] = df[col].astype(str).str.strip().astype(float)
        else:
            df[col] = df[col].astype(str).str.strip().apply(transform_hints)

    return df
def parse_last_problem(text_data):
    """
    Parses the table data with models as rows and prompt hints as columns.

    Args:
        text_data (str): A string containing the data in a table format.

    Returns:
        pd.DataFrame: A pandas DataFrame containing the extracted and transformed data.
    """
    # Remove separator lines from the text data
    lines = text_data.strip().split('\n')
    cleaned_lines = [line for line in lines if '---' not in line and '======' not in line]
    cleaned_text_data = '\n'.join(cleaned_lines)

    # Read the cleaned text data into a pandas DataFrame
    # Use StringIO to treat the string as a file
    df = pd.read_csv(io.StringIO(cleaned_text_data), sep='|', header=0, skipinitialspace=True)

    # Drop the first and last columns which are empty due to the leading/trailing '|'
    df = df.iloc[:, 1:-1]

    # Remove leading/trailing whitespace from column names
    df.columns = df.columns.str.strip()

    # Process each column: strip whitespace
    for col in df.columns:
        if col != 'Models':
            df[col] = df[col].astype(str).str.strip().astype(float)
        else:
            df[col] = df[col].astype(str).str.strip()

    return df
There is some duplication here which could be removed, but I found out that the AI gets confused when I ask it to do that. Plus, I want to publish this article today, so there’s not much time to fiddle with the output.
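For the record, the duplication amounts to just two parameters; a hypothetical deduplicated parser (this is only a sketch, not the code in the colab – `parse_problem_table`, `key_column`, and `hint_transform` are names I made up) would look like this:

# Hypothetical, deduplicated version (not the code used in the colab):
# the two parsers above differ only in the key column and in whether
# the prompt hints get transformed.
def parse_problem_table(text_data, key_column, hint_transform=None):
    lines = text_data.strip().split('\n')
    cleaned_lines = [line for line in lines if '---' not in line and '===' not in line]
    cleaned_text_data = '\n'.join(cleaned_lines)

    df = pd.read_csv(io.StringIO(cleaned_text_data), sep='|', header=0, skipinitialspace=True)
    df = df.iloc[:, 1:-1]
    df.columns = df.columns.str.strip()

    for col in df.columns:
        if col != key_column:
            df[col] = df[col].astype(str).str.strip().astype(float)
        elif hint_transform is not None:
            df[col] = df[col].astype(str).str.strip().apply(hint_transform)
        else:
            df[col] = df[col].astype(str).str.strip()
    return df

# parse_first_three_problems(t) would become parse_problem_table(t, 'Prompt hints', transform_hints)
# parse_last_problem(t) would become parse_problem_table(t, 'Models')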
To convert the text data into the dataframes, I used code such as
oai_p1 = parse_first_three_problems("""....""")
# copy paste the markdown table into the above
# similar dataframes for the other models and problems

oai_p4 = parse_last_problem("""....""")
# same for the last problem
# also, don't forget the system prompts for OpenAI
Then, I asked the AI to construct some lists with all dataframes for each problem:
p1_dfs = [oai_p1, oai_p1_system, google_p1, claude_p1, mistral_p1, llama_p1, other_p1]
p2_dfs = [oai_p2, oai_p2_system, google_p2, claude_p2, mistral_p2, llama_p2, other_p2]
p3_dfs = [oai_p3, oai_p3_system, google_p3, claude_p3, mistral_p3, llama_p3, other_p3]
p4_dfs = [oai_p4, oai_p4_system, google_p4, claude_p4, mistral_p4, llama_p4, other_p4]
all_dfs = p1_dfs + p2_dfs + p3_dfs + p4_dfs
Finally, I asked the AI to summarize how many models and how many prompt hints are in all these dataframes:
all_prompt_hints = set()

for df in itertools.chain(p1_dfs, p2_dfs, p3_dfs):
    all_prompt_hints.update(df['Prompt hints'].unique())

for df in p4_dfs:
    all_prompt_hints.update(df.columns.difference(['Models']))

all_prompt_hints = list(all_prompt_hints)
print(f"Got {len(all_prompt_hints)} prompt hints: {all_prompt_hints}")

# ---

all_models = set()

for df in itertools.chain(p1_dfs, p2_dfs, p3_dfs):
    all_models.update(df.columns.difference(['Prompt hints']))

for df in p4_dfs:
    all_models.update(df['Models'].unique())

all_models = list(all_models)
print(f"Got {len(all_models)} models: {all_models}")
The model did not want to use `itertools.chain`; it preferred using two separate loops. That’s a change I made manually. Similarly, the model does not seem to want to use f-strings or to combine multiple `print` statements into one, so I had to intervene manually.
In any case, I had to change the headers of some of the tables (since the models were named differently between the articles for the first three problems and the article for the last one). For the last problem, I also needed to split the single table into multiple ones, so that each problem would be treated the same way. Maybe in the future I could vibe-code something to extract the numbers directly from the previous blog articles.
At this point, we got 17 prompt hints and 28 models, which is exactly as expected. Now, we can move to generating the statistics.
Average model performances 🔗
To compute the average score for each model, I asked the AI to vibe-code the aggregation.
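A minimal sketch of that aggregation, assuming the per-problem dataframe lists built above (the `Problem` helper column and the math/other column names are illustrative; only `combined_df`, `final_average_scores`, and `Average Total Score` are reused below), looks like this:

# Sketch of the aggregation (not the exact vibe-coded cell): melt every
# per-problem table into one long dataframe with Model / Prompt hints / Score
# columns, then average per model.
frames = []
for problem, dfs in enumerate([p1_dfs, p2_dfs, p3_dfs], start=1):
    for df in dfs:
        long_df = df.melt(id_vars='Prompt hints', var_name='Model', value_name='Score')
        long_df['Problem'] = problem
        frames.append(long_df)
for df in p4_dfs:
    long_df = df.melt(id_vars='Models', var_name='Prompt hints', value_name='Score')
    long_df = long_df.rename(columns={'Models': 'Model'})
    long_df['Problem'] = 4
    frames.append(long_df)
combined_df = pd.concat(frames, ignore_index=True)

# Average over all problems, over the math problems only, and over the duck problem only
total = combined_df.groupby('Model')['Score'].mean().rename('Average Total Score')
math = combined_df[combined_df['Problem'] <= 3].groupby('Model')['Score'].mean().rename('Average Math Score')
other = combined_df[combined_df['Problem'] == 4].groupby('Model')['Score'].mean().rename('Average Other Score')
final_average_scores = pd.concat([total, math, other], axis=1).reset_index()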
Once again the model refused to use `itertools`. I also had to delete several lines where the model was checking assumptions that were already valid: it inserted a line to coerce errors to `nan` values and added multiple `if` statements to check that the dataframes contained the rows/columns used in the code. Nevertheless, the AI taught me about `df.melt`, so that’s something I learned.
After spot-checking the results for correctness, I copied the dataframe into the following table:
Model | Total Score | Math Score | Other Score |
---|---|---|---|
`gpt-3.5-turbo` | 2.63 | 2.72 | 2.06 |
`gpt-4` | 4.42 | 3.73 | 8.47 |
`gpt-4-turbo` | 4.87 | 3.99 | 10.00 |
`gpt-4o` | 5.96 | 5.26 | 10.00 |
`gpt-4o-mini` | 3.75 | 4.39 | 0.06 |
`o1-mini` | 4.23 | 4.97 | 0.00 |
`gemini-pro` | 1.53 | 1.80 | 0.00 |
`gemini-1.5-pro` | 6.26 | 5.61 | 9.89 |
`gemini-1.5-flash` | 3.93 | 4.62 | 0.00 |
`gemini-1.5-flash-8b` | 3.70 | 4.35 | 0.00 |
`gemini-2.0-flash` | 4.38 | 5.15 | 0.00 |
`gemini-2.0-flash-thinking` | 4.48 | 5.26 | 0.00 |
`claude-3-haiku` | 3.01 | 3.42 | 0.67 |
`claude-3-opus` | 3.35 | 3.94 | 0.00 |
`claude-3-sonnet` | 2.84 | 3.09 | 1.44 |
`claude-3.5-haiku` | 4.05 | 4.77 | 0.00 |
`claude-3.5-sonnet` | 4.91 | 5.05 | 4.11 |
`mistral-tiny` | 0.74 | 0.87 | 0.00 |
`mistral-small` | 2.77 | 3.26 | 0.00 |
`mistral-large` | 3.78 | 4.45 | 0.00 |
`ministral-3b` | 1.59 | 1.83 | 0.22 |
`ministral-8b` | 1.70 | 2.00 | 0.00 |
`llama-3.1-sonar-small` | 4.23 | 3.22 | 9.89 |
`llama-3.1-sonar-large` | 5.50 | 4.73 | 9.89 |
`llama-3.1-sonar-huge` | 5.82 | 5.10 | 9.89 |
`llama-3.3-70b-instruct` | 3.69 | 4.34 | 0.00 |
QwQ | 4.18 | 4.92 | 0.00 |
Deepseek | 3.56 | 3.64 | 3.11 |
Looking at this table, it seems `gemini-1.5-pro` performed the best, mostly due to a stellar performance on the math problems and an almost perfect one on the common-sense “duck” problem. Testing Gemini 2.5 – outside of the competition, since that model was released many months after the competition launched – produced even better scores.
Note that only GPT-4 models manage to score perfectly on the common sense problem. Maybe GPT-5 models, when released, will perform better?
On the other hand, `mistral-tiny` performs the worst, both scoring low on the math problems and completely failing the common-sense question.
Nevertheless, it is also important to look at the performance per dollar spent. Although costs from some providers are going down and some models are offered at different prices by different providers, it is still good to look at this distribution.
For this, I first build a dictionary with all the costs, as collected from the various provider dashboards:
costs = {
    # OpenAI provides costs for both input and output (total: 67.373$)
    "`gpt-3.5-turbo`": 0.005 + 0.315,
    ...
    # For Google models I only have access to the total spend, so we compute an average (total: 1.75$)
    "`gemini-pro`": 1.75 / 6,
    ...
    # Anthropic gives the total cost for each model, directly (total: 26.06$)
    "`claude-3-haiku`": 0.23,
    ...
    # Mistral shows cost per model, split by input and output (total: 7.30$)
    "`mistral-tiny`": 0.00471 + 0.77054,
    ...
    # Perplexity only shows the current pay period, so estimating cost (total: 12.55$)
    "`llama-3.1-sonar-small`": 12.55 / 4,
    ...
    # DeepInfra was used as a provider, provides input and output costs (total: 0.63$)
    "QwQ": 0.01 + 0.36,
    "Deepseek": 0.02 + 0.24,
}
Then, I compute the total number of times each model has been prompted:

num_prompts = combined_df.groupby('Model')['Score'].count()
Finally, I vibe-code a column with the average cost per prompt for each model, adding it to the previous dataframe:
average_costs = {}
for model, total_cost in costs.items():
    average_costs[model] = total_cost / num_prompts[model]

average_costs_df = pd.DataFrame(list(average_costs.items()), columns=['Model', 'Average Cost'])
final_average_scores = final_average_scores.merge(average_costs_df, on='Model', how='left')
display(final_average_scores)
The next step is to vibe-code the following plotting code:
import matplotlib.pyplot as plt
import numpy as np

# Sort the data by cost and then by score
df_sorted = final_average_scores.sort_values(by=['Average Cost', 'Average Total Score'])

# Identify points on the Pareto frontier
# A point is on the Pareto frontier if there is no other point
# that is better in both cost and score.
frontier_points = []
max_score = -np.inf
for index, row in df_sorted.iterrows():
    if row['Average Total Score'] > max_score:
        frontier_points.append(row)
        max_score = row['Average Total Score']

frontier_df = pd.DataFrame(frontier_points)

# Create the scatter plot
plt.figure(figsize=(12, 8))
plt.scatter(final_average_scores['Average Cost'], final_average_scores['Average Total Score'], label='All Models')
plt.scatter(frontier_df['Average Cost'], frontier_df['Average Total Score'], color='red', label='Pareto Frontier')

# Add labels for each model
for i, row in final_average_scores.iterrows():
    plt.text(row['Average Cost'], row['Average Total Score'], row['Model'], fontsize=9, ha='left', va='bottom')

plt.xlabel('Cost per prompt')
plt.ylabel('Average Model Score')
plt.title('Average Model Score vs. Average Model Cost')
plt.legend()
plt.grid(True)
plt.show()
This generates the following plot, comparing all the models:

This needs zooming in, so let’s vibe-code that. Without any cleanup, this is what the model produces:
import matplotlib.pyplot as plt
import numpy as np

# Sort the data by cost and then by score
df_sorted = final_average_scores.sort_values(by=['Average Cost', 'Average Total Score'])

# Identify points on the Pareto frontier
# A point is on the Pareto frontier if there is no other point
# that is better in both cost and score.
frontier_points = []
max_score = -np.inf
for index, row in df_sorted.iterrows():
    if row['Average Total Score'] > max_score:
        frontier_points.append(row)
        max_score = row['Average Total Score']

frontier_df = pd.DataFrame(frontier_points)

# Create the scatter plot
plt.figure(figsize=(12, 8))
plt.scatter(final_average_scores['Average Cost'], final_average_scores['Average Total Score'], label='All Models')
plt.scatter(frontier_df['Average Cost'], frontier_df['Average Total Score'], color='red', label='Pareto Frontier')

# Add labels for each model
for i, row in final_average_scores.iterrows():
    plt.text(row['Average Cost'], row['Average Total Score'], row['Model'], fontsize=9, ha='left', va='bottom')

plt.xlabel('Cost per prompt')
plt.ylabel('Average Model Score')
plt.title('Average Model Score vs. Average Model Cost')
plt.legend()
plt.grid(True)
plt.xlim(0.00, 0.01)  # Set x-axis limits
plt.show()
It does the same imports again, computes the same frontier again, and repeats the entire script, all for the benefit of adding one `plt.xlim` line near the end. In fact, I also needed to ask the model to trim the image, as initially it was generating labels for the points outside of the zoomed-in area. In any case, the fixed zoomed image is:

Then, I vibe-coded the following to display the models on the Pareto frontier:
# Display the models on the Pareto frontier
print("Models on the Pareto Frontier:")
display(frontier_df['Model'])
These are:

- `gpt-3.5-turbo`
- `claude-3-haiku`
- Deepseek
- `gemini-1.5-flash-8b`
- `gemini-1.5-flash`
- `gemini-2.0-flash`
- `gemini-2.0-flash-thinking`
- `gemini-1.5-pro`
Of course, the Gemini models are all showing up here because they are all assigned the same average cost. I should probably find out the cost of each individual model and update the article. Fortunately, that is as simple as updating a cell in the colab, running it again, saving the images, and copying them to the blog system.
Effect of prompt hints 🔗
Besides determining which model is “better”, the other goal of this competition was to determine whether prompt engineering really has an effect. Since the prompt hints are additive, we can compute the effect of each of them as the following ratio:
\[ Effect(P) = \frac{Score(P)}{Score(\neg P)} \]
where \(Score(P)\) is the average score of the model when the prompt hint \(P\) is included and \(Score(\neg P)\) is the average score when it is not included. This can be vibe-coded as:
hint_effects = {}

# Iterate over each individual hint
for hint in ['S', 'U', 'R', 'P', 'G', 'C']:
    hint_effects[hint] = {}

    # Filter dataframes where the prompt contains the current hint
    df_with_hint = combined_df[combined_df['Prompt hints'].str.contains(hint, na=False)]
    # Filter dataframes where the prompt does NOT contain the current hint
    df_without_hint = combined_df[~combined_df['Prompt hints'].str.contains(hint, na=False)]

    # Compute average scores for models with the hint
    avg_scores_with_hint = df_with_hint.groupby('Model')['Score'].mean().reset_index()
    avg_scores_with_hint = avg_scores_with_hint.rename(columns={'Score': f'Average Score with {hint}'})

    # Compute average scores for models without the hint
    avg_scores_without_hint = df_without_hint.groupby('Model')['Score'].mean().reset_index()
    avg_scores_without_hint = avg_scores_without_hint.rename(columns={'Score': f'Average Score without {hint}'})

    # Merge the two dataframes
    merged_scores = avg_scores_with_hint.merge(avg_scores_without_hint, on='Model', how='outer')

    # Calculate the ratio in average scores
    merged_scores[f'Effect of {hint}'] = merged_scores[f'Average Score with {hint}'] / merged_scores[f'Average Score without {hint}']

    hint_effects[hint] = merged_scores

# Display the results for each hint
for hint, df in hint_effects.items():
    print(f"\nEffect of hint '{hint}':")
    display(df)
For the `You are a famous mathematician, an expert in number theory` prompt, we have the following:
Model | With hint | Without hint | Ratio |
---|---|---|---|
`gpt-3.5-turbo` | 2.36 | 2.81 | 0.84 |
`gpt-4` | 3.39 | 5.15 | 0.66 |
`gpt-4-turbo` | 3.84 | 5.60 | 0.68 |
`gpt-4o` | 5.19 | 6.50 | 0.80 |
`gpt-4o-mini` | 4.46 | 3.25 | 1.37 |
`o1-mini` | 4.96 | 3.74 | 1.33 |
`gemini-pro` | 2.48 | 0.89 | 2.77 |
`gemini-1.5-pro` | 5.63 | 6.67 | 0.84 |
`gemini-1.5-flash` | 4.58 | 3.49 | 1.31 |
`gemini-1.5-flash-8b` | 4.15 | 3.40 | 1.22 |
`gemini-2.0-flash` | 4.99 | 3.97 | 1.26 |
`gemini-2.0-flash-thinking` | 5.33 | 3.91 | 1.36 |
`claude-3-haiku` | 3.60 | 2.61 | 1.38 |
`claude-3-opus` | 3.42 | 3.29 | 1.04 |
`claude-3-sonnet` | 3.10 | 2.68 | 1.16 |
`claude-3.5-haiku` | 4.63 | 3.67 | 1.26 |
`claude-3.5-sonnet` | 5.05 | 4.82 | 1.05 |
`mistral-tiny` | 1.24 | 0.41 | 3.04 |
`mistral-small` | 3.04 | 2.59 | 1.17 |
`mistral-large` | 4.41 | 3.37 | 1.31 |
`ministral-3b` | 1.56 | 1.60 | 0.97 |
`ministral-8b` | 1.81 | 1.62 | 1.11 |
`llama-3.1-sonar-small` | 3.21 | 4.91 | 0.65 |
`llama-3.1-sonar-large` | 4.60 | 6.11 | 0.75 |
`llama-3.1-sonar-huge` | 5.13 | 6.27 | 0.82 |
`llama-3.3-70b-instruct` | 4.35 | 3.24 | 1.34 |
QwQ | 4.90 | 3.70 | 1.32 |
Deepseek | 3.78 | 3.42 | 1.11 |
That is, telling the model that it has the skill needed to solve the problem results in a significant performance increase for `mistral-tiny` and `gemini-pro`. In total, 19 out of the 28 models seem to benefit from this hint.
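Counts like this can be read directly off the `hint_effects` dict built earlier; a quick sketch:

# Count, for each hint, how many models have a ratio above 1
for hint, df in hint_effects.items():
    improved = (df[f'Effect of {hint}'] > 1).sum()
    print(f"Hint {hint}: {improved} out of {len(df)} models benefit")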
For the `Solving this problem is important` prompt, we have the following:
Model | With hint | Without hint | Ratio |
---|---|---|---|
`gpt-3.5-turbo` | 3.07 | 2.39 | 1.28 |
`gpt-4` | 4.15 | 4.56 | 0.91 |
`gpt-4-turbo` | 4.43 | 5.11 | 0.87 |
`gpt-4o` | 5.65 | 6.12 | 0.92 |
`gpt-4o-mini` | 3.98 | 3.63 | 1.10 |
`o1-mini` | 4.44 | 4.12 | 1.08 |
`gemini-pro` | 1.22 | 1.68 | 0.72 |
`gemini-1.5-pro` | 5.95 | 6.41 | 0.93 |
`gemini-1.5-flash` | 4.06 | 3.86 | 1.05 |
`gemini-1.5-flash-8b` | 3.89 | 3.60 | 1.08 |
`gemini-2.0-flash` | 4.69 | 4.22 | 1.11 |
`gemini-2.0-flash-thinking` | 4.81 | 4.31 | 1.12 |
`claude-3-haiku` | 3.15 | 2.94 | 1.07 |
`claude-3-opus` | 3.29 | 3.38 | 0.97 |
`claude-3-sonnet` | 2.54 | 3.00 | 0.85 |
`claude-3.5-haiku` | 4.10 | 4.03 | 1.02 |
`claude-3.5-sonnet` | 4.54 | 5.10 | 0.89 |
`mistral-tiny` | 0.29 | 0.97 | 0.30 |
`mistral-small` | 2.78 | 2.77 | 1.00 |
`mistral-large` | 4.00 | 3.68 | 1.09 |
`ministral-3b` | 1.33 | 1.72 | 0.78 |
`ministral-8b` | 2.00 | 1.55 | 1.29 |
`llama-3.1-sonar-small` | 3.96 | 4.36 | 0.91 |
`llama-3.1-sonar-large` | 5.40 | 5.56 | 0.97 |
`llama-3.1-sonar-huge` | 5.64 | 5.91 | 0.96 |
`llama-3.3-70b-instruct` | 4.07 | 3.50 | 1.16 |
QwQ | 4.52 | 4.02 | 1.13 |
Deepseek | 3.72 | 3.48 | 1.07 |
Here, model performance with and without the hint is mostly similar: there is no significant increase in performance, nor a significant degradation (except for `mistral-tiny`).
For the `You will get rewarded if you get the right answer` prompt, we have the following:
Model | With hint | Without hint | Ratio |
---|---|---|---|
`gpt-3.5-turbo` | 2.61 | 2.63 | 0.99 |
`gpt-4` | 4.15 | 4.58 | 0.91 |
`gpt-4-turbo` | 4.90 | 4.86 | 1.00 |
`gpt-4o` | 5.90 | 5.99 | 0.99 |
`gpt-4o-mini` | 3.76 | 3.75 | 1.00 |
`o1-mini` | 4.29 | 4.19 | 1.02 |
`gemini-pro` | 0.70 | 1.97 | 0.36 |
`gemini-1.5-pro` | 6.13 | 6.32 | 0.97 |
`gemini-1.5-flash` | 3.89 | 3.95 | 0.98 |
`gemini-1.5-flash-8b` | 3.88 | 3.60 | 1.08 |
`gemini-2.0-flash` | 4.44 | 4.34 | 1.02 |
`gemini-2.0-flash-thinking` | 4.58 | 4.42 | 1.02 |
`claude-3-haiku` | 3.16 | 2.93 | 1.08 |
`claude-3-opus` | 3.32 | 3.36 | 0.99 |
`claude-3-sonnet` | 2.65 | 2.95 | 0.90 |
`claude-3.5-haiku` | 4.10 | 4.02 | 1.02 |
`claude-3.5-sonnet` | 4.79 | 4.98 | 0.96 |
`mistral-tiny` | 0.86 | 0.68 | 1.27 |
`mistral-small` | 3.12 | 2.59 | 1.21 |
`mistral-large` | 3.85 | 3.75 | 1.03 |
`ministral-3b` | 1.53 | 1.62 | 0.95 |
`ministral-8b` | 2.07 | 1.50 | 1.38 |
`llama-3.1-sonar-small` | 4.08 | 4.31 | 0.95 |
`llama-3.1-sonar-large` | 5.44 | 5.54 | 0.98 |
`llama-3.1-sonar-huge` | 5.86 | 5.79 | 1.01 |
`llama-3.3-70b-instruct` | 3.80 | 3.63 | 1.05 |
QwQ | 4.28 | 4.13 | 1.04 |
Deepseek | 3.32 | 3.69 | 0.90 |
This is similar to the urgency prompt above: not many differences between including the hint and leaving it out.
For the `You will get punished if you get the wrong answer` prompt, we have the following:
Model | With hint | Without hint | Ratio |
---|---|---|---|
`gpt-3.5-turbo` | 2.79 | 2.55 | 1.09 |
`gpt-4` | 4.38 | 4.44 | 0.99 |
`gpt-4-turbo` | 5.04 | 4.80 | 1.05 |
`gpt-4o` | 6.00 | 5.94 | 1.01 |
`gpt-4o-mini` | 3.64 | 3.80 | 0.96 |
`o1-mini` | 4.19 | 4.24 | 0.99 |
`gemini-pro` | 0.72 | 1.87 | 0.38 |
`gemini-1.5-pro` | 6.30 | 6.24 | 1.01 |
`gemini-1.5-flash` | 3.74 | 4.01 | 0.93 |
`gemini-1.5-flash-8b` | 3.77 | 3.66 | 1.03 |
`gemini-2.0-flash` | 4.38 | 4.37 | 1.00 |
`gemini-2.0-flash-thinking` | 4.41 | 4.50 | 0.98 |
`claude-3-haiku` | 2.46 | 3.24 | 0.76 |
`claude-3-opus` | 3.18 | 3.42 | 0.93 |
`claude-3-sonnet` | 2.63 | 2.93 | 0.90 |
`claude-3.5-haiku` | 3.69 | 4.20 | 0.88 |
`claude-3.5-sonnet` | 4.28 | 5.18 | 0.83 |
`mistral-tiny` | 0.67 | 0.77 | 0.86 |
`mistral-small` | 3.03 | 2.66 | 1.14 |
`mistral-large` | 3.63 | 3.85 | 0.94 |
`ministral-3b` | 1.77 | 1.51 | 1.18 |
`ministral-8b` | 2.02 | 1.56 | 1.29 |
`llama-3.1-sonar-small` | 4.12 | 4.28 | 0.96 |
`llama-3.1-sonar-large` | 5.78 | 5.38 | 1.07 |
`llama-3.1-sonar-huge` | 6.02 | 5.73 | 1.05 |
`llama-3.3-70b-instruct` | 3.68 | 3.69 | 1.00 |
QwQ | 4.08 | 4.23 | 0.96 |
Deepseek | 3.49 | 3.59 | 0.97 |
Here, too, we see the same behavior: the hint has a negative effect on `gemini-pro`, while the other models are left relatively unchanged.
For the `Think step by step` prompt, we have the following:
Model | With hint | Without hint | Ratio |
---|---|---|---|
`gpt-3.5-turbo` | 2.37 | 2.73 | 0.86 |
`gpt-4` | 4.56 | 4.36 | 1.04 |
`gpt-4-turbo` | 4.65 | 4.96 | 0.93 |
`gpt-4o` | 5.85 | 6.00 | 0.97 |
`gpt-4o-mini` | 3.99 | 3.65 | 1.09 |
`o1-mini` | 4.57 | 4.09 | 1.11 |
`gemini-pro` | 3.29 | 0.82 | 3.97 |
`gemini-1.5-pro` | 6.36 | 6.21 | 1.02 |
`gemini-1.5-flash` | 4.06 | 3.87 | 1.04 |
`gemini-1.5-flash-8b` | 3.62 | 3.72 | 0.97 |
`gemini-2.0-flash` | 4.49 | 4.33 | 1.03 |
`gemini-2.0-flash-thinking` | 4.89 | 4.31 | 1.13 |
`claude-3-haiku` | 3.42 | 2.84 | 1.20 |
`claude-3-opus` | 3.56 | 3.25 | 1.09 |
`claude-3-sonnet` | 3.08 | 2.74 | 1.12 |
`claude-3.5-haiku` | 4.21 | 3.99 | 1.05 |
`claude-3.5-sonnet` | 4.75 | 4.97 | 0.95 |
`mistral-tiny` | 1.64 | 0.38 | 4.21 |
`mistral-small` | 2.82 | 2.75 | 1.02 |
`mistral-large` | 3.89 | 3.74 | 1.03 |
`ministral-3b` | 2.05 | 1.40 | 1.45 |
`ministral-8b` | 1.37 | 1.82 | 0.75 |
`llama-3.1-sonar-small` | 4.39 | 4.16 | 1.05 |
`llama-3.1-sonar-large` | 5.26 | 5.59 | 0.94 |
`llama-3.1-sonar-huge` | 5.74 | 5.84 | 0.98 |
`llama-3.3-70b-instruct` | 3.67 | 3.69 | 0.99 |
QwQ | 4.32 | 4.13 | 1.04 |
Deepseek | 3.79 | 3.47 | 1.09 |
Here, `gemini-pro` and `mistral-tiny` benefit significantly from the hint.
Finally, for the `Double check your answers` prompt, we have the following:
Model | With hint | Without hint | Ratio |
---|---|---|---|
`gpt-3.5-turbo` | 2.32 | 2.74 | 0.84 |
`gpt-4` | 4.37 | 4.44 | 0.98 |
`gpt-4-turbo` | 4.77 | 4.91 | 0.97 |
`gpt-4o` | 5.72 | 6.05 | 0.94 |
`gpt-4o-mini` | 3.93 | 3.67 | 1.06 |
`o1-mini` | 4.47 | 4.13 | 1.08 |
`gemini-pro` | 3.11 | 0.90 | 3.45 |
`gemini-1.5-pro` | 6.24 | 6.26 | 0.99 |
`gemini-1.5-flash` | 4.08 | 3.86 | 1.05 |
`gemini-1.5-flash-8b` | 3.59 | 3.73 | 0.96 |
`gemini-2.0-flash` | 4.57 | 4.29 | 1.06 |
`gemini-2.0-flash-thinking` | 4.86 | 4.32 | 1.12 |
`claude-3-haiku` | 3.32 | 2.88 | 1.15 |
`claude-3-opus` | 3.64 | 3.22 | 1.13 |
`claude-3-sonnet` | 2.98 | 2.78 | 1.07 |
`claude-3.5-haiku` | 4.23 | 3.97 | 1.06 |
`claude-3.5-sonnet` | 5.20 | 4.79 | 1.08 |
`mistral-tiny` | 1.11 | 0.59 | 1.88 |
`mistral-small` | 2.65 | 2.81 | 0.94 |
`mistral-large` | 3.75 | 3.79 | 0.99 |
`ministral-3b` | 0.97 | 1.83 | 0.53 |
`ministral-8b` | 1.95 | 1.59 | 1.22 |
`llama-3.1-sonar-small` | 4.08 | 4.28 | 0.95 |
`llama-3.1-sonar-large` | 5.11 | 5.65 | 0.90 |
`llama-3.1-sonar-huge` | 5.73 | 5.84 | 0.98 |
`llama-3.3-70b-instruct` | 3.71 | 3.67 | 1.01 |
QwQ | 4.41 | 4.09 | 1.07 |
Deepseek | 3.82 | 3.45 | 1.10 |
This hint also raises the scores of `gemini-pro` significantly, without altering the other models’ scores too much.
In other words, prompt engineering can help the weaker models, but it can also harm them. The more advanced models don’t seem to react much to these hints – which, granted, are still fairly generic.
Parting words 🔗
With this, we finally wrap up the AI 2025 competition on this blog. Well, that’s likely not entirely true: I already have ideas for other things to test LLMs on, new models keep being released, and some of the ones tested here are no longer available. I just need to be faster with these competitions; half a year for a full summary is very long, especially in a field that moves this fast. So, … stay tuned.