Concluding the AI 2025 puzzle competition
Slightly more than half a year ago, I launched the AI puzzle competition: I selected 3 math problems and a common sense one and then looked at how OpenAI models, Gemini models, Claude models, Mistral models, Llama models, QwQ, and Deepseek perform on these problems. To get more data, I also tested the effect of various prompt hints. Now, it is time to summarize all of this into a single article.
We are interested in the average performance of each model, relative to its cost. We are also interested in finding out if any of the prompt hints helped significantly. Finally, it would be good to see if these statistics change when considering that the first 3 problems are focused on math, whereas the last one is just a common-sense question.
There are a lot of numbers introduced over the previous 10 articles. We have 17 different prompt scenarios and 28 different models. For the OpenAI models we also have to consider that we ran the prompts twice: once with the prompt hints given as part of the user prompt and once with them given as part of the system prompt – for the summarization performed in this article we will treat these the same, as if the model had been questioned twice. Rather than looking at these numbers manually, I vibe-coded a Colab notebook where I copy-pasted the tables from the markdown files – with a few changes – and then used Python code to extract the metrics we’re interested in and to build the plots in this article.
Parsing the data 🔗
The first thing we need to do is to take all the tables from the previous articles and convert them into pandas dataframes. In the colab, this is done by the following code:
import pandas as pd
import io
import itertools
# Function to transform the prompt hints
def transform_hints(hints):
    # Define the mapping for prompt hints to single letters
    hint_mapping = {
        'none': 'N',
        'skill': 'S',
        'urgency': 'U',
        'reward': 'R',
        'penalty': 'P',
        'gradual': 'G',
        'check': 'C',
    }
    if hints == 'none':
        return 'N'
    elif hints == 'all':
        return 'SURPGC'
    else:
        hint_list = hints.split(',')
        transformed_hints = ''.join([hint_mapping.get(hint.strip(), '') for hint in hint_list])
        return transformed_hints
def parse_first_three_problems(text_data):
    """
    Parses the tables from the first three problems from the blog data and transforms the prompt hints.

    Args:
        text_data (str): A string containing the data in a table format.

    Returns:
        pd.DataFrame: A pandas DataFrame containing the extracted and transformed data.
    """
    # Remove separator lines from the text data
    lines = text_data.strip().split('\n')
    cleaned_lines = [line for line in lines if '---' not in line and '===' not in line]
    cleaned_text_data = '\n'.join(cleaned_lines)

    # Read the cleaned text data into a pandas DataFrame
    # Use StringIO to treat the string as a file
    df = pd.read_csv(io.StringIO(cleaned_text_data), sep='|', header=0, skipinitialspace=True)

    # Drop the first and last columns which are empty due to the leading/trailing '|'
    df = df.iloc[:, 1:-1]

    # Remove leading/trailing whitespace from column names
    df.columns = df.columns.str.strip()

    # Process each column: strip whitespace and convert to expected type
    for col in df.columns:
        if col != 'Prompt hints':
            df[col] = df[col].astype(str).str.strip().astype(float)
        else:
            df[col] = df[col].astype(str).str.strip().apply(transform_hints)

    return df
def parse_last_problem(text_data):
    """
    Parses the table data with models as rows and prompt hints as columns.

    Args:
        text_data (str): A string containing the data in a table format.

    Returns:
        pd.DataFrame: A pandas DataFrame containing the extracted and transformed data.
    """
    # Remove separator lines from the text data
    lines = text_data.strip().split('\n')
    cleaned_lines = [line for line in lines if '---' not in line and '======' not in line]
    cleaned_text_data = '\n'.join(cleaned_lines)

    # Read the cleaned text data into a pandas DataFrame
    # Use StringIO to treat the string as a file
    df = pd.read_csv(io.StringIO(cleaned_text_data), sep='|', header=0, skipinitialspace=True)

    # Drop the first and last columns which are empty due to the leading/trailing '|'
    df = df.iloc[:, 1:-1]

    # Remove leading/trailing whitespace from column names
    df.columns = df.columns.str.strip()

    # Process each column: strip whitespace
    for col in df.columns:
        if col != 'Models':
            df[col] = df[col].astype(str).str.strip().astype(float)
        else:
            df[col] = df[col].astype(str).str.strip()

    return df
There is some duplication here which could be removed, but I found out that the AI gets confused when I ask it to do that. Plus, I want to publish this article today, so there’s not much time to fiddle with the output.
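For the record, the duplication amounts to just two parameters; a hypothetical deduplicated parser (this is only a sketch, not the code in the colab – `parse_problem_table`, `key_column`, and `hint_transform` are names I made up) would look like this:

# Hypothetical, deduplicated version (not the code used in the colab):
# the two parsers above differ only in the key column and in whether
# the prompt hints get transformed.
def parse_problem_table(text_data, key_column, hint_transform=None):
    lines = text_data.strip().split('\n')
    cleaned_lines = [line for line in lines if '---' not in line and '===' not in line]
    cleaned_text_data = '\n'.join(cleaned_lines)

    df = pd.read_csv(io.StringIO(cleaned_text_data), sep='|', header=0, skipinitialspace=True)
    df = df.iloc[:, 1:-1]
    df.columns = df.columns.str.strip()

    for col in df.columns:
        if col != key_column:
            df[col] = df[col].astype(str).str.strip().astype(float)
        elif hint_transform is not None:
            df[col] = df[col].astype(str).str.strip().apply(hint_transform)
        else:
            df[col] = df[col].astype(str).str.strip()
    return df

# parse_first_three_problems(t) would become parse_problem_table(t, 'Prompt hints', transform_hints)
# parse_last_problem(t) would become parse_problem_table(t, 'Models')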
To convert the text data into the dataframes, I used code such as
oai_p1 = parse_first_three_problems("""....""")
# copy paste the markdown table into the above
# similar dataframes for the other models and problems

oai_p4 = parse_last_problem("""....""")
# same for the last problem
# also, don't forget the system prompts for OpenAI
Then, I asked the AI to construct some lists with all dataframes for each problem:
p1_dfs = [oai_p1, oai_p1_system, google_p1, claude_p1, mistral_p1, llama_p1, other_p1]
p2_dfs = [oai_p2, oai_p2_system, google_p2, claude_p2, mistral_p2, llama_p2, other_p2]
p3_dfs = [oai_p3, oai_p3_system, google_p3, claude_p3, mistral_p3, llama_p3, other_p3]
p4_dfs = [oai_p4, oai_p4_system, google_p4, claude_p4, mistral_p4, llama_p4, other_p4]
all_dfs = p1_dfs + p2_dfs + p3_dfs + p4_dfs
Finally, I asked the AI to summarize how many models and how many prompt hints are in all these dataframes:
all_prompt_hints = set()

for df in itertools.chain(p1_dfs, p2_dfs, p3_dfs):
    all_prompt_hints.update(df['Prompt hints'].unique())

for df in p4_dfs:
    all_prompt_hints.update(df.columns.difference(['Models']))

all_prompt_hints = list(all_prompt_hints)
print(f"Got {len(all_prompt_hints)} prompt hints: {all_prompt_hints}")

# ---

all_models = set()

for df in itertools.chain(p1_dfs, p2_dfs, p3_dfs):
    all_models.update(df.columns.difference(['Prompt hints']))

for df in p4_dfs:
    all_models.update(df['Models'].unique())

all_models = list(all_models)
print(f"Got {len(all_models)} models: {all_models}")
The model did not want to use `itertools.chain`; it preferred using two separate loops. That’s a change I made manually. Similarly, the model does not seem to want to use f-strings or to combine multiple `print` statements into one, so I had to intervene manually.
In any case, I had to change the headers of some of the tables (since the models were named differently between the articles for the first three problems and the article for the last one). For the last problem, I also needed to split the single table into multiple ones, so that each problem would be treated the same way. Maybe in the future I could vibe-code something to extract the numbers directly from the previous blog articles.
At this point, we got 17 prompt hints and 28 models, which is exactly as expected. Now, we can move to generating the statistics.
Average model performances 🔗
To compute the average score for each model, I asked the AI to vibe-code the aggregation.
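A minimal sketch of that aggregation, assuming the per-problem dataframe lists built above (the `Problem` helper column and the math/other column names are illustrative; only `combined_df`, `final_average_scores`, and `Average Total Score` are reused below), looks like this:

# Sketch of the aggregation (not the exact vibe-coded cell): melt every
# per-problem table into one long dataframe with Model / Prompt hints / Score
# columns, then average per model.
frames = []
for problem, dfs in enumerate([p1_dfs, p2_dfs, p3_dfs], start=1):
    for df in dfs:
        long_df = df.melt(id_vars='Prompt hints', var_name='Model', value_name='Score')
        long_df['Problem'] = problem
        frames.append(long_df)
for df in p4_dfs:
    long_df = df.melt(id_vars='Models', var_name='Prompt hints', value_name='Score')
    long_df = long_df.rename(columns={'Models': 'Model'})
    long_df['Problem'] = 4
    frames.append(long_df)
combined_df = pd.concat(frames, ignore_index=True)

# Average over all problems, over the math problems only, and over the duck problem only
total = combined_df.groupby('Model')['Score'].mean().rename('Average Total Score')
math = combined_df[combined_df['Problem'] <= 3].groupby('Model')['Score'].mean().rename('Average Math Score')
other = combined_df[combined_df['Problem'] == 4].groupby('Model')['Score'].mean().rename('Average Other Score')
final_average_scores = pd.concat([total, math, other], axis=1).reset_index()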
Once again the model refused to use `itertools`. I also had to delete several lines where the model was checking assumptions that were already valid: it inserted a line to coerce errors to `nan` values and added multiple `if` statements to check that the dataframes contained the rows/columns used in the code. Nevertheless, the AI taught me about `df.melt`, so that’s something I learned.
After spot-checking the results for correctness, I copied the dataframe into the following table:
Model | Total Score | Math Score | Other Score |
---|---|---|---|
`gpt-3.5-turbo` | 2.63 | 2.72 | 2.06 |
`gpt-4` | 4.42 | 3.73 | 8.47 |
`gpt-4-turbo` | 4.87 | 3.99 | 10.00 |
`gpt-4o` | 5.96 | 5.26 | 10.00 |
`gpt-4o-mini` | 3.75 | 4.39 | 0.06 |
`o1-mini` | 4.23 | 4.97 | 0.00 |
`gemini-pro` | 1.53 | 1.80 | 0.00 |
`gemini-1.5-pro` | 6.26 | 5.61 | 9.89 |
`gemini-1.5-flash` | 3.93 | 4.62 | 0.00 |
`gemini-1.5-flash-8b` | 3.70 | 4.35 | 0.00 |
`gemini-2.0-flash` | 4.38 | 5.15 | 0.00 |
`gemini-2.0-flash-thinking` | 4.48 | 5.26 | 0.00 |
`claude-3-haiku` | 3.01 | 3.42 | 0.67 |
`claude-3-opus` | 3.35 | 3.94 | 0.00 |
`claude-3-sonnet` | 2.84 | 3.09 | 1.44 |
`claude-3.5-haiku` | 4.05 | 4.77 | 0.00 |
`claude-3.5-sonnet` | 4.91 | 5.05 | 4.11 |
`mistral-tiny` | 0.74 | 0.87 | 0.00 |
`mistral-small` | 2.77 | 3.26 | 0.00 |
`mistral-large` | 3.78 | 4.45 | 0.00 |
`ministral-3b` | 1.59 | 1.83 | 0.22 |
`ministral-8b` | 1.70 | 2.00 | 0.00 |
`llama-3.1-sonar-small` | 4.23 | 3.22 | 9.89 |
`llama-3.1-sonar-large` | 5.50 | 4.73 | 9.89 |
`llama-3.1-sonar-huge` | 5.82 | 5.10 | 9.89 |
`llama-3.3-70b-instruct` | 3.69 | 4.34 | 0.00 |
QwQ | 4.18 | 4.92 | 0.00 |
Deepseek | 3.56 | 3.64 | 3.11 |
Looking at this table, it seems `gemini-1.5-pro` performed the best, mostly due to a stellar performance on the math problems and an almost perfect one on the common-sense “duck” problem. Testing Gemini 2.5 – outside of the competition, since that model was released many months after the competition launched – produced even better scores.
Note that only GPT-4 models manage to score perfectly on the common sense problem. Maybe GPT-5 models, when released, will perform better?
On the other hand, `mistral-tiny` performs the worst, both scoring low on the math problems and completely failing the common-sense question.
Nevertheless, it is also important to look at the performance per dollar spent. Although costs from some providers are going down and some models are offered at different prices by different providers, it is still good to look at this distribution.
For this, I first build a dictionary with all the costs, as collected from the various provider dashboards:
costs = {
    # OpenAI provides costs for both input and output (total: 67.373$)
    "`gpt-3.5-turbo`": 0.005 + 0.315,
    ...
    # For Google models I only have access to the total spend, so we compute an average (total: 1.75$)
    "`gemini-pro`": 1.75 / 6,
    ...
    # Anthropic gives the total cost for each model, directly (total: 26.06$)
    "`claude-3-haiku`": 0.23,
    ...
    # Mistral shows cost per model, split by input and output (total: 7.30$)
    "`mistral-tiny`": 0.00471 + 0.77054,
    ...
    # Perplexity only shows the current pay period, so estimating cost (total: 12.55$)
    "`llama-3.1-sonar-small`": 12.55 / 4,
    ...
    # DeepInfra was used as a provider, provides input and output costs (total: 0.63$)
    "QwQ": 0.01 + 0.36,
    "Deepseek": 0.02 + 0.24,
}
Then, I compute the total number of times each model has been prompted:

num_prompts = combined_df.groupby('Model')['Score'].count()
Finally, I vibe-code a column with the average cost per prompt for each model, adding it to the previous dataframe:
average_costs = {}
for model, total_cost in costs.items():
    average_costs[model] = total_cost / num_prompts[model]

average_costs_df = pd.DataFrame(list(average_costs.items()), columns=['Model', 'Average Cost'])
final_average_scores = final_average_scores.merge(average_costs_df, on='Model', how='left')
display(final_average_scores)
The next step is to vibe-code the following plotting code:
import matplotlib.pyplot as plt
import numpy as np

# Sort the data by cost and then by score
df_sorted = final_average_scores.sort_values(by=['Average Cost', 'Average Total Score'])

# Identify points on the Pareto frontier
# A point is on the Pareto frontier if there is no other point
# that is better in both cost and score.
frontier_points = []
max_score = -np.inf
for index, row in df_sorted.iterrows():
    if row['Average Total Score'] > max_score:
        frontier_points.append(row)
        max_score = row['Average Total Score']

frontier_df = pd.DataFrame(frontier_points)

# Create the scatter plot
plt.figure(figsize=(12, 8))
plt.scatter(final_average_scores['Average Cost'], final_average_scores['Average Total Score'], label='All Models')
plt.scatter(frontier_df['Average Cost'], frontier_df['Average Total Score'], color='red', label='Pareto Frontier')

# Add labels for each model
for i, row in final_average_scores.iterrows():
    plt.text(row['Average Cost'], row['Average Total Score'], row['Model'], fontsize=9, ha='left', va='bottom')

plt.xlabel('Cost per prompt')
plt.ylabel('Average Model Score')
plt.title('Average Model Score vs. Average Model Cost')
plt.legend()
plt.grid(True)
plt.show()
This generates the following plot, comparing all the models:

This needs zooming in, so let’s vibe-code that. Without any cleanup, this is what the model produces:
import matplotlib.pyplot as plt
import numpy as np

# Sort the data by cost and then by score
df_sorted = final_average_scores.sort_values(by=['Average Cost', 'Average Total Score'])

# Identify points on the Pareto frontier
# A point is on the Pareto frontier if there is no other point
# that is better in both cost and score.
frontier_points = []
max_score = -np.inf
for index, row in df_sorted.iterrows():
    if row['Average Total Score'] > max_score:
        frontier_points.append(row)
        max_score = row['Average Total Score']

frontier_df = pd.DataFrame(frontier_points)

# Create the scatter plot
plt.figure(figsize=(12, 8))
plt.scatter(final_average_scores['Average Cost'], final_average_scores['Average Total Score'], label='All Models')
plt.scatter(frontier_df['Average Cost'], frontier_df['Average Total Score'], color='red', label='Pareto Frontier')

# Add labels for each model
for i, row in final_average_scores.iterrows():
    plt.text(row['Average Cost'], row['Average Total Score'], row['Model'], fontsize=9, ha='left', va='bottom')

plt.xlabel('Cost per prompt')
plt.ylabel('Average Model Score')
plt.title('Average Model Score vs. Average Model Cost')
plt.legend()
plt.grid(True)
plt.xlim(0.00, 0.01)  # Set x-axis limits
plt.show()
It does the same imports again, computes the same frontier again, and repeats the entire script, all for the benefit of adding one `plt.xlim` line near the end. In fact, I also needed to ask the model to trim the image, as initially it was generating labels for the points outside of the zoomed-in area. In any case, the fixed zoomed image is:

Then, I vibe-coded the following to display the models on the Pareto frontier:
# Display the models on the Pareto frontier
print("Models on the Pareto Frontier:")
display(frontier_df['Model'])
These are:

- `gpt-3.5-turbo`
- `claude-3-haiku`
- Deepseek
- `gemini-1.5-flash-8b`
- `gemini-1.5-flash`
- `gemini-2.0-flash`
- `gemini-2.0-flash-thinking`
- `gemini-1.5-pro`
Of course, the Gemini models are all showing up here because they are all assigned the same average cost. I should probably find out the cost of each individual model and update the article. Fortunately, that is as simple as updating a cell in the colab, running it again, saving the images, and copying them to the blog system.
Effect of prompt hints 🔗
Besides determining which model is “better”, the other goal of this competition was to determine whether prompt engineering really has an effect. Since the prompt hints are additive, we can compute the effect of each of them as the following ratio:
\[ Effect(P) = \frac{Score(P)}{Score(\neg P)} \]
where \(Score(P)\) is the average score of the model when the prompt hint \(P\) is included and \(Score(\neg P)\) is the average score when it is not included. This can be vibe-coded as:
hint_effects = {}

# Iterate over each individual hint
for hint in ['S', 'U', 'R', 'P', 'G', 'C']:
    hint_effects[hint] = {}

    # Filter dataframes where the prompt contains the current hint
    df_with_hint = combined_df[combined_df['Prompt hints'].str.contains(hint, na=False)]
    # Filter dataframes where the prompt does NOT contain the current hint
    df_without_hint = combined_df[~combined_df['Prompt hints'].str.contains(hint, na=False)]

    # Compute average scores for models with the hint
    avg_scores_with_hint = df_with_hint.groupby('Model')['Score'].mean().reset_index()
    avg_scores_with_hint = avg_scores_with_hint.rename(columns={'Score': f'Average Score with {hint}'})

    # Compute average scores for models without the hint
    avg_scores_without_hint = df_without_hint.groupby('Model')['Score'].mean().reset_index()
    avg_scores_without_hint = avg_scores_without_hint.rename(columns={'Score': f'Average Score without {hint}'})

    # Merge the two dataframes
    merged_scores = avg_scores_with_hint.merge(avg_scores_without_hint, on='Model', how='outer')

    # Calculate the ratio in average scores
    merged_scores[f'Effect of {hint}'] = merged_scores[f'Average Score with {hint}'] / merged_scores[f'Average Score without {hint}']

    hint_effects[hint] = merged_scores

# Display the results for each hint
for hint, df in hint_effects.items():
    print(f"\nEffect of hint '{hint}':")
    display(df)
For the `You are a famous mathematician, an expert in number theory` prompt, we have the following:
Model | With hint | Without hint | Ratio |
---|---|---|---|
`gpt-3.5-turbo` | 2.36 | 2.81 | 0.84 |
`gpt-4` | 3.39 | 5.15 | 0.66 |
`gpt-4-turbo` | 3.84 | 5.60 | 0.68 |
`gpt-4o` | 5.19 | 6.50 | 0.80 |
`gpt-4o-mini` | 4.46 | 3.25 | 1.37 |
`o1-mini` | 4.96 | 3.74 | 1.33 |
`gemini-pro` | 2.48 | 0.89 | 2.77 |
`gemini-1.5-pro` | 5.63 | 6.67 | 0.84 |
`gemini-1.5-flash` | 4.58 | 3.49 | 1.31 |
`gemini-1.5-flash-8b` | 4.15 | 3.40 | 1.22 |
`gemini-2.0-flash` | 4.99 | 3.97 | 1.26 |
`gemini-2.0-flash-thinking` | 5.33 | 3.91 | 1.36 |
`claude-3-haiku` | 3.60 | 2.61 | 1.38 |
`claude-3-opus` | 3.42 | 3.29 | 1.04 |
`claude-3-sonnet` | 3.10 | 2.68 | 1.16 |
`claude-3.5-haiku` | 4.63 | 3.67 | 1.26 |
`claude-3.5-sonnet` | 5.05 | 4.82 | 1.05 |
`mistral-tiny` | 1.24 | 0.41 | 3.04 |
`mistral-small` | 3.04 | 2.59 | 1.17 |
`mistral-large` | 4.41 | 3.37 | 1.31 |
`ministral-3b` | 1.56 | 1.60 | 0.97 |
`ministral-8b` | 1.81 | 1.62 | 1.11 |
`llama-3.1-sonar-small` | 3.21 | 4.91 | 0.65 |
`llama-3.1-sonar-large` | 4.60 | 6.11 | 0.75 |
`llama-3.1-sonar-huge` | 5.13 | 6.27 | 0.82 |
`llama-3.3-70b-instruct` | 4.35 | 3.24 | 1.34 |
QwQ | 4.90 | 3.70 | 1.32 |
Deepseek | 3.78 | 3.42 | 1.11 |
That is, telling the model that it has the skill needed to solve the problem results in a significant performance increase for `mistral-tiny` and `gemini-pro`. In total, 19 out of the 28 models seem to benefit from this hint.
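Counts like this can be read directly off the `hint_effects` dict built earlier; a quick sketch:

# Count, for each hint, how many models have a ratio above 1
for hint, df in hint_effects.items():
    improved = (df[f'Effect of {hint}'] > 1).sum()
    print(f"Hint {hint}: {improved} out of {len(df)} models benefit")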
For the `Solving this problem is important` prompt, we have the following:
Model | With hint | Without hint | Ratio |
---|---|---|---|
`gpt-3.5-turbo` | 3.07 | 2.39 | 1.28 |
`gpt-4` | 4.15 | 4.56 | 0.91 |
`gpt-4-turbo` | 4.43 | 5.11 | 0.87 |
`gpt-4o` | 5.65 | 6.12 | 0.92 |
`gpt-4o-mini` | 3.98 | 3.63 | 1.10 |
`o1-mini` | 4.44 | 4.12 | 1.08 |
`gemini-pro` | 1.22 | 1.68 | 0.72 |
`gemini-1.5-pro` | 5.95 | 6.41 | 0.93 |
`gemini-1.5-flash` | 4.06 | 3.86 | 1.05 |
`gemini-1.5-flash-8b` | 3.89 | 3.60 | 1.08 |
`gemini-2.0-flash` | 4.69 | 4.22 | 1.11 |
`gemini-2.0-flash-thinking` | 4.81 | 4.31 | 1.12 |
`claude-3-haiku` | 3.15 | 2.94 | 1.07 |
`claude-3-opus` | 3.29 | 3.38 | 0.97 |
`claude-3-sonnet` | 2.54 | 3.00 | 0.85 |
`claude-3.5-haiku` | 4.10 | 4.03 | 1.02 |
`claude-3.5-sonnet` | 4.54 | 5.10 | 0.89 |
`mistral-tiny` | 0.29 | 0.97 | 0.30 |
`mistral-small` | 2.78 | 2.77 | 1.00 |
`mistral-large` | 4.00 | 3.68 | 1.09 |
`ministral-3b` | 1.33 | 1.72 | 0.78 |
`ministral-8b` | 2.00 | 1.55 | 1.29 |
`llama-3.1-sonar-small` | 3.96 | 4.36 | 0.91 |
`llama-3.1-sonar-large` | 5.40 | 5.56 | 0.97 |
`llama-3.1-sonar-huge` | 5.64 | 5.91 | 0.96 |
`llama-3.3-70b-instruct` | 4.07 | 3.50 | 1.16 |
QwQ | 4.52 | 4.02 | 1.13 |
Deepseek | 3.72 | 3.48 | 1.07 |
Here, model performance with and without the hint is mostly similar: there is no significant increase in performance, nor a significant degradation (except for `mistral-tiny`).
For the `You will get rewarded if you get the right answer` prompt, we have the following:
Model | With hint | Without hint | Ratio |
---|---|---|---|
`gpt-3.5-turbo` | 2.61 | 2.63 | 0.99 |
`gpt-4` | 4.15 | 4.58 | 0.91 |
`gpt-4-turbo` | 4.90 | 4.86 | 1.00 |
`gpt-4o` | 5.90 | 5.99 | 0.99 |
`gpt-4o-mini` | 3.76 | 3.75 | 1.00 |
`o1-mini` | 4.29 | 4.19 | 1.02 |
`gemini-pro` | 0.70 | 1.97 | 0.36 |
`gemini-1.5-pro` | 6.13 | 6.32 | 0.97 |
`gemini-1.5-flash` | 3.89 | 3.95 | 0.98 |
`gemini-1.5-flash-8b` | 3.88 | 3.60 | 1.08 |
`gemini-2.0-flash` | 4.44 | 4.34 | 1.02 |
`gemini-2.0-flash-thinking` | 4.58 | 4.42 | 1.02 |
`claude-3-haiku` | 3.16 | 2.93 | 1.08 |
`claude-3-opus` | 3.32 | 3.36 | 0.99 |
`claude-3-sonnet` | 2.65 | 2.95 | 0.90 |
`claude-3.5-haiku` | 4.10 | 4.02 | 1.02 |
`claude-3.5-sonnet` | 4.79 | 4.98 | 0.96 |
`mistral-tiny` | 0.86 | 0.68 | 1.27 |
`mistral-small` | 3.12 | 2.59 | 1.21 |
`mistral-large` | 3.85 | 3.75 | 1.03 |
`ministral-3b` | 1.53 | 1.62 | 0.95 |
`ministral-8b` | 2.07 | 1.50 | 1.38 |
`llama-3.1-sonar-small` | 4.08 | 4.31 | 0.95 |
`llama-3.1-sonar-large` | 5.44 | 5.54 | 0.98 |
`llama-3.1-sonar-huge` | 5.86 | 5.79 | 1.01 |
`llama-3.3-70b-instruct` | 3.80 | 3.63 | 1.05 |
QwQ | 4.28 | 4.13 | 1.04 |
Deepseek | 3.32 | 3.69 | 0.90 |
This is similar to the urgency prompt above: not many differences between including the hint and leaving it out.
For the `You will get punished if you get the wrong answer` prompt, we have the following:
Model | With hint | Without hint | Ratio |
---|---|---|---|
`gpt-3.5-turbo` | 2.79 | 2.55 | 1.09 |
`gpt-4` | 4.38 | 4.44 | 0.99 |
`gpt-4-turbo` | 5.04 | 4.80 | 1.05 |
`gpt-4o` | 6.00 | 5.94 | 1.01 |
`gpt-4o-mini` | 3.64 | 3.80 | 0.96 |
`o1-mini` | 4.19 | 4.24 | 0.99 |
`gemini-pro` | 0.72 | 1.87 | 0.38 |
`gemini-1.5-pro` | 6.30 | 6.24 | 1.01 |
`gemini-1.5-flash` | 3.74 | 4.01 | 0.93 |
`gemini-1.5-flash-8b` | 3.77 | 3.66 | 1.03 |
`gemini-2.0-flash` | 4.38 | 4.37 | 1.00 |
`gemini-2.0-flash-thinking` | 4.41 | 4.50 | 0.98 |
`claude-3-haiku` | 2.46 | 3.24 | 0.76 |
`claude-3-opus` | 3.18 | 3.42 | 0.93 |
`claude-3-sonnet` | 2.63 | 2.93 | 0.90 |
`claude-3.5-haiku` | 3.69 | 4.20 | 0.88 |
`claude-3.5-sonnet` | 4.28 | 5.18 | 0.83 |
`mistral-tiny` | 0.67 | 0.77 | 0.86 |
`mistral-small` | 3.03 | 2.66 | 1.14 |
`mistral-large` | 3.63 | 3.85 | 0.94 |
`ministral-3b` | 1.77 | 1.51 | 1.18 |
`ministral-8b` | 2.02 | 1.56 | 1.29 |
`llama-3.1-sonar-small` | 4.12 | 4.28 | 0.96 |
`llama-3.1-sonar-large` | 5.78 | 5.38 | 1.07 |
`llama-3.1-sonar-huge` | 6.02 | 5.73 | 1.05 |
`llama-3.3-70b-instruct` | 3.68 | 3.69 | 1.00 |
QwQ | 4.08 | 4.23 | 0.96 |
Deepseek | 3.49 | 3.59 | 0.97 |
Here, too, we see the same behavior: the hint has a negative effect on `gemini-pro`, while the other models are left relatively unchanged.
For the `Think step by step` prompt, we have the following:
Model | With hint | Without hint | Ratio |
---|---|---|---|
`gpt-3.5-turbo` | 2.37 | 2.73 | 0.86 |
`gpt-4` | 4.56 | 4.36 | 1.04 |
`gpt-4-turbo` | 4.65 | 4.96 | 0.93 |
`gpt-4o` | 5.85 | 6.00 | 0.97 |
`gpt-4o-mini` | 3.99 | 3.65 | 1.09 |
`o1-mini` | 4.57 | 4.09 | 1.11 |
`gemini-pro` | 3.29 | 0.82 | 3.97 |
`gemini-1.5-pro` | 6.36 | 6.21 | 1.02 |
`gemini-1.5-flash` | 4.06 | 3.87 | 1.04 |
`gemini-1.5-flash-8b` | 3.62 | 3.72 | 0.97 |
`gemini-2.0-flash` | 4.49 | 4.33 | 1.03 |
`gemini-2.0-flash-thinking` | 4.89 | 4.31 | 1.13 |
`claude-3-haiku` | 3.42 | 2.84 | 1.20 |
`claude-3-opus` | 3.56 | 3.25 | 1.09 |
`claude-3-sonnet` | 3.08 | 2.74 | 1.12 |
`claude-3.5-haiku` | 4.21 | 3.99 | 1.05 |
`claude-3.5-sonnet` | 4.75 | 4.97 | 0.95 |
`mistral-tiny` | 1.64 | 0.38 | 4.21 |
`mistral-small` | 2.82 | 2.75 | 1.02 |
`mistral-large` | 3.89 | 3.74 | 1.03 |
`ministral-3b` | 2.05 | 1.40 | 1.45 |
`ministral-8b` | 1.37 | 1.82 | 0.75 |
`llama-3.1-sonar-small` | 4.39 | 4.16 | 1.05 |
`llama-3.1-sonar-large` | 5.26 | 5.59 | 0.94 |
`llama-3.1-sonar-huge` | 5.74 | 5.84 | 0.98 |
`llama-3.3-70b-instruct` | 3.67 | 3.69 | 0.99 |
QwQ | 4.32 | 4.13 | 1.04 |
Deepseek | 3.79 | 3.47 | 1.09 |
Here, `gemini-pro` and `mistral-tiny` benefit significantly from the hint.
Finally, for the `Double check your answers` prompt, we have the following:
Model | With hint | Without hint | Ratio |
---|---|---|---|
`gpt-3.5-turbo` | 2.32 | 2.74 | 0.84 |
`gpt-4` | 4.37 | 4.44 | 0.98 |
`gpt-4-turbo` | 4.77 | 4.91 | 0.97 |
`gpt-4o` | 5.72 | 6.05 | 0.94 |
`gpt-4o-mini` | 3.93 | 3.67 | 1.06 |
`o1-mini` | 4.47 | 4.13 | 1.08 |
`gemini-pro` | 3.11 | 0.90 | 3.45 |
`gemini-1.5-pro` | 6.24 | 6.26 | 0.99 |
`gemini-1.5-flash` | 4.08 | 3.86 | 1.05 |
`gemini-1.5-flash-8b` | 3.59 | 3.73 | 0.96 |
`gemini-2.0-flash` | 4.57 | 4.29 | 1.06 |
`gemini-2.0-flash-thinking` | 4.86 | 4.32 | 1.12 |
`claude-3-haiku` | 3.32 | 2.88 | 1.15 |
`claude-3-opus` | 3.64 | 3.22 | 1.13 |
`claude-3-sonnet` | 2.98 | 2.78 | 1.07 |
`claude-3.5-haiku` | 4.23 | 3.97 | 1.06 |
`claude-3.5-sonnet` | 5.20 | 4.79 | 1.08 |
`mistral-tiny` | 1.11 | 0.59 | 1.88 |
`mistral-small` | 2.65 | 2.81 | 0.94 |
`mistral-large` | 3.75 | 3.79 | 0.99 |
`ministral-3b` | 0.97 | 1.83 | 0.53 |
`ministral-8b` | 1.95 | 1.59 | 1.22 |
`llama-3.1-sonar-small` | 4.08 | 4.28 | 0.95 |
`llama-3.1-sonar-large` | 5.11 | 5.65 | 0.90 |
`llama-3.1-sonar-huge` | 5.73 | 5.84 | 0.98 |
`llama-3.3-70b-instruct` | 3.71 | 3.67 | 1.01 |
QwQ | 4.41 | 4.09 | 1.07 |
Deepseek | 3.82 | 3.45 | 1.10 |
This hint also raises the scores of `gemini-pro` significantly, without altering the other models’ scores too much.
In other words, prompt engineering can help the weaker models, but it can also harm them. The more advanced models don’t seem to react much to these hints – which, granted, are still fairly generic.
Parting words 🔗
With this, we finally wrap up the AI 2025 competition on this blog. Well, that’s likely not entirely true: I already have ideas for other things to test LLMs on, new models keep being released, and some of the ones tested here are no longer available. I just need to be faster with these competitions; half a year for a full summary is very long, especially in a field that moves this fast. So, … stay tuned.