
Introducing the 2025 AI puzzle competition

At the end of last year I had a significant amount of OpenAI API credits that were set to expire at the end of the month. I had bought them when I experimented with creating fuzzing harnesses via LLMs as part of the OSS-Fuzz project, when I helped the project use OpenAI models and test their performance. Rather than let the credits expire, I decided to create new puzzles and test the LLMs on them, just like last time.

The main difference is that this time I wanted to use the API instead of the chat interface (since the chat interface would not consume the API credits), and I wanted to compare performance across a variety of models. From there, I also thought of expanding to other LLM providers, so I considered testing Gemini models, and so on.

To make this a learning experience, I decided to compare all these LLMs on multiple puzzles and test how various prompts can change the model performance.

Learning from the previous article [1], I decided that, rather than creating a single large article about all these experiments, I would split the report across multiple blog posts and publish them over the next few days. This also gives you an opportunity to suggest other models to test, or other prompts or puzzles.

For this introductory article, I only want to present the rules of the competition and the infrastructure that I set up to run it. I will not include any of the puzzle questions or prompt engineering snippets in this article, nor will I hint at model performance results. That is for subsequent articles.

Note: While writing this post, I realized that some of the scaffolding could be written in a better way. I will present it as it was for this year. If I decide to repeat this experiment (or a similar one) in future years, I will update the infrastructure to the better alternative.

General scoring rules 🔗

I will send the same prompts to every model under test. These prompts are all formed from main questions (the puzzles that I want the model to solve) and, optionally, some prompt engineering hints. Every model will receive the same combinations, in the same order.

Every model will provide 10 (hopefully different) completions to the prompt. Each answer will be scored 1 point if it is complete and correct.

To penalize models that hallucinate, every mistake in the answer will be penalized by 0.1 points. Thus, it is possible to have models that score negatively.
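
To make the scoring rule concrete, here is a minimal sketch; the helper and its name are mine, not part of the competition scaffolding:

# Hypothetical helper, only to illustrate the scoring rule above: each complete
# and correct completion is worth 1 point, and every mistake across the
# completions costs 0.1 points.
def score_model(correct_completions: int, mistakes: int) -> float:
    return correct_completions * 1.0 - mistakes * 0.1


# For example, 6 correct completions with 8 mistakes spread over the other
# 4 completions score 6 - 0.8 = 5.2 points.
print(score_model(6, 8))  # 5.2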

These scores will be presented in future articles, together with relevant examples of what the models produce. In this article, I’ll focus on just the scaffolding needed to run the competition.

Setup for questions and prompt engineering hints 🔗

To make sure I use the exact same prompts for all models, I created two files. First, I created a file named questions to list the base questions for each of the puzzles. Each puzzle can be text on a single line, or text on multiple lines, as in this example (not the real questions):

What is 2 + 2?
-------
What is the best move to win this tic-tac-toe game?
O |   | O
--+---+---
X | X |
--+---+---
  |   | X

Then, I created a file named prompts with the hints to use for prompt engineering:

You are a helpful bot.
You are an expert at playing games.
You are an expert mathematician.

The idea is that I would then write code to combine prompt hints and questions to generate various prompts to send to the LLM. For example:

You are a helpful bot.
You are an expert mathematician.
What is 2 + 2?

I decided to format all prompts in the same way: first, each of the prompt engineering hints on its own line, and then the question, exactly as written in the questions file. I would use code to create these combinations.

Since not all prompt engineering hints are relevant for every question, the infrastructure offers a way to filter out the invalid combinations.

Note: As I’m writing this, I realize that I could have separated each question into its own file and then used some configuration (in a separate file, or in the same file) to configure the prompts. But this improvement is left as an exercise to the reader. I might do it if I decide to run this competition again next year.

This setup makes it very easy to create the infrastructure to compose the prompts. First, some imports and some constants to describe questions and prompt hints:

from collections.abc import Generator, Iterable
import pathlib
from typing import Literal


QUESTION_FILE: str = "questions"
PROMPT_FILE: str = "prompts"
# example values here
PROBLEMS: list[list[int]] = [[0], [2, 3, 4, 5, 6, 7]]
PROMPTS: list[list[int]] = [[], [0], [1], [2], [0, 2]]

The [] in PROMPTS means that we don’t apply any prompt engineering hint in that scenario. An entry with multiple indexes means that those lines are combined, just like in the example mentioned above: [0, 2] in PROMPTS combines the first and third hint, and [2, 3, 4, 5, 6, 7] in PROBLEMS covers the lines of the multi-line tic-tac-toe question.
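
To make the mapping concrete, here is how the example index lists above refer to the two example files (this is only an illustration, not part of the scaffolding):

# The indexes are zero-based line numbers into the two files above.
#
# PROBLEMS = [[0], [2, 3, 4, 5, 6, 7]]
#   [0]                -> "What is 2 + 2?"          (line 0 of questions)
#   [2, 3, 4, 5, 6, 7] -> the six lines of the tic-tac-toe question
#
# PROMPTS = [[], [0], [1], [2], [0, 2]]
#   []     -> no prompt engineering hint at all
#   [0, 2] -> "You are a helpful bot." + "You are an expert mathematician."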

Next, we have some helpers to read the question and the prompt hints from the given list of lines that compose them:

def read_indexed_lines_from_file(file: str, indexes: Iterable[int]) -> str:
    with open(file, "r") as f:
        filelines = f.readlines()
        lines = [filelines[index].rstrip() for index in indexes]
    return "\n".join(lines)


def read_main_question(indexes: Iterable[int]) -> str:
    return read_indexed_lines_from_file(QUESTION_FILE, indexes)


def read_prompt_header(indexes: Iterable[int]) -> str:
    return read_indexed_lines_from_file(PROMPT_FILE, indexes)

There are some assumptions here about the indexes, and the files might be opened and read more times than needed. This is another reason to change this infrastructure if I repeat these experiments.
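
One possible improvement, sketched here only as an illustration (the _read_file_lines helper is my own name, not part of the scaffolding), would be to cache the file contents so each file is read at most once:

import functools
from collections.abc import Iterable


@functools.lru_cache(maxsize=None)
def _read_file_lines(file: str) -> tuple[str, ...]:
    # Read and strip the lines of the file once; later calls hit the cache.
    with open(file, "r") as f:
        return tuple(line.rstrip() for line in f)


def read_indexed_lines_from_file(file: str, indexes: Iterable[int]) -> str:
    lines = _read_file_lines(file)
    return "\n".join(lines[index] for index in indexes)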

With the above helpers, it is easy to compose the prompt for a given puzzle to send to the LLM. There is one caveat, though: OpenAI models take prompts structured as a list of JSON documents, each describing a role and a message, while Gemini models take the whole prompt as a single string. So, we need to support both cases. We create two functions for these:

def compose_problem_string(
    question: Iterable[int],
    prompt: Iterable[int],
) -> str:
    header = read_prompt_header(prompt)
    ask = read_main_question(question)

    if header: return "\n".join([header, ask])
    return ask


def compose_problem_dict(
    question: Iterable[int],
    prompt: Iterable[int],
    prompt_role: Literal["developer", "user"] = "developer",
) -> list[dict[str, str]]:
    if prompt_role == "user":
        messages = [{
            "role": prompt_role,
            "content": compose_problem_string(question, prompt),
        }]
    else:
        header = read_prompt_header(prompt)
        ask = read_main_question(question)
        messages = [{"role": prompt_role, "content": header}] if header else []
        messages.append({"role": "user", "content": ask})

    return messages

While compose_problem_string is simple, composing the question as a dictionary needs slightly more discussion. Since I also want to test the impact of system prompts, I allow passing in a value for prompt_role. By default, it is "developer" (formerly known as “system”). Some OpenAI models only support user mode; in that case, we concatenate both the prompt engineering hints and the question into a single JSON message.
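
As an illustration, assuming the example questions and prompts files from above, the two modes produce message lists along these lines:

compose_problem_dict(question=[0], prompt=[0, 2])
# [{'role': 'developer',
#   'content': 'You are a helpful bot.\nYou are an expert mathematician.'},
#  {'role': 'user', 'content': 'What is 2 + 2?'}]

compose_problem_dict(question=[0], prompt=[0, 2], prompt_role="user")
# [{'role': 'user',
#   'content': 'You are a helpful bot.\nYou are an expert mathematician.\nWhat is 2 + 2?'}]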

Finally, I create generators that yield every combination of engineered prompts for every puzzle:

def build_valid_pairs(
    problems: Iterable[Iterable[int]] = [],
    prompts: Iterable[Iterable[int]] = [],
) -> Generator[tuple[Iterable[int], Iterable[int]]]:
    for pb in problems:
        for pr in prompts:
            if valid(pb, pr):  # Note: not presented in article
                yield (pb, pr)


def build_prompt_strings(
    problems: Iterable[Iterable[int]] = [],
    prompts: Iterable[Iterable[int]] = [],
) -> Generator[str]:
    if not problems: problems = PROBLEMS
    if not prompts: prompts = PROMPTS
    for problem, prompt in build_valid_pairs(problems, prompts):
        yield compose_problem_string(problem, prompt)


def build_prompt_dicts(
    problems: Iterable[Iterable[int]] = [],
    prompts: Iterable[Iterable[int]] = [],
    prompt_role: Literal["developer", "user"] = "developer",
) -> Generator[list[dict[str, str]]]:
    if not problems: problems = PROBLEMS
    if not prompts: prompts = PROMPTS
    for problem, prompt in build_valid_pairs(problems, prompts):
        yield compose_problem_dict(problem, prompt, prompt_role)

Since not all prompt engineering hints apply to every problem, I use build_valid_pairs to filter the invalid combinations.
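
The valid function itself is not shown in this article; purely as an illustration, one possible (hypothetical) implementation is an allow-list of prompt hint lines per problem:

from collections.abc import Iterable

# Hypothetical allow-list, keyed by the first line index of each problem:
# the arithmetic question pairs with hints 0 and 2, the tic-tac-toe question
# with hints 0 and 1. The actual valid() used in the competition may differ.
ALLOWED_HINTS: dict[int, set[int]] = {
    0: {0, 2},
    2: {0, 1},
}


def valid(problem: Iterable[int], prompt: Iterable[int]) -> bool:
    allowed = ALLOWED_HINTS.get(next(iter(problem)), set())
    # An empty prompt (no hints at all) is always considered valid.
    return all(hint in allowed for hint in prompt)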

The build_prompt_* functions allow me to select only a subset of problems and prompts, or to use all of the defined ones. This way, I can run new puzzles by just changing some parameters in the runner script.
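
For example, a run restricted to just the arithmetic question, once with no hint and once with the mathematician hint, could look like this (a hypothetical call, using the index lists defined earlier):

for prompt in build_prompt_strings(problems=[[0]], prompts=[[], [2]]):
    print(prompt)
    print("====")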

It would be better to control all of these via flags, but that’s for a future iteration of this competition. For now, let’s look at how we can call the OpenAI LLMs.

Running OpenAI models 🔗

I decided to test 6 different OpenAI models:

import openai
import time

import prompt_api  # The module defined in the previous section


MODELS = [
    "gpt-3.5-turbo",  # gpt-3.5-turbo-0125
    "gpt-4",  # gpt-4-0613
    "gpt-4-turbo",  # gpt-4-turbo-2024-04-09
    "gpt-4o",  # gpt-4o-2024-08-06
    "gpt-4o-mini",  # gpt-4o-mini-2024-07-18
    "o1-mini",  # o1-mini-2024-09-12
]

The comments next to each entry show which concrete model each alias resolved to when I ran the OpenAI models for this competition.

Note that I am not testing the o1 model itself, because it is only available from paid tier 5 and above, while I was only at tier 3 when the experiments started.

To list which models I can use, I have this simple helper:

def debug_print_models(client):
    print("Available models:")
    for m in client.models.list():
        print(m)

And, to send a prompt to a model and collect all the answers, I use:

def get_answer(client, model, prompt, n=10):
    print(f">>> {model}, {prompt}")
    try:
        answers = client.chat.completions.create(
            messages=prompt,
            model=model,
            n=n,
        )
        for answer in answers.choices:
            print(answer.message.content)
            print("----")
    except openai.RateLimitError as e:
        print(e)
        time.sleep(1)
        get_answer(client, model, prompt, n)

I want to score each model on 10 different results, but some models don’t support generating more than 5 completions per request, so I use the n argument to control that.

To be a good citizen, I handle rate limit errors by sleeping for 1 second and then retrying the question. Here, I do this by recursively calling the same function. A good improvement would be to use a loop instead, and an even better one would be to implement exponential back-off. Both are left as exercises to the reader, though the sketch below shows one possible shape.
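
A sketch of what that improvement could look like (this is not the code I actually ran; the runner for the other models, below, uses a similar approach):

def get_answer_with_backoff(client, model, prompt, n=10, max_wait=60):
    # Hypothetical variant of get_answer: retry in a loop with exponential
    # back-off instead of recursing, capping the wait at max_wait seconds.
    print(f">>> {model}, {prompt}")
    wait_time = 1
    while True:
        try:
            answers = client.chat.completions.create(
                messages=prompt,
                model=model,
                n=n,
            )
            break
        except openai.RateLimitError as e:
            print(e, wait_time)
            time.sleep(wait_time)
            wait_time = min(wait_time * 2, max_wait)
    for answer in answers.choices:
        print(answer.message.content)
        print("----")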

Running the script is easy:

client = openai.OpenAI()
for model in MODELS:
    print(f"\n=========== {model} ===========")
    # o1-mini does not support developer/system prompts, so fall back to user mode.
    role = "user" if model == "o1-mini" else "developer"
    for prompt in prompt_api.build_prompt_dicts(prompt_role=role):
        # Two batches of 5 completions each, for 10 answers in total.
        get_answer(client, model, prompt, n=5)
        get_answer(client, model, prompt, n=5)

This part can also be changed to run all models in user mode, or to filter models, questions, and prompt hints. I could have done all of this via flags, but that felt like too much work for something I expected to be a single script run only once.
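
As an example of such a change (purely hypothetical, not what I ran), the loop below keeps only the gpt-4o models, forces user mode, and runs just the tic-tac-toe puzzle with the game-expert hint:

client = openai.OpenAI()
for model in MODELS:
    if not model.startswith("gpt-4o"):
        continue  # filter out the other models
    print(f"\n=========== {model} ===========")
    for prompt in prompt_api.build_prompt_dicts(
        problems=[[2, 3, 4, 5, 6, 7]],
        prompts=[[1]],
        prompt_role="user",
    ):
        get_answer(client, model, prompt, n=5)
        get_answer(client, model, prompt, n=5)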

Running other models 🔗

While the OpenAI models were being tested, I decided to also run the same experiments on other models. Rather than creating runner scripts for each one of them, I decided to use the really helpful llm library from Simon Willison.

The library is available on NixOS, but not at the latest version, and the packaged version does not support listing the available models. Moreover, no plugin packages are available. Hence, I first needed to create a proper nix shell that brings in all the needed packages. I use something like this:

let
  pkgs = import <nixpkgs> {};

  python = pkgs.python3.override {
    self = python;
    packageOverrides = pyfinal: pyprev: {
      llm = pyfinal.callPackage ./llm.nix {};
      llm-gemini = pyfinal.callPackage ./llm-gemini.nix {};
      # ...
    };
  };
in

pkgs.mkShell {
  packages = with pkgs; [
    (python.withPackages (pypkgs: with pypkgs; [
      llm
      llm-gemini
      # ...
    ]))
  ];

  shellHook = ''
    export PS1="[\[\033[01;32m\]nix-shell\[\033[00m\]:\W] \[\033[01;32m\]λ\[\033[00m\] "
  '';
}

Where llm.nix is the setup for the latest release of llm:

{
  buildPythonPackage,
  fetchPypi,
  setuptools,
  pluggy,
  click,
  openai,
  click-default-group,
  sqlite-utils,
  sqlite-migrate,
  pydantic,
  pyyaml,
  python-ulid,
  pip,
  puremagic,
}:

buildPythonPackage rec {
  pname = "llm";
  version = "0.19.1";
  pyproject = true;

  src = fetchPypi {
    inherit pname version;
    sha256 = "sha256-ZPDJUA7Can3mGjoHsfDxzdMzp1PEp6uneRtO081UEX8=";
  };

  doCheck = false;

  pythonImportsCheck = [
    "llm"
  ];

  build-system = [
    setuptools
  ];

  dependencies = [
    pluggy
    openai
    click-default-group
    sqlite-utils
    sqlite-migrate
    pydantic
    pyyaml
    python-ulid
    pip
    puremagic
  ];
}

And llm-gemini.nix is the configuration for the llm-gemini plugin:

{
  buildPythonPackage,
  fetchPypi,
  setuptools,
  llm,
  httpx,
  ijson,
}:

buildPythonPackage rec {
  pname = "llm_gemini";
  version = "0.8";
  pyproject = true;

  src = fetchPypi {
    inherit pname version;
    sha256 = "sha256-AaBeQG5JLgVXjZ3weW9FmwGKCAvTYiEPJeWxP+n3BUA=";
  };

  doCheck = false;

  build-system = [
    setuptools
  ];

  dependencies = [
    llm
    httpx
    ijson
  ];
}

For other plugins (for other LLMs), I need to write similar configurations.

The runner script is quite similar to the one used for OpenAI. I also define the list of models to use:

import llm
import time

import prompt_api


MODELS = [
    "gemini-pro",
    "gemini-1.5-pro-002",
    "gemini-1.5-flash-002",
    "gemini-1.5-flash-8b-001",
    "gemini-2.0-flash-exp",
    "gemini-2.0-flash-thinking-exp-1219",
    # ...
]

This is a subset of the models that are available, discovered via:

def debug_print_models():
    print("Available models:")
    for m in llm.get_models():
        print(m.model_id)

The important changes are in get_answer:

def get_answer(model_id, prompt, n=10):
    print(f">>> {model_id}, {repr(prompt)}")
    model = llm.get_model(model_id)
    for _ in range(n):
        wait_time = 1
        while True:
            try:
                response = model.prompt(prompt)
                print(response)
                print("----")
                break  # while
            except Exception as e:
                print(e, wait_time)
                time.sleep(wait_time)
                wait_time *= 2

The library does not offer a way to distinguish between exception types (to tell quota exhaustion, lack of credits, rate limits, etc. apart), so instead I implemented proper exponential back-off for every exception, and I monitor the script while it runs so I can stop it on other errors.

Also, the library only supports getting one completion per prompt, so I need to loop manually to get the required n=10.

The running loop is much simpler too:

for model in MODELS:
    print(f"\n=========== {model} ===========")
    for prompt in prompt_api.build_prompt_strings():
        get_answer(model, prompt)

This is just the basics of integrating the llm library into the scaffolding. For some models, I might want to run with additional options, in which case get_answer and the main loop would change accordingly; a sketch of such a variant follows, but it does not make sense to list all of these variations in this article.
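
For example, a hypothetical variant could forward per-model options to model.prompt (whether an option such as temperature is accepted depends on the specific model and plugin):

def get_answer_with_options(model_id, prompt, n=10, **options):
    # Hypothetical variant of get_answer that forwards extra keyword options
    # (e.g. temperature) to the model.
    print(f">>> {model_id}, {repr(prompt)}, {options}")
    model = llm.get_model(model_id)
    for _ in range(n):
        wait_time = 1
        while True:
            try:
                response = model.prompt(prompt, **options)
                print(response)
                print("----")
                break  # while
            except Exception as e:
                print(e, wait_time)
                time.sleep(wait_time)
                wait_time *= 2


# A possible call, with a lower temperature:
# get_answer_with_options("gemini-1.5-pro-002", prompt, temperature=0.2)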

With this, we now have the entire scaffolding to run all the experiments, so I will end this article here. Tomorrow, I’ll post the contents of the puzzles. See you!


  1. I would be remiss not to mention the long hiatus here. After the graph database article, I decided to spend some time refactoring the blog to make it more readable. I managed to implement some of the changes but, before I could implement the larger ones, other things got in the way. As mentioned in my first-of-January posts on social media, this year I want to make posting here more of a priority.↩︎

