Testing 80 LLMs on spatial reasoning on grids
In the last article I introduced two problems that I was curious to see how LLMs would answer. For that article I ran a simple Kaggle experiment, but now it’s time for a more comprehensive study.
To summarize, the problems involve laying out the natural numbers on a spiral over a grid, locating the position of 2026, and then summing up its neighbors. The only difference between the two problems is that the first one uses a square grid (8 neighbors) whereas the second one uses a hexagonal grid (6 neighbors). The full content of the prompts is in the last article.
Before continuing, how do you think the models did? Which of the problems was more difficult? The results will surprise you…
The remainder of the article is split into four parts. I’ll start with a description of the test harness, then discuss the experimental results. Next, there is a short section where I analyze some of the outputs. Finally, there is a conclusions section, summarizing this article and discussing next steps. You can use any of these links to skip to the corresponding section.
The test harness 🔗
As the Kaggle experiment only allowed me to test a small number of models (there are more available now), and only for a single iteration (I was told that this will change in the future), I implemented a test harness around the llm library.
Since I use NixOS, I first need to define the shell.nix file.
Unlike last year, most of the packages are already in the
nixpkgs repo, so the setup is simpler:
let
pkgs = import <nixpkgs> {};
in
pkgs.mkShell {
packages = with pkgs; [
(python3.withPackages (pypkgs: with pypkgs; [
llm
llm-gemini
llm-perplexity
llm-mistral
llm-grok
llm-anthropic
]))
];
shellHook = ''
export PS1="[\[\033[01;32m\]nix-shell\[\033[00m\]:\W] \[\033[01;32m\]λ\[\033[00m\] "
'';
}

This gives me the following versions:
[...] λ pip list | grep llm
llm 0.27.1
llm-anthropic 0.20
llm-gemini 0.27
llm-grok 1.3.0
llm-mistral 0.15
llm-perplexity 2025.10.0

Note that these were the versions at the time of writing this article, which is a few weeks after actually running the experiments. It is thus possible that some of these versions have changed – one of my tasks for this year is to migrate to a reproducible experiment harness for all the experiments mentioned in this blog.
With the above setup and some additional changes in
extra-openai-models.yaml, as per the llm documentation, we
can experiment with models from various providers.
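As a quick sanity check that the setup works, a model can be queried directly from Python. This is just a minimal sketch, not part of the harness; the model slug and the question are arbitrary examples:

import llm

# Minimal smoke test: ask one model a question and print the reply.
# "gpt-4o-mini" is only an example slug; any model listed by `llm models`
# after the setup above (and with the right API key configured) works.
model = llm.get_model("gpt-4o-mini")
response = model.prompt("In one sentence, what is a hexagonal grid?")
print(response.text())
print(response.usage())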
But I don’t want to run against every model name that becomes available. There are some models to which I don’t have access yet – I need to spend more with the provider to reach a higher access tier, just like I couldn’t test o1 in last year’s experiments. There are models that show up both under their canonical name and under several aliases. There are models that are already deprecated: either listed as such in the documentation, or the API outright refuses calls for that specific slug. More importantly, there are models that are not suitable for the task – they expect a different modality. There are a few Gemma models I couldn’t run due to a bug in the llm library. Finally, I also wanted to exclude some models that are clones of existing ones, very expensive, or unsuitable for other reasons. So, I have this big global in the script:
_forbidden_models = set({
# Missing access to model
"gpt-4-32k",
...
# aliases
"gpt-3.5-turbo-16k", # gpt-3.5-turbo
...
"grok-3-mini-fast-latest", # grok-3-mini-latest
# deprecated (+ aliases)
"o1-preview",
...
"gemini/gemini-2.5-pro-preview-06-05",
# Not present in the official documentation (deprecated)
"gemini/gemini-pro",
...
"gemini/gemini-2.5-flash-preview-04-17",
# Task not suitable (e.g., models are audio)
"gpt-4o-audio-preview",
...
"grok-2-vision-latest",
# Media resolution bug (https://github.com/simonw/llm-gemini/issues/116)
"gemini/gemma-3-1b-it",
"gemini/gemma-3-4b-it",
"gemini/gemma-3-12b-it",
"gemini/gemma-3-27b-it",
"gemini/gemma-3n-e4b-it",
# Other reasons
"r1-1776", # DeepSeek model, different provider
"grok-2-latest", # Old and costly
"grok-4-latest", # Very slow, and costly ($1.16 for only 4 answers in over 12 hours)
})

At this point, I am left with exactly 80 models I can run on.
The test harness uses click and rich to build a nicer interface. This lets me define command line arguments so that I can run the harness on only a subset of the models, and it shows progress in a much friendlier way – last year I was tailing log files, and I never want to do that again. Hence, the outer layer of the harness looks like the following:
import datetime
import logging
import pathlib
import time
from typing import Collection, Tuple

import click
import llm
from rich import logging as rich_logging
from rich import progress as rich_progress

_logger = logging.getLogger(__name__)
@click.command()
@click.option(
"--log_level",
default=logging.INFO,
help="Log verbosity level.",
)
@click.option(
"-p",
"--prompts",
type=click.File("r"),
help="File(s) with prompt text.",
multiple=True,
)
@click.option(
"-m",
"--models",
type=str,
metavar="MODEL",
help="Models to query.",
multiple=True,
)
@click.option(
"-n",
"--rounds",
type=int,
metavar="ROUNDS",
help="Number of rounds of querying.",
default=10,
)
def main(
log_level: int,
prompts: Collection[click.File],
models: Collection[str],
rounds: int,
) -> None:
"""The main entry point."""
_setup_logging(log_level)
_logger.info("Prompt files: %s", prompts)
_logger.info("Models asked for: %s", models)
_logger.info("Number of rounds: %d", rounds)
if not prompts:
_logger.debug("No prompt given, will just list the models..")
click.echo("Available models:")
for model in llm.get_models():
if model.model_id not in _forbidden_models:
click.echo(f"\t{model.model_id}")
raise SystemExit
if models:
models = [llm.get_model(m) for m in models]
else:
models = []
for model in llm.get_models():
if model.model_id not in _forbidden_models:
models.append(model)
_run_prompts(prompts, models, rounds)
def _setup_logging(level: int = logging.INFO) -> None:
"""Configures the global logging for the entire run."""
global _logger
handler = rich_logging.RichHandler(
show_level=False, show_time=False, show_path=False
)
handler.setFormatter(
logging.Formatter(
fmt="%(asctime)s.%(msecs)d - %(levelname).1s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
)
)
_logger.addHandler(handler)
_logger.setLevel(level)
if __name__ == "__main__":
main()

The main part is in _run_prompts. Having a list of prompts and
models, I set up the entire TUI (well, just progress bars), and then call the
proper APIs, check the answer and record it:
def _run_prompts(
prompts: Collection[click.File],
models: Collection[llm.Model],
rounds: int,
) -> None:
"""Runs all prompts on all models."""
now = datetime.datetime.today()
root_folder = pathlib.Path(now.strftime("%y%m%d-%H%M%S"))
num_model_runs = len(models) * rounds
num_tasks = len(prompts) * num_model_runs
with rich_progress.Progress(
rich_progress.TextColumn("{task.description}"),
rich_progress.TaskProgressColumn(),
rich_progress.BarColumn(None),
rich_progress.MofNCompleteColumn(),
rich_progress.TextColumn("•"),
rich_progress.TimeElapsedColumn(),
rich_progress.TextColumn("/"),
rich_progress.TimeRemainingColumn(),
) as pbar:
total_task = pbar.add_task("Total run progress", total=num_tasks)
prompt_task = pbar.add_task("placeholder", total=num_model_runs)
model_task = pbar.add_task("placeholder", total=rounds)
score_task = pbar.add_task("Model score on task", total=rounds)
for prompt in prompts:
prompt_name = _canonicalize(prompt.name)
prompt_folder = root_folder / prompt_name
pbar.reset(prompt_task, description=prompt_name)
prompt_text, expected_answer = _read_prompt(prompt)
for model in models:
model_name = _canonicalize(model.model_id)
model_folder = prompt_folder / model_name
model_folder.mkdir(parents=True)
pbar.reset(model_task, description=model_name)
pbar.reset(score_task)
correct = 0
pbar.reset(score_task)
for i in range(rounds):
out = model_folder / f"{i}"
if _get_answer(prompt_text, expected_answer, model, out):
correct += 1
pbar.update(score_task, advance=1)
pbar.update(model_task, advance=1)
pbar.update(prompt_task, advance=1)
pbar.update(total_task, advance=1)
_logger.info(
f"{correct}/{rounds} for {model_name} on {prompt_name}"
)
(model_folder / "score").write_text(f"{correct}\n")

The TUI logs the score for each model, so it is easy for me to collect the
performance metrics. I also store the score in a simple score file, to make
it easier to merge results from completed runs.
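Merging these score files back into a single summary afterwards is then straightforward. Here is a minimal sketch of the idea (collect_scores is a hypothetical helper, and the timestamped folder name is just an example of the layout produced above):

from pathlib import Path

def collect_scores(root: Path) -> dict[str, dict[str, int]]:
    """Collects the per-model scores written by the harness.

    Expects the layout <root>/<prompt>/<model>/score, where each score
    file holds a single integer, as written by _run_prompts above.
    """
    scores: dict[str, dict[str, int]] = {}
    for score_file in root.glob("*/*/score"):
        prompt_name = score_file.parent.parent.name
        model_name = score_file.parent.name
        scores.setdefault(model_name, {})[prompt_name] = int(
            score_file.read_text().strip()
        )
    return scores

# For example, collect_scores(Path("260117-093000")) would return something
# like {"gpt-5": {"prompts_square": 0, "prompts_hexagon": 10}, ...}.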
The prompts are stored in two separate files, with the last line of each file containing the expected answer. This is the prompt for the square grid problem:
The natural numbers are written on a spiral laid out on a square grid:
... 5 4 3 ...
... 6 1 2 ...
... 7 8 9 ...
What is the sum of all 8 neighbors of the number 2026?
16408
And this is the prompt for the hexagonal grid problem:
The natural numbers are written on a spiral laid out on a hexagonal grid:
... 14 4 3 10 ...
.. 15 5 1 2 9 ...
... 16 6 7 8 ...
What is the sum of all 6 neighbors of the number 2026?
12168
I then need a helper that reads these files and extracts the prompt and the expected answer as two separate values:
def _read_prompt(prompt: click.File) -> Tuple[str, int]:
"""Parses a prompt file, extracting the prompt and the answer."""
lines = [l.rstrip() for l in prompt.readlines()]
return "\n".join(lines[:-1]), int(lines[-1])Since the output of every model is stored in a directory named after the model
slug, which may contain characters that are not valid for a path, I have this
simple helper which just replaces / with _:
def _canonicalize(name: str) -> str:
"""Makes `name` be usable as a filename/task."""
return name.replace("/", "_")

The last thing left to do is to actually prompt the model and check the answer. Rather than checking the answer manually, like I did last year, I attempted to parse the answer inline and reconstruct the numbers contained in the output. The expectation was that the last number in the output is the answer that the LLM gives, but this was naive:
def _get_answer(
prompt: str,
expected: int,
model: llm.Model,
log: pathlib.Path,
) -> bool:
"""Queries `model` on `prompt`, expecting `expected`, loging to `log`."""
wait_time = 1
while True:
try:
with log.open('w') as f:
response = model.prompt(prompt)
last_number, current_number = 0, 0
for part in response:
f.write(part)
for c in part:
if c in "0123456789":
current_number = 10 * current_number + int(c)
elif current_number:
last_number, current_number = current_number, 0
if current_number:
last_number = current_number
f.write("\n----\n")
f.write(f"{response.usage()}\n")
return last_number == expected
except Exception as e:
_logger.exception(
"Model %s failed, wait %d seconds", model.model_id, wait_time
)
time.sleep(wait_time)
wait_time *= 2

There are two reasons why the above reconstruction scheme does not work. First, some models put references at the end of the output, using a [1] <link> format, or append other content that also contains numbers. More importantly, some models use a thousands separator while others don’t. In the end, the number that I get in the log from the above code is just a lower bound on how well the model performs. I still have to inspect the output files to see whether they contain the correct answer anywhere.
But, this can be done with a simple grep:
[...] λ rg '12168|12,168' results/prompts_hexagon/ -l | cut -d/ -f3 | sort | uniq -c
...

Besides doing exponential backoff in _get_answer, I also record the output of response.usage() at the end of each model’s output. This contains the number of tokens used for input, output, and thinking, as well as some additional information defined by each provider. It was mostly useful for apportioning the costs shown in the GCP interface to the individual Gemini models; this was the only case where I couldn’t get a quick overview of what each model cost, as all other providers gave me either total costs or costs broken down into input and output.
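For future runs, the inline check could be made more robust by normalizing thousands separators before looking for the expected answer. A minimal sketch of the idea (contains_answer is a hypothetical helper, not part of the harness above):

import re

def contains_answer(text: str, expected: int) -> bool:
    """Returns True if `expected` appears anywhere in `text` as a number.

    Matches both plain numbers (12168) and numbers written with a
    thousands separator (12,168), mirroring the rg pattern above.
    """
    numbers = {
        int(match.replace(",", ""))
        for match in re.findall(r"\d{1,3}(?:,\d{3})+|\d+", text)
    }
    return expected in numbers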
With this, we can now move to analyzing the results. Be prepared to be surprised.
The results 🔗
Let’s start this section with a table of how the 80 models performed on the two problems. I ran each prompt on each model a total of 10 times, and considered a response to be correct if the expected answer appeared anywhere in the output, even if the model’s final answer ended up being wrong.
| Model | square | hexagon |
|---|---|---|
| chatgpt-4o-latest | 0 | 2 |
| gpt-3.5-turbo | 0 | 0 |
| gpt-4 | 0 | 0 |
| gpt-4-0125-preview | 0 | 0 |
| gpt-4-1106-preview | 0 | 0 |
| gpt-4-turbo | 0 | 0 |
| gpt-4.1 | 0 | 3 |
| gpt-4.1-mini | 0 | 6 |
| gpt-4.1-nano | 0 | 0 |
| gpt-4o | 0 | 0 |
| gpt-4o-mini | 0 | 0 |
| gpt-5 | 0 | 10 |
| gpt-5-mini | 0 | 9 |
| gpt-5-nano | 0 | 9 |
| o1 | 0 | 4 |
| o3 | 0 | 6 |
| o3-mini | 0 | 8 |
| o4-mini | 0 | 5 |
| openai/gpt-oss-120b | 0 | 8 |
| openai/gpt-oss-120b-Turbo | 0 | 1 |
| openai/gpt-oss-20b | 0 | 3 |
| gemini/gemini-2.0-flash | 0 | 0 |
| gemini/gemini-2.0-flash-exp | 0 | 0 |
| gemini/gemini-2.0-flash-lite | 0 | 1 |
| gemini/gemini-2.5-flash | 0 | 5 |
| gemini/gemini-2.5-flash-lite-preview-09-2025 | 0 | 0 |
| gemini/gemini-2.5-flash-preview-09-2025 | 2 | 2 |
| gemini/gemini-2.5-pro | 1 | 2 |
| gemini/gemini-3-pro-preview | 0 | 10 |
| gemini/gemini-flash-latest | 0 | 4 |
| gemini/gemini-flash-lite-latest | 0 | 0 |
| google/gemma-3-12b-it | 0 | 0 |
| google/gemma-3-27b-it | 0 | 0 |
| google/gemma-3-4b-it | 0 | 0 |
| anthropic/claude-3-7-sonnet-latest | 0 | 1 |
| anthropic/claude-3-haiku-20240307 | 0 | 0 |
| anthropic/claude-haiku-4-5-20251001 | 0 | 3 |
| anthropic/claude-opus-4-0 | 0 | 2 |
| anthropic/claude-opus-4-1-20250805 | 0 | 1 |
| anthropic/claude-sonnet-4-0 | 0 | 1 |
| anthropic/claude-sonnet-4-5 | 0 | 0 |
| mistral/codestral-latest | 0 | 0 |
| mistral/ministral-3b-latest | 0 | 0 |
| mistral/ministral-8b-latest | 0 | 0 |
| mistral/mistral-large-latest | 0 | 1 |
| mistral/mistral-medium-latest | 0 | 0 |
| mistral/mistral-small-latest | 0 | 0 |
| mistral/open-mistral-nemo | 0 | 0 |
| mistral/pixtral-large-latest | 0 | 0 |
| meta-llama/Llama-3.2-3B-Instruct | 0 | 0 |
| meta-llama/Llama-3.3-70B-Instruct-Turbo | 0 | 0 |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 0 | 0 |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | 0 | 0 |
| deepseek-ai/DeepSeek-3.2 | 0 | 1 |
| deepseek-ai/DeepSeek-R1-0528 | 0 | 4 |
| deepseek-ai/DeepSeek-R1-0528-Turbo | 0 | 5 |
| deepseek-ai/DeepSeek-V3.1-Terminus | 0 | 2 |
| moonshotai/Kimi-K2-Instruct-0905 | 0 | 0 |
| moonshotai/Kimi-K2-Thinking | 0 | 10 |
| grok-3-latest | 0 | 0 |
| grok-3-mini-latest | 0 | 0 |
| grok-4-fast | 0 | 10 |
| grok-4-fast-non-reasoning-latest | 0 | 0 |
| grok-code-fast-1 | 0 | 0 |
| sonar | 0 | 1 |
| sonar-deep-research | 0 | 0 |
| sonar-pro | 0 | 0 |
| sonar-reasoning-pro | 0 | 0 |
| Qwen/Qwen3-235B-A22B-Thinking-2507 | 0 | 1 |
| Qwen/Qwen3-32B | 0 | 0 |
| Qwen/Qwen3-Next-80B-A3B-Instruct | 0 | 0 |
| microsoft/WizardLM-2-8x22B | 0 | 0 |
| microsoft/phi-4 | 0 | 0 |
| nvidia/Nemotron-3-Nano-30B-A3B | 0 | 0 |
| zai-org/GLM-4.6 | 0 | 9 |
| zai-org/GLM-4.6V | 0 | 0 |
| zai-org/GLM-4.7 | 0 | 10 |
| NousResearch/Hermes-3-Llama-3.1-405B | 0 | 0 |
| NousResearch/Hermes-3-Llama-3.1-70B | 0 | 0 |
| allenai/Olmo-3.1-32B-Instruct | 0 | 0 |
We have a huge surprise! Models performed much better on the hexagon problem than on the square one.
I first checked the expected answer for the square grid. As shown in the previous article, this should be 16408, and that was the answer that I expected from the prompt description. I triple-checked the math and it was correct.
Next, I checked whether any new models were released between running the square and the hexagon problems. I ran the harness for the square problem on and around January 17th and for the hexagon problem on and around January 20th, but I couldn’t see anything relevant, so I reran some of the prompts to check. The new results were similar.
Thus, I can conclude that the square grid problem is harder than the hexagonal one. In the next section, we’ll look at why this is the case by checking some of the outputs.
Before doing that, let’s also look at these results in a graphical form:
I only added labels for the interesting models in the plot.
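The plots themselves come from a separate notebook. Purely as an illustration of the kind of chart used here, a score plot could be produced along these lines, assuming the scores dict from the collect_scores sketch earlier (both are hypothetical helpers, not the notebook code):

import matplotlib.pyplot as plt

def plot_scores(scores: dict[str, dict[str, int]]) -> None:
    """Draws a grouped bar chart of per-model scores on the two problems."""
    names = sorted(scores)
    square = [scores[n].get("prompts_square", 0) for n in names]
    hexagon = [scores[n].get("prompts_hexagon", 0) for n in names]
    xs = range(len(names))
    fig, ax = plt.subplots(figsize=(16, 4))
    ax.bar([x - 0.2 for x in xs], square, width=0.4, label="square")
    ax.bar([x + 0.2 for x in xs], hexagon, width=0.4, label="hexagon")
    ax.set_xticks(list(xs))
    ax.set_xticklabels(names, rotation=90, fontsize=6)
    ax.set_ylabel("correct answers (out of 10)")
    ax.legend()
    fig.tight_layout()
    plt.show()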
Next, let’s also look at the costs. This time I was able to determine the cost of running each model on each problem, since there was a time gap between the two runs and I monitored the providers’ dashboards. Thus, we can build the following table:
| Model | square | hexagon | total |
|---|---|---|---|
| chatgpt-4o-latest | 0.338 | 0.234 | 0.572 |
| gpt-3.5-turbo | 0.010 | 0.004 | 0.014 |
| gpt-4 | 0.150 | 0.219 | 0.369 |
| gpt-4-0125-preview | 0.287 | 0.265 | 0.552 |
| gpt-4-1106-preview | 0.232 | 0.197 | 0.429 |
| gpt-4-turbo | 0.434 | 0.195 | 0.629 |
| gpt-4.1 | 0.426 | 0.230 | 0.656 |
| gpt-4.1-mini | 0.071 | 0.079 | 0.150 |
| gpt-4.1-nano | 0.017 | 0.014 | 0.031 |
| gpt-4o | 0.128 | 0.087 | 0.215 |
| gpt-4o-mini | 0.006 | 0.007 | 0.013 |
| gpt-5 | 0.954 | 2.393 | 3.347 |
| gpt-5-mini | 0.134 | 0.369 | 0.503 |
| gpt-5-nano | 0.048 | 0.084 | 0.132 |
| o1 | 9.082 | 10.227 | 19.309 |
| o3 | 0.736 | 2.127 | 2.863 |
| o3-mini | 0.417 | 0.544 | 0.961 |
| o4-mini | 0.316 | 0.701 | 1.017 |
| openai/gpt-oss-120b | 0.010 | 0.010 | 0.020 |
| openai/gpt-oss-120b-Turbo | 0.050 | 0.060 | 0.110 |
| openai/gpt-oss-20b | 0.020 | 0.030 | 0.050 |
| gemini/gemini-2.0-flash | 0.005 | 0.004 | 0.009 |
| gemini/gemini-2.0-flash-exp | 0.000 | 0.000 | 0.000 |
| gemini/gemini-2.0-flash-lite | 0.004 | 0.004 | 0.008 |
| gemini/gemini-2.5-flash | 0.326 | 0.792 | 1.118 |
| gemini/gemini-2.5-flash-lite-preview-09-2025 | 0.036 | 0.021 | 0.056 |
| gemini/gemini-2.5-flash-preview-09-2025 | 0.326 | 0.557 | 0.883 |
| gemini/gemini-2.5-pro | 1.848 | 2.173 | 4.021 |
| gemini/gemini-3-pro-preview | 0.131 | 0.143 | 0.273 |
| gemini/gemini-flash-latest | 0.326 | 0.495 | 0.821 |
| gemini/gemini-flash-lite-latest | 0.038 | 0.016 | 0.054 |
| google/gemma-3-12b-it | 0.000 | 0.000 | 0.000 |
| google/gemma-3-27b-it | 0.000 | 0.000 | 0.000 |
| google/gemma-3-4b-it | 0.000 | 0.000 | 0.000 |
| anthropic/claude-3-7-sonnet-latest | 0.090 | 0.110 | 0.200 |
| anthropic/claude-3-haiku-20240307 | 0.010 | 0.000 | 0.010 |
| anthropic/claude-haiku-4-5-20251001 | 0.070 | 0.060 | 0.130 |
| anthropic/claude-opus-4-0 | 1.300 | 0.790 | 2.090 |
| anthropic/claude-opus-4-1-20250805 | 1.020 | 0.850 | 1.870 |
| anthropic/claude-sonnet-4-0 | 0.200 | 0.180 | 0.380 |
| anthropic/claude-sonnet-4-5 | 0.200 | 0.190 | 0.390 |
| mistral/codestral-latest | 0.048 | 0.092 | 0.140 |
| mistral/ministral-3b-latest | 0.014 | 0.004 | 0.019 |
| mistral/ministral-8b-latest | 0.072 | 0.006 | 0.078 |
| mistral/mistral-large-latest | 0.084 | 0.053 | 0.137 |
| mistral/mistral-medium-latest | 0.105 | 0.069 | 0.175 |
| mistral/mistral-small-latest | 0.025 | 0.012 | 0.037 |
| mistral/open-mistral-nemo | 0.037 | 0.000 | 0.037 |
| mistral/pixtral-large-latest | 0.059 | 0.061 | 0.120 |
| meta-llama/Llama-3.2-3B-Instruct | 0.000 | 0.000 | 0.000 |
| meta-llama/Llama-3.3-70B-Instruct-Turbo | 0.000 | 0.010 | 0.010 |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 0.010 | 0.010 | 0.020 |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | 0.000 | 0.010 | 0.010 |
| deepseek-ai/DeepSeek-3.2 | 0.020 | 0.020 | 0.040 |
| deepseek-ai/DeepSeek-R1-0528 | 0.520 | 0.730 | 1.250 |
| deepseek-ai/DeepSeek-R1-0528-Turbo | 0.630 | 0.880 | 1.510 |
| deepseek-ai/DeepSeek-V3.1-Terminus | 0.030 | 0.030 | 0.060 |
| moonshotai/Kimi-K2-Instruct-0905 | 0.070 | 0.060 | 0.130 |
| moonshotai/Kimi-K2-Thinking | 0.610 | 1.250 | 1.860 |
| grok-3-latest | 0.467 | 0.204 | 0.671 |
| grok-3-mini-latest | 0.036 | 0.025 | 0.061 |
| grok-4-fast | 0.048 | 0.084 | 0.132 |
| grok-4-fast-non-reasoning-latest | 0.005 | 0.001 | 0.006 |
| grok-code-fast-1 | 0.795 | 0.998 | 1.794 |
| sonar | 0.060 | 0.050 | 0.110 |
| sonar-deep-research | 11.950 | 11.750 | 23.700 |
| sonar-pro | 0.140 | 0.180 | 0.320 |
| sonar-reasoning-pro | 0.210 | 0.150 | 0.360 |
| Qwen/Qwen3-235B-A22B-Thinking-2507 | 0.640 | 0.780 | 1.420 |
| Qwen/Qwen3-32B | 0.050 | 0.040 | 0.090 |
| Qwen/Qwen3-Next-80B-A3B-Instruct | 0.090 | 0.100 | 0.190 |
| microsoft/WizardLM-2-8x22B | 0.020 | 0.000 | 0.020 |
| microsoft/phi-4 | 0.000 | 0.000 | 0.000 |
| nvidia/Nemotron-3-Nano-30B-A3B | 0.500 | 0.550 | 1.050 |
| zai-org/GLM-4.6 | 0.370 | 1.120 | 1.490 |
| zai-org/GLM-4.6V | 0.150 | 0.140 | 0.290 |
| zai-org/GLM-4.7 | 0.480 | 1.260 | 1.740 |
| NousResearch/Hermes-3-Llama-3.1-405B | 0.010 | 0.000 | 0.010 |
| NousResearch/Hermes-3-Llama-3.1-70B | 0.000 | 0.010 | 0.010 |
| allenai/Olmo-3.1-32B-Instruct | 0.070 | 0.080 | 0.150 |
We can turn these numbers into graphs showing how each model performed. First, looking at just the square problem:
Next, looking at just the hexagon problem:
Finally, we can build a graph where we have the total cost and the average performance on the two problems.
We can also look at the Pareto frontier:
As a list, the models that are on the frontier are:
- grok-4-fast (score: 10.0, cost: $0.132)
- openai/gpt-oss-120b (score: 8.0, cost: $0.02)
- gemini/gemini-2.0-flash-lite (score: 1.0, cost: $0.008)
Grok gets a perfect score on the hexagon problem, while the Gemini models are very cheap. I plan to run more experiments on these problems in the future, to get more data here.
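For completeness, here is roughly how such a frontier can be computed from the data above. This is only a sketch, not the notebook code, and it assumes a results dict mapping each model name to a (total_cost, score) pair:

def pareto_frontier(results: dict[str, tuple[float, float]]) -> list[str]:
    """Returns the models for which no other model is at least as cheap and strictly better."""
    frontier: list[str] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive (ties broken by higher score first)
    # and keep a model only if it beats the best score seen so far.
    for name, (cost, score) in sorted(
        results.items(), key=lambda kv: (kv[1][0], -kv[1][1])
    ):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier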
Looking at some LLM outputs 🔗
Overall, the experiments produced quite a lot of output:
[...] λ wc prompts_square/*/* | grep total
190481 1411059 7482216 total
[...] λ wc prompts_hexagon/*/* | grep total
169599 1245219 6745013 total

Rather than going through all of the files and listing mistakes from each, I sampled a significant number of them and analyzed those outputs.
Looking at the square problem, most models get the idea that in order to solve it they need to locate the ring that 2026 sits on. They understand that they need to determine the coordinates of the corners of the rings – some look at the highest number on each ring, some look at the sizes of the rings. But when the models get to determining the neighbors of 2026, they run into trouble:
- (22, -21) → 2024
- (23, -21) → 2207
- (24, -21) → 2398
- (22, -22) → 2025
- (24, -22) → 2399
- (22, -23) → 2026
- (23, -23) → 2025
- (24, -23) → 2209
Even when not listing the coordinates, the neighbors are incorrect:
Now list the 8 neighbor values:
- 2212
- 2211
- 2210
- 2027
- 2209
- 1850
- 1849
- 1848
In some cases, the model attempts to build the grid, but still gets it wrong:
2034 2033 2032 2027 2026 2031 2028 2029 2030
A few models make a math mistake well before this point, operating with numbers that they shouldn’t have:
Corner numbers:
- Top left corner = 522^2 - 21 * 1 = 271441 - 21 = 271420
- Top right corner = 522^2 - 21 * 3 = 271441 - 63 = 271378
- Bottom left corner = 522^2 - 21 * 5 = 271441 - 105 = 271336
- Bottom right corner = 522^2 - 21 * 7 = 271441 - 147 = 271294
Midpoint numbers:
- Directly above = 271441 - 42 = 271399
- Directly below = 271441 + 42 = 271483
- Directly left = 271441 - 84 = 271357
- Directly right = 271441 + 84 = 271525
There are also models that don’t use the grid as given in the prompt. They either add a new number and change the orientation:
..22 21 20 19 18 23.. 6 1 2 17 24.. 7 0 3 16 25.. 8 9 4 15 26 27 10 11 12 13
or don’t understand the grid concept:
... 5 4 3 ... ... 6 1 2 ... ... 7 8 9 ... ... 12 11 10 ... ... 13 14 15 ...
Some even forget that we are talking about a square grid:
1 2 3 4 5 6 7 8 9 10
Finally, some models hallucinate an entire neighborhood around 2026:
... 1992 1991 1990 1989 1988 1987 ... 1993 1744 1743 1742 1741 1740 1986 ... 1994 1745 1500 1499 1498 1739 1985 ... 1995 1746 1501 1264 1263 1497 1984 ... 1996 1747 1502 1265 2026 1262 1983 ... 1997 1748 1503 1266 1267 1496 1982 ... 1998 1749 1504 1505 1506 1495 1981
Moving on to the hexagonal grid problem, we see similar patterns. The models are able to find the patterns for the corners of the rings and then locate the position of 2026. There are still mistakes in positioning the numbers, although they are less frequent:
2027 2028 ... 2025 2026 2010 2019 2020 2021 2022 2018 2012 2011 2015 2017 2016 2025 2024 ...
There are still cases when the model hallucinates the starting grid:
15 16 17 18 ... 4 1 2 19 20 ... 14 5 2026 7 8 ... 13 6 3 9 10 ... 12 11 10 ...
And sometimes they get into even crazier outputs:
28 27 26 25 24 1 3 5 7 9 23 2 4 6 8 10 22 12 14 16 18 20 11 13 15 17 19 21 30 32 34 36 38 40 29 31 33 35 37 2021 2019 2017 2015 2022 2020 2018 2016 2024 2026 2030 2023 2025 2027 2028 2032 2029 2031
Looking at the helper code that they generate, some models are able to generate code using 3D coordinates for the hexagonal grid:
DIRECTIONS = [
(1, -1, 0), # East
(1, 0, -1), # NE
(0, 1, -1), # NW
(-1, 1, 0), # West
(-1, 0, 1), # SW
(0, -1, 1), # SE
]

One thing the models tried to do was to determine a formula for the sum:
Let \(n\) be a number in the hexagonal grid. Then the sum of its 6 neighbors is \(6n - 6 + 6 = 6n\). The sum of the 6 neighbors of \(n\) is \(6n - 6\).
\(6 \times 2026 - 6 = 12156 - 6 = 12150\).
However, this formula is not correct. In several cases, though, the models did reach the correct one:
Alternatively, for a number \(N\) on a side of the hexagon (and not a corner), the sum of its 6 neighbors is \(6N + 12\).
\(6(2026) + 12 = 12156 + 12 = 12168\).
I haven’t sampled an answer for the square problem that does something similar, and this might be one reason why the hexagonal grid problem seems easier to solve. It is also possible that no formula showed up in the square case because any formula a model might derive is not valid near the corners.
Another possible reason why the hexagonal grid was easier to solve might be that I included more numbers in that prompt than in the square one. I’m curious how much this matters.
One more reason why the hexagon problem is easier to solve is that the models are cheating: they are able to look up OEIS sequences:
Searching OEIS or known data, we find:
This is a more common occurrence for the hexagonal grid than for the square one. Under the assumption that models that don’t look up the answer make mistakes during computation, this could be another reason why the hexagon problem has more correct solvers.
Note though that some of the sequences looked up are irrelevant: some are for modular arithmetic, some involve matrices on the dihedral group \(D_4\) and so on.
However, while the models are able to find the answer for the hexagonal grid this way, they have some issues on the square one. There are two important sequences that the models discovered, A174344 and A274923. The first one rotates in the opposite direction, and both of them start with a 0 instead of a 1.
At first glance, rotating the grid might prevent the models from cheating. However, a rotation does not change which numbers are neighbors of which, so the answers would stay the same; even if a model cheats, it will still get a correct answer. Introducing a 0 at the start might change things, but before testing that I’ll have to check whether there really are no hexagonal grid sequences in OEIS that start with a 0 (and no square grid sequences that start from 1).
Conclusions 🔗
I got many unexpected results while running this experiment. First, I thought that the hexagon problem would be harder, but models scored better on it. Then, models that people claimed to be better turned out not to score as high.
More importantly, I got models that cheated, as well as models that found that the answer can be given by a simple formula, after checking that the formula applies.
I have more experiments to run, but those are for a later time. Until then, I’m publishing the entire code harness, the results, and the notebook used to generate the plots on my demos repo. Right now, this is the first entry there, but I intend to add more.