Testing 80 LLMs on spatial reasoning on grids
In the last article I introduced two problems that I was curious to see how LLMs would answer. For that article I ran a simple Kaggle experiment, but now it’s time for a more comprehensive study.
To summarize, the problems involve laying out the natural numbers on a spiral over a grid, locating the position of 2026, and then summing up its neighbors. The only difference between the two problems is that the first one uses a square grid (8 neighbors) whereas the second one uses a hexagonal grid (6 neighbors). The full content of the prompts is in the last article.
Before continuing, how do you think the models did? Which of the problems was more difficult? The results will surprise you…
The remainder of the article is split into four parts. I’ll start with a description of the test harness, then discuss the experimental results. Next, there is a short section where I analyze some of the outputs. Finally, there is a conclusions section, summarizing this article and discussing next steps. You can use any of these links to skip to the corresponding section.
The test harness 🔗
As the Kaggle experiment only allowed me to test a small number of models (there are more available now), and only for a single iteration (I was told that this will change in the future), I implemented a test harness around the llm library.
Since I use NixOS, I first need to define the shell.nix file.
Unlike last year, most of the packages are already in the
nixpkgs repo, so the setup is simpler:
let
pkgs = import <nixpkgs> {};
in
pkgs.mkShell {
packages = with pkgs; [
(python3.withPackages (pypkgs: with pypkgs; [
llm
llm-gemini
llm-perplexity
llm-mistral
llm-grok
llm-anthropic
]))
];
shellHook = ''
export PS1="[\[\033[01;32m\]nix-shell\[\033[00m\]:\W] \[\033[01;32m\]λ\[\033[00m\] "
'';
}

This gives me the following versions:
[...] λ pip list | grep llm
llm 0.27.1
llm-anthropic 0.20
llm-gemini 0.27
llm-grok 1.3.0
llm-mistral 0.15
llm-perplexity 2025.10.0

Note that these were the versions at the time of writing this article, which is a few weeks after actually running the experiments. It is thus possible that some of these versions have changed – one of my tasks for this year is to migrate to a reproducible experiment harness for all the experiments mentioned in this blog.
With the above setup and some additional changes in
extra-openai-models.yaml, as per the llm documentation, we
can experiment with models from various providers.
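As a quick sanity check that the setup works, a model can be queried directly from Python. This is just a minimal sketch, not part of the harness; the model slug and the question are arbitrary examples:

import llm

# Minimal smoke test: ask one model a question and print the reply.
# "gpt-4o-mini" is only an example slug; any model listed by `llm models`
# after the setup above (and with the right API key configured) works.
model = llm.get_model("gpt-4o-mini")
response = model.prompt("In one sentence, what is a hexagonal grid?")
print(response.text())
print(response.usage())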
But I don’t want to run against every model name that becomes available. There are some models to which I don’t have access yet – I need to spend more with the provider to reach a higher access tier, just like I couldn’t test o1 in last year’s experiments. There are models that show up both under their canonical name and under several aliases. There are models that are already deprecated: either listed as such in the documentation, or the API outright refuses calls for that specific slug. More importantly, there are models that are not suitable for the task – they expect a different modality. There are a few Gemma models I couldn’t run due to a bug in the llm library. Finally, I also wanted to exclude some models that are clones of existing ones, very expensive, or unsuitable for other reasons. So, I have this big global in the script:
_forbidden_models = set({
# Missing access to model
"gpt-4-32k",
...
# aliases
"gpt-3.5-turbo-16k", # gpt-3.5-turbo
...
"grok-3-mini-fast-latest", # grok-3-mini-latest
# deprecated (+ aliases)
"o1-preview",
...
"gemini/gemini-2.5-pro-preview-06-05",
# Not present in the official documentation (deprecated)
"gemini/gemini-pro",
...
"gemini/gemini-2.5-flash-preview-04-17",
# Task not suitable (e.g., models are audio)
"gpt-4o-audio-preview",
...
"grok-2-vision-latest",
# Media resolution bug (https://github.com/simonw/llm-gemini/issues/116)
"gemini/gemma-3-1b-it",
"gemini/gemma-3-4b-it",
"gemini/gemma-3-12b-it",
"gemini/gemma-3-27b-it",
"gemini/gemma-3n-e4b-it",
# Other reasons
"r1-1776", # DeepSeek model, different provider
"grok-2-latest", # Old and costly
"grok-4-latest", # Very slow, and costly ($1.16 for only 4 answers in over 12 hours)
})

At this point, I am left with exactly 80 models I can run on.
The test harness uses click and rich to build a nicer interface. This lets me define command line arguments so that I can run the harness on only a subset of the models, and it shows progress in a much friendlier way – last year I was tailing log files, and I never want to do that again. Hence, the outer layer of the harness looks like the following:
import datetime
import logging
import pathlib
import time
from typing import Collection, Tuple

import click
import llm
from rich import logging as rich_logging
from rich import progress as rich_progress

_logger = logging.getLogger(__name__)
@click.command()
@click.option(
"--log_level",
default=logging.INFO,
help="Log verbosity level.",
)
@click.option(
"-p",
"--prompts",
type=click.File("r"),
help="File(s) with prompt text.",
multiple=True,
)
@click.option(
"-m",
"--models",
type=str,
metavar="MODEL",
help="Models to query.",
multiple=True,
)
@click.option(
"-n",
"--rounds",
type=int,
metavar="ROUNDS",
help="Number of rounds of querying.",
default=10,
)
def main(
log_level: int,
prompts: Collection[click.File],
models: Collection[str],
rounds: int,
) -> None:
"""The main entry point."""
_setup_logging(log_level)
_logger.info("Prompt files: %s", prompts)
_logger.info("Models asked for: %s", models)
_logger.info("Number of rounds: %d", rounds)
if not prompts:
_logger.debug("No prompt given, will just list the models..")
click.echo("Available models:")
for model in llm.get_models():
if model.model_id not in _forbidden_models:
click.echo(f"\t{model.model_id}")
raise SystemExit
if models:
models = [llm.get_model(m) for m in models]
else:
models = []
for model in llm.get_models():
if model.model_id not in _forbidden_models:
models.append(model)
_run_prompts(prompts, models, rounds)
def _setup_logging(level: int = logging.INFO) -> None:
"""Configures the global logging for the entire run."""
global _logger
handler = rich_logging.RichHandler(
show_level=False, show_time=False, show_path=False
)
handler.setFormatter(
logging.Formatter(
fmt="%(asctime)s.%(msecs)d - %(levelname).1s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
)
)
_logger.addHandler(handler)
_logger.setLevel(level)
if __name__ == "__main__":
main()

The main part is in _run_prompts. Having a list of prompts and
models, I set up the entire TUI (well, just progress bars), and then call the
proper APIs, check the answer and record it:
def _run_prompts(
prompts: Collection[click.File],
models: Collection[llm.Model],
rounds: int,
) -> None:
"""Runs all prompts on all models."""
now = datetime.datetime.today()
root_folder = pathlib.Path(now.strftime("%y%m%d-%H%M%S"))
num_model_runs = len(models) * rounds
num_tasks = len(prompts) * num_model_runs
with rich_progress.Progress(
rich_progress.TextColumn("{task.description}"),
rich_progress.TaskProgressColumn(),
rich_progress.BarColumn(None),
rich_progress.MofNCompleteColumn(),
rich_progress.TextColumn("•"),
rich_progress.TimeElapsedColumn(),
rich_progress.TextColumn("/"),
rich_progress.TimeRemainingColumn(),
) as pbar:
total_task = pbar.add_task("Total run progress", total=num_tasks)
prompt_task = pbar.add_task("placeholder", total=num_model_runs)
model_task = pbar.add_task("placeholder", total=rounds)
score_task = pbar.add_task("Model score on task", total=rounds)
for prompt in prompts:
prompt_name = _canonicalize(prompt.name)
prompt_folder = root_folder / prompt_name
pbar.reset(prompt_task, description=prompt_name)
prompt_text, expected_answer = _read_prompt(prompt)
for model in models:
model_name = _canonicalize(model.model_id)
model_folder = prompt_folder / model_name
model_folder.mkdir(parents=True)
pbar.reset(model_task, description=model_name)
pbar.reset(score_task)
correct = 0
pbar.reset(score_task)
for i in range(rounds):
out = model_folder / f"{i}"
if _get_answer(prompt_text, expected_answer, model, out):
correct += 1
pbar.update(score_task, advance=1)
pbar.update(model_task, advance=1)
pbar.update(prompt_task, advance=1)
pbar.update(total_task, advance=1)
_logger.info(
f"{correct}/{rounds} for {model_name} on {prompt_name}"
)
(model_folder / "score").write_text(f"{correct}\n")

The TUI logs the score for each model, so it is easy for me to collect the
performance metrics. I also store the score in a simple score file, to make
it easier to merge results from completed runs.
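Merging these score files back into a single summary afterwards is then straightforward. Here is a minimal sketch of the idea (collect_scores is a hypothetical helper, and the timestamped folder name is just an example of the layout produced above):

from pathlib import Path

def collect_scores(root: Path) -> dict[str, dict[str, int]]:
    """Collects the per-model scores written by the harness.

    Expects the layout <root>/<prompt>/<model>/score, where each score
    file holds a single integer, as written by _run_prompts above.
    """
    scores: dict[str, dict[str, int]] = {}
    for score_file in root.glob("*/*/score"):
        prompt_name = score_file.parent.parent.name
        model_name = score_file.parent.name
        scores.setdefault(model_name, {})[prompt_name] = int(
            score_file.read_text().strip()
        )
    return scores

# For example, collect_scores(Path("260117-093000")) would return something
# like {"gpt-5": {"prompts_square": 0, "prompts_hexagon": 10}, ...}.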
The prompts are stored in two separate files, with the last line of each file containing the expected answer. This is the prompt for the square grid problem:
The natural numbers are written on a spiral laid out on a square grid:
... 5 4 3 ...
... 6 1 2 ...
... 7 8 9 ...
What is the sum of all 8 neighbors of the number 2026?
16408
And this is the prompt for the hexagonal grid problem:
The natural numbers are written on a spiral laid out on a hexagonal grid:
... 14 4 3 10 ...
.. 15 5 1 2 9 ...
... 16 6 7 8 ...
What is the sum of all 6 neighbors of the number 2026?
12168
I then need a helper that reads these files and extracts the prompt and the expected answer as two separate values:
def _read_prompt(prompt: click.File) -> Tuple[str, int]:
"""Parses a prompt file, extracting the prompt and the answer."""
lines = [l.rstrip() for l in prompt.readlines()]
return "\n".join(lines[:-1]), int(lines[-1])Since the output of every model is stored in a directory named after the model
slug, which may contain characters that are not valid for a path, I have this
simple helper which just replaces / with _:
def _canonicalize(name: str) -> str:
"""Makes `name` be usable as a filename/task."""
return name.replace("/", "_")

The last thing left to do is to actually prompt the model and check the answer. Rather than checking the answer manually, like I did last year, I attempted to parse the answer inline and reconstruct the numbers contained in the output. The expectation was that the last number in the output is the answer that the LLM gives, but this was naive:
def _get_answer(
prompt: str,
expected: int,
model: llm.Model,
log: pathlib.Path,
) -> bool:
"""Queries `model` on `prompt`, expecting `expected`, loging to `log`."""
wait_time = 1
while True:
try:
with log.open('w') as f:
response = model.prompt(prompt)
last_number, current_number = 0, 0
for part in response:
f.write(part)
for c in part:
if c in "0123456789":
current_number = 10 * current_number + int(c)
elif current_number:
last_number, current_number = current_number, 0
if current_number:
last_number = current_number
f.write("\n----\n")
f.write(f"{response.usage()}\n")
return last_number == expected
except Exception as e:
_logger.exception(
"Model %s failed, wait %d seconds", model.model_id, wait_time
)
time.sleep(wait_time)
wait_time *= 2

There are two reasons why the above reconstruction scheme does not work. First, some models put references at the end of the output, using a [1] <link> format, or append other content that also contains numbers. More importantly, some models use a thousands separator while others don’t. In the end, the number that I get in the log from the above code is just a lower bound on how well the model performs. I still have to inspect the output files to see whether they contain the correct answer anywhere.
But, this can be done with a simple grep:
[...] λ rg '12168|12,168' results/prompts_hexagon/ -l | cut -d/ -f3 | sort | uniq -c
...

Besides doing exponential backoff in _get_answer, I also record the output of response.usage() at the end of each model’s output. This contains the number of tokens used for input, output, and thinking, as well as some additional information defined by each provider. It was mostly useful for apportioning the costs shown in the GCP interface to the individual Gemini models; this was the only case where I couldn’t get a quick overview of what each model cost, as all other providers gave me either total costs or costs broken down into input and output.
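For future runs, the inline check could be made more robust by normalizing thousands separators before looking for the expected answer. A minimal sketch of the idea (contains_answer is a hypothetical helper, not part of the harness above):

import re

def contains_answer(text: str, expected: int) -> bool:
    """Returns True if `expected` appears anywhere in `text` as a number.

    Matches both plain numbers (12168) and numbers written with a
    thousands separator (12,168), mirroring the rg pattern above.
    """
    numbers = {
        int(match.replace(",", ""))
        for match in re.findall(r"\d{1,3}(?:,\d{3})+|\d+", text)
    }
    return expected in numbers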
With this, we can now move to analyzing the results. Be prepared to be surprised.
The results 🔗
Let’s start this section with a table of how the 80 models performed on the two problems. I ran each prompt on each model a total of 10 times, and considered a response to be correct if the expected answer appeared anywhere in the output, even if the model’s final answer ended up being wrong.
| Model | square | hexagon |
|---|---|---|
| chatgpt-4o-latest | 0 | 2 |
| gpt-3.5-turbo | 0 | 0 |
| gpt-4 | 0 | 0 |
| gpt-4-0125-preview | 0 | 0 |
| gpt-4-1106-preview | 0 | 0 |
| gpt-4-turbo | 0 | 0 |
| gpt-4.1 | 0 | 3 |
| gpt-4.1-mini | 0 | 6 |
| gpt-4.1-nano | 0 | 0 |
| gpt-4o | 0 | 0 |
| gpt-4o-mini | 0 | 0 |
| gpt-5 | 0 | 10 |
| gpt-5-mini | 0 | 9 |
| gpt-5-nano | 0 | 9 |
| o1 | 0 | 4 |
| o3 | 0 | 6 |
| o3-mini | 0 | 8 |
| o4-mini | 0 | 5 |
| openai/gpt-oss-120b | 0 | 8 |
| openai/gpt-oss-120b-Turbo | 0 | 1 |
| openai/gpt-oss-20b | 0 | 3 |
| gemini/gemini-2.0-flash | 0 | 0 |
| gemini/gemini-2.0-flash-exp | 0 | 0 |
| gemini/gemini-2.0-flash-lite | 0 | 1 |
| gemini/gemini-2.5-flash | 0 | 5 |
| gemini/gemini-2.5-flash-lite-preview-09-2025 | 0 | 0 |
| gemini/gemini-2.5-flash-preview-09-2025 | 2 | 2 |
| gemini/gemini-2.5-pro | 1 | 2 |
| gemini/gemini-3-pro-preview | 0 | 10 |
| gemini/gemini-flash-latest | 0 | 4 |
| gemini/gemini-flash-lite-latest | 0 | 0 |
| google/gemma-3-12b-it | 0 | 0 |
| google/gemma-3-27b-it | 0 | 0 |
| google/gemma-3-4b-it | 0 | 0 |
| anthropic/claude-3-7-sonnet-latest | 0 | 1 |
| anthropic/claude-3-haiku-20240307 | 0 | 0 |
| anthropic/claude-haiku-4-5-20251001 | 0 | 3 |
| anthropic/claude-opus-4-0 | 0 | 2 |
| anthropic/claude-opus-4-1-20250805 | 0 | 1 |
| anthropic/claude-sonnet-4-0 | 0 | 1 |
| anthropic/claude-sonnet-4-5 | 0 | 0 |
| mistral/codestral-latest | 0 | 0 |
| mistral/ministral-3b-latest | 0 | 0 |
| mistral/ministral-8b-latest | 0 | 0 |
| mistral/mistral-large-latest | 0 | 1 |
| mistral/mistral-medium-latest | 0 | 0 |
| mistral/mistral-small-latest | 0 | 0 |
| mistral/open-mistral-nemo | 0 | 0 |
| mistral/pixtral-large-latest | 0 | 0 |
| meta-llama/Llama-3.2-3B-Instruct | 0 | 0 |
| meta-llama/Llama-3.3-70B-Instruct-Turbo | 0 | 0 |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 0 | 0 |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | 0 | 0 |
| deepseek-ai/DeepSeek-3.2 | 0 | 1 |
| deepseek-ai/DeepSeek-R1-0528 | 0 | 4 |
| deepseek-ai/DeepSeek-R1-0528-Turbo | 0 | 5 |
| deepseek-ai/DeepSeek-V3.1-Terminus | 0 | 2 |
| moonshotai/Kimi-K2-Instruct-0905 | 0 | 0 |
| moonshotai/Kimi-K2-Thinking | 0 | 10 |
| grok-3-latest | 0 | 0 |
| grok-3-mini-latest | 0 | 0 |
| grok-4-fast | 0 | 10 |
| grok-4-fast-non-reasoning-latest | 0 | 0 |
| grok-code-fast-1 | 0 | 0 |
| sonar | 0 | 1 |
| sonar-deep-research | 0 | 0 |
| sonar-pro | 0 | 0 |
| sonar-reasoning-pro | 0 | 0 |
| Qwen/Qwen3-235B-A22B-Thinking-2507 | 0 | 1 |
| Qwen/Qwen3-32B | 0 | 0 |
| Qwen/Qwen3-Next-80B-A3B-Instruct | 0 | 0 |
| microsoft/WizardLM-2-8x22B | 0 | 0 |
| microsoft/phi-4 | 0 | 0 |
| nvidia/Nemotron-3-Nano-30B-A3B | 0 | 0 |
| zai-org/GLM-4.6 | 0 | 9 |
| zai-org/GLM-4.6V | 0 | 0 |
| zai-org/GLM-4.7 | 0 | 10 |
| NousResearch/Hermes-3-Llama-3.1-405B | 0 | 0 |
| NousResearch/Hermes-3-Llama-3.1-70B | 0 | 0 |
| allenai/Olmo-3.1-32B-Instruct | 0 | 0 |
We have a huge surprise! Models performed much better on the hexagon problem than on the square one.
I first checked the expected answer for the square grid. As shown in the previous article, this should be 16408, and that was the answer that I expected from the prompt description. I triple-checked the math and it was correct.
Next, I checked whether any new models were released between running the square and the hexagon problems. I ran the harness for the square problem on and around January 17th and for the hexagon problem on and around January 20th, but I couldn’t see anything relevant, so I reran some of the prompts to check. The new results were similar.
Thus, I can conclude that the square grid problem is harder than the hexagonal one. In the next section, we’ll look at why this is the case by checking some of the outputs.
Before doing that, let’s also look at these results in a graphical form:
I only added labels for the interesting models in the plot.
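The plots themselves come from a separate notebook. Purely as an illustration of the kind of chart used here, a score plot could be produced along these lines, assuming the scores dict from the collect_scores sketch earlier (both are hypothetical helpers, not the notebook code):

import matplotlib.pyplot as plt

def plot_scores(scores: dict[str, dict[str, int]]) -> None:
    """Draws a grouped bar chart of per-model scores on the two problems."""
    names = sorted(scores)
    square = [scores[n].get("prompts_square", 0) for n in names]
    hexagon = [scores[n].get("prompts_hexagon", 0) for n in names]
    xs = range(len(names))
    fig, ax = plt.subplots(figsize=(16, 4))
    ax.bar([x - 0.2 for x in xs], square, width=0.4, label="square")
    ax.bar([x + 0.2 for x in xs], hexagon, width=0.4, label="hexagon")
    ax.set_xticks(list(xs))
    ax.set_xticklabels(names, rotation=90, fontsize=6)
    ax.set_ylabel("correct answers (out of 10)")
    ax.legend()
    fig.tight_layout()
    plt.show()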
Next, let’s also look at the costs. This time I was able to determine the cost of running each model on each problem, since there was a time gap between the two runs and I monitored the providers’ dashboards. Thus, we can build the following table:
| Model | square | hexagon | total |
|---|---|---|---|
| chatgpt-4o-latest | 0.338 | 0.234 | 0.572 |
| gpt-3.5-turbo | 0.010 | 0.004 | 0.014 |
| gpt-4 | 0.150 | 0.219 | 0.369 |
| gpt-4-0125-preview | 0.287 | 0.265 | 0.552 |
| gpt-4-1106-preview | 0.232 | 0.197 | 0.429 |
| gpt-4-turbo | 0.434 | 0.195 | 0.629 |
| gpt-4.1 | 0.426 | 0.230 | 0.656 |
| gpt-4.1-mini | 0.071 | 0.079 | 0.150 |
| gpt-4.1-nano | 0.017 | 0.014 | 0.031 |
| gpt-4o | 0.128 | 0.087 | 0.215 |
| gpt-4o-mini | 0.006 | 0.007 | 0.013 |
| gpt-5 | 0.954 | 2.393 | 3.347 |
| gpt-5-mini | 0.134 | 0.369 | 0.503 |
| gpt-5-nano | 0.048 | 0.084 | 0.132 |
| o1 | 9.082 | 10.227 | 19.309 |
| o3 | 0.736 | 2.127 | 2.863 |
| o3-mini | 0.417 | 0.544 | 0.961 |
| o4-mini | 0.316 | 0.701 | 1.017 |
| openai/gpt-oss-120b | 0.010 | 0.010 | 0.020 |
| openai/gpt-oss-120b-Turbo | 0.050 | 0.060 | 0.110 |
| openai/gpt-oss-20b | 0.020 | 0.030 | 0.050 |
| gemini/gemini-2.0-flash | 0.005 | 0.004 | 0.009 |
| gemini/gemini-2.0-flash-exp | 0.000 | 0.000 | 0.000 |
| gemini/gemini-2.0-flash-lite | 0.004 | 0.004 | 0.008 |
| gemini/gemini-2.5-flash | 0.326 | 0.792 | 1.118 |
| gemini/gemini-2.5-flash-lite-preview-09-2025 | 0.036 | 0.021 | 0.056 |
| gemini/gemini-2.5-flash-preview-09-2025 | 0.326 | 0.557 | 0.883 |
| gemini/gemini-2.5-pro | 1.848 | 2.173 | 4.021 |
| gemini/gemini-3-pro-preview | 0.131 | 0.143 | 0.273 |
| gemini/gemini-flash-latest | 0.326 | 0.495 | 0.821 |
| gemini/gemini-flash-lite-latest | 0.038 | 0.016 | 0.054 |
| google/gemma-3-12b-it | 0.000 | 0.000 | 0.000 |
| google/gemma-3-27b-it | 0.000 | 0.000 | 0.000 |
| google/gemma-3-4b-it | 0.000 | 0.000 | 0.000 |
| anthropic/claude-3-7-sonnet-latest | 0.090 | 0.110 | 0.200 |
| anthropic/claude-3-haiku-20240307 | 0.010 | 0.000 | 0.010 |
| anthropic/claude-haiku-4-5-20251001 | 0.070 | 0.060 | 0.130 |
| anthropic/claude-opus-4-0 | 1.300 | 0.790 | 2.090 |
| anthropic/claude-opus-4-1-20250805 | 1.020 | 0.850 | 1.870 |
| anthropic/claude-sonnet-4-0 | 0.200 | 0.180 | 0.380 |
| anthropic/claude-sonnet-4-5 | 0.200 | 0.190 | 0.390 |
| mistral/codestral-latest | 0.048 | 0.092 | 0.140 |
| mistral/ministral-3b-latest | 0.014 | 0.004 | 0.019 |
| mistral/ministral-8b-latest | 0.072 | 0.006 | 0.078 |
| mistral/mistral-large-latest | 0.084 | 0.053 | 0.137 |
| mistral/mistral-medium-latest | 0.105 | 0.069 | 0.175 |
| mistral/mistral-small-latest | 0.025 | 0.012 | 0.037 |
| mistral/open-mistral-nemo | 0.037 | 0.000 | 0.037 |
| mistral/pixtral-large-latest | 0.059 | 0.061 | 0.120 |
| meta-llama/Llama-3.2-3B-Instruct | 0.000 | 0.000 | 0.000 |
| meta-llama/Llama-3.3-70B-Instruct-Turbo | 0.000 | 0.010 | 0.010 |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 0.010 | 0.010 | 0.020 |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | 0.000 | 0.010 | 0.010 |
| deepseek-ai/DeepSeek-3.2 | 0.020 | 0.020 | 0.040 |
| deepseek-ai/DeepSeek-R1-0528 | 0.520 | 0.730 | 1.250 |
| deepseek-ai/DeepSeek-R1-0528-Turbo | 0.630 | 0.880 | 1.510 |
| deepseek-ai/DeepSeek-V3.1-Terminus | 0.030 | 0.030 | 0.060 |
| moonshotai/Kimi-K2-Instruct-0905 | 0.070 | 0.060 | 0.130 |
| moonshotai/Kimi-K2-Thinking | 0.610 | 1.250 | 1.860 |
| grok-3-latest | 0.467 | 0.204 | 0.671 |
| grok-3-mini-latest | 0.036 | 0.025 | 0.061 |
| grok-4-fast | 0.048 | 0.084 | 0.132 |
| grok-4-fast-non-reasoning-latest | 0.005 | 0.001 | 0.006 |
| grok-code-fast-1 | 0.795 | 0.998 | 1.794 |
| sonar | 0.060 | 0.050 | 0.110 |
| sonar-deep-research | 11.950 | 11.750 | 23.700 |
| sonar-pro | 0.140 | 0.180 | 0.320 |
| sonar-reasoning-pro | 0.210 | 0.150 | 0.360 |
| Qwen/Qwen3-235B-A22B-Thinking-2507 | 0.640 | 0.780 | 1.420 |
| Qwen/Qwen3-32B | 0.050 | 0.040 | 0.090 |
| Qwen/Qwen3-Next-80B-A3B-Instruct | 0.090 | 0.100 | 0.190 |
| microsoft/WizardLM-2-8x22B | 0.020 | 0.000 | 0.020 |
| microsoft/phi-4 | 0.000 | 0.000 | 0.000 |
| nvidia/Nemotron-3-Nano-30B-A3B | 0.500 | 0.550 | 1.050 |
| zai-org/GLM-4.6 | 0.370 | 1.120 | 1.490 |
| zai-org/GLM-4.6V | 0.150 | 0.140 | 0.290 |
| zai-org/GLM-4.7 | 0.480 | 1.260 | 1.740 |
| NousResearch/Hermes-3-Llama-3.1-405B | 0.010 | 0.000 | 0.010 |
| NousResearch/Hermes-3-Llama-3.1-70B | 0.000 | 0.010 | 0.010 |
| allenai/Olmo-3.1-32B-Instruct | 0.070 | 0.080 | 0.150 |
We can turn these numbers into graphs showing how each model performed. First, looking at just the square problem:
Next, looking at just the hexagon problem:
Finally, we can build a graph where we have the total cost and the average performance on the two problems.
We can also look at the Pareto frontier:
As a list, the models that are on the frontier are:
- grok-4-fast (score: 10.0, cost: $0.132)
- openai/gpt-oss-120b (score: 8.0, cost: $0.02)
- gemini/gemini-2.0-flash-lite (score: 1.0, cost: $0.008)
Grok gets a perfect score on the hexagon problem, while the Gemini models are very cheap. I plan to run more experiments on these problems in the future, to get more data here.
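For completeness, here is roughly how such a frontier can be computed from the data above. This is only a sketch, not the notebook code, and it assumes a results dict mapping each model name to a (total_cost, score) pair:

def pareto_frontier(results: dict[str, tuple[float, float]]) -> list[str]:
    """Returns the models for which no other model is at least as cheap and strictly better."""
    frontier: list[str] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive (ties broken by higher score first)
    # and keep a model only if it beats the best score seen so far.
    for name, (cost, score) in sorted(
        results.items(), key=lambda kv: (kv[1][0], -kv[1][1])
    ):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier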
Looking at some LLM outputs 🔗
Overall, the experiments produced quite a lot of output:
[...] λ wc prompts_square/*/* | grep total
190481 1411059 7482216 total
[...] λ wc prompts_hexagon/*/* | grep total
169599 1245219 6745013 total

Rather than going through all of the files and listing mistakes from each, I sampled a significant number of them and analyzed those outputs.
Looking at the square problem, most models get the idea that in order to solve it they need to locate the ring that 2026 sits on. They understand that they need to determine the coordinates of the corners of the rings – some look at the highest number on each ring, some look at the sizes of the rings. But when the models get to determining the neighbors of 2026, they run into trouble:
- (22, -21) → 2024
- (23, -21) → 2207
- (24, -21) → 2398
- (22, -22) → 2025
- (24, -22) → 2399
- (22, -23) → 2026
- (23, -23) → 2025
- (24, -23) → 2209
Even when not listing the coordinates, the neighbors are incorrect:
Now list the 8 neighbor values:
- 2212
- 2211
- 2210
- 2027
- 2209
- 1850
- 1849
- 1848
In some cases, the model attempts to build the grid, but still gets it wrong:
2034 2033 2032 2027 2026 2031 2028 2029 2030
A few models make a math mistake well before this point, operating with numbers that they shouldn’t have:
Corner numbers:
- Top left corner = 522^2 - 21 * 1 = 271441 - 21 = 271420
- Top right corner = 522^2 - 21 * 3 = 271441 - 63 = 271378
- Bottom left corner = 522^2 - 21 * 5 = 271441 - 105 = 271336
- Bottom right corner = 522^2 - 21 * 7 = 271441 - 147 = 271294
Midpoint numbers:
- Directly above = 271441 - 42 = 271399
- Directly below = 271441 + 42 = 271483
- Directly left = 271441 - 84 = 271357
- Directly right = 271441 + 84 = 271525
There are also models that don’t use the grid as given in the prompt. They either add a new number and change the orientation:
..22 21 20 19 18 23.. 6 1 2 17 24.. 7 0 3 16 25.. 8 9 4 15 26 27 10 11 12 13
or don’t understand the grid concept:
... 5 4 3 ... ... 6 1 2 ... ... 7 8 9 ... ... 12 11 10 ... ... 13 14 15 ...
Some even forget that we are talking about a square grid:
1 2 3 4 5 6 7 8 9 10
Finally, some models hallucinate an entire neighborhood around 2026:
... 1992 1991 1990 1989 1988 1987 ... 1993 1744 1743 1742 1741 1740 1986 ... 1994 1745 1500 1499 1498 1739 1985 ... 1995 1746 1501 1264 1263 1497 1984 ... 1996 1747 1502 1265 2026 1262 1983 ... 1997 1748 1503 1266 1267 1496 1982 ... 1998 1749 1504 1505 1506 1495 1981
Moving on to the hexagonal grid problem, we see similar patterns. The models are able to find the patterns for the corners of the rings and then locate the position of 2026. There are still mistakes in positioning the numbers, although they are less frequent:
2027 2028 ... 2025 2026 2010 2019 2020 2021 2022 2018 2012 2011 2015 2017 2016 2025 2024 ...
There are still cases when the model hallucinates the starting grid:
15 16 17 18 ... 4 1 2 19 20 ... 14 5 2026 7 8 ... 13 6 3 9 10 ... 12 11 10 ...
And sometimes they get into even crazier outputs:
28 27 26 25 24 1 3 5 7 9 23 2 4 6 8 10 22 12 14 16 18 20 11 13 15 17 19 21 30 32 34 36 38 40 29 31 33 35 37 2021 2019 2017 2015 2022 2020 2018 2016 2024 2026 2030 2023 2025 2027 2028 2032 2029 2031
Looking at the helper code that they generate, some models are able to generate code using 3D coordinates for the hexagonal grid:
DIRECTIONS = [
(1, -1, 0), # East
(1, 0, -1), # NE
(0, 1, -1), # NW
(-1, 1, 0), # West
(-1, 0, 1), # SW
(0, -1, 1), # SE
]

One thing the models tried to do was to determine a formula for the sum:
Let \(n\) be a number in the hexagonal grid. Then the sum of its 6 neighbors is \(6n - 6 + 6 = 6n\). The sum of the 6 neighbors of \(n\) is \(6n - 6\).
\(6 \times 2026 - 6 = 12156 - 6 = 12150\).
However, this formula is not correct. In several cases, though, the models did reach the correct one:
Alternatively, for a number \(N\) on a side of the hexagon (and not a corner), the sum of its 6 neighbors is \(6N + 12\).
\(6(2026) + 12 = 12156 + 12 = 12168\).
I haven’t sampled an answer for the square problem that does something similar, and this might be one reason why the hexagonal grid problem seems easier to solve. It is also possible that no formula showed up in the square case because any formula a model might derive is not valid near the corners.
Another possible reason why the hexagonal grid was easier to solve might be that I included more numbers in that prompt than in the square one. I’m curious how much this matters.
One more reason why the hexagon problem is easier to solve is that the models are cheating: they are able to look up OEIS sequences:
Searching OEIS or known data, we find:
This is a more common occurrence for the hexagonal grid than for the square one. Under the assumption that models that don’t look up the answer make mistakes during computation, this could be another reason why the hexagon problem has more correct solvers.
Note though that some of the sequences looked up are irrelevant: some are for modular arithmetic, some involve matrices on the dihedral group \(D_4\) and so on.
However, while the models are able to find the answer for the hexagonal grid this way, they have some issues on the square one. There are two important sequences that the models discovered, A174344 and A274923. The first one rotates in the opposite direction, and both of them start with a 0 instead of a 1.
At first glance, rotating the grid might prevent the models from cheating. However, a rotation does not change which numbers are neighbors of which, so the answers would stay the same; even if a model cheats, it will still get a correct answer. Introducing a 0 at the start might change things, but before testing that I’ll have to check whether there really are no hexagonal grid sequences in OEIS that start with a 0 (and no square grid sequences that start from 1).
Conclusions 🔗
I got many unexpected results while running this experiment. First, I thought that the hexagon problem would be harder, but models scored better on it. Then, models that people claimed to be better turned out not to score as high.
More importantly, I got models that cheated, as well as models that found that the answer can be given by a simple formula, after checking that the formula applies.
I have more experiments to run, but those are for a later time. Until then, I’m publishing the entire code harness, the results, and the notebook used to generate the plots on my demos repo. Right now, this is the first entry there, but I intend to add more.