Benchmarks (Full Data)

Every number on the benchmarks page comes from the runs below. This page has the complete per-run data, failure analysis, and generation results across all 10 models and 3 providers. All raw logs are in the eval/results directory.

Comprehension and Generation


Comprehension: All Runs

[Figure: Comprehension Accuracy by Model]

500 symbols, 200 edges, 13 structured extraction questions, zero format instructions. Each run generates a fresh random payload with different symbol names and edge distributions, so variance across runs reflects the model's actual comprehension rather than memorization of a fixed dataset.

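| Model | Run | GCF | TOON | JSON | GCF wins? |
|---|---|---|---|---|---|
| Claude Opus 4.6 | 1 | 100% | 92.3% | 76.9% | yes |
| Claude Opus 4.6 | 2 | 92.3% | 76.9% | 69.2% | yes |
| Claude Sonnet 4.6 | 1 | 100% | 76.9% | 53.8% | yes |
| Claude Sonnet 4.6 | 2 | 100% | 69.2% | 53.8% | yes |
| Claude Haiku 4.5 | 1 | 92.3% | 69.2% | 61.5% | yes |
| Claude Haiku 4.5 | 2 | 100% | 69.2% | 53.8% | yes |
| GPT-5.5 | 1 | 91.7% | 66.7% | 50.0% | yes |
| GPT-5.5 | 2 | 76.9% | 69.2% | 46.2% | yes |
| GPT-5.5 | 3 | 76.9% | 69.2% | 46.2% | yes |
| GPT-5.5 | 4 | 91.7% | 66.7% | 50.0% | yes |
| GPT-5.5 | 5 | 83.3% | 66.7% | 36.4% | yes |
| GPT-5.4 | 1 | 75.0% | 58.3% | 41.7% | yes |
| GPT-5.4 | 2 | 76.9% | 53.8% | 46.2% | yes |
| GPT-5.4 | 3 | 76.9% | 53.8% | 38.5% | yes |
| GPT-5.4 | 4 | 83.3% | 58.3% | 50.0% | yes |
| GPT-5.4-mini | 1 | 76.9% | 61.5% | 58.3% | yes |
| GPT-5.4-mini | 2 | 66.7% | 66.7% | 50.0% | tied |
| Gemini 2.5 Flash | 1 | 76.9% | 58.3% | 53.8% | yes |
| Gemini 2.5 Flash | 2 | 75.0% | 50.0% | 57.1% | yes |
| Gemini 2.5 Flash | 3 | 90.0% | 55.6% | 60.0% | yes |
| Gemini 3.5 Flash | 1 | 100% | 61.5% | 46.2% | yes |
| Gemini 2.5 Pro | 1 | 100% | 76.9% | 58.3% | yes |
| Gemini 3.1 Pro | 1 | 100% | 76.9% | 46.2% | yes |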

23 runs, 10 models, 3 providers. GCF wins 22, ties 1, loses 0. Four models score 100% on every run: Claude Sonnet 4.6 (2 runs) and Gemini 2.5 Pro, Gemini 3.1 Pro, and Gemini 3.5 Flash (1 run each).

Score variance

GCF scores cluster tightly across runs. TOON and JSON scatter widely, especially on weaker models. This matters for production use: a format that scores 70% on one run and 40% on the next is unreliable regardless of its average.

[Figure: Comprehension Variance]

GCF advantage by model tier

GCF's margin over TOON and JSON grows as model capability decreases. Frontier models (Opus, Gemini Pro) can partially brute-force flat data through sheer capacity. Smaller models (GPT-5.4-mini, Gemini 2.5 Flash) cannot, and the gap widens to 20-30 percentage points. If you're optimizing for cost by using smaller models, format choice matters more, not less.

[Figure: GCF Advantage by Tier]

Token cost vs accuracy

GCF occupies the top-left quadrant: fewest tokens, highest accuracy. JSON occupies the bottom-right: most tokens, lowest accuracy. You usually trade cost for quality. GCF breaks that tradeoff.

[Figure: Token Cost vs Accuracy]


Failure Taxonomy

Every wrong answer across all 23 runs was classified by failure type. The pattern is consistent: GCF fails differently than TOON and JSON. GCF errors are small (typically off by 1-7) because the model understood the structure but misread a number. TOON and JSON errors are large (off by 50-140) because the model couldn't extract the answer at all and guessed.

[Figure: Error Magnitude]

GCF median error: 4. TOON median error: 53. JSON median error: 56. GCF encodes answers structurally (## related [167]). TOON/JSON force the model to compute them from raw data. The difference between "slightly misread a header" and "couldn't comprehend the data" is the difference between a useful agent and a broken one.
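For readers who want to recompute an error magnitude themselves, here is a minimal sketch of a median absolute error. The function name `medianAbsError` is illustrative, and the input is only the handful of TOON distance-grouping answers quoted in the per-model failure table on this page (all against an expected count of 166), so it is a small sample and is not expected to reproduce the reported medians exactly.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// medianAbsError returns the median of |answer - expected| over all answers.
// It returns 0 for an empty slice.
func medianAbsError(answers []float64, expected float64) float64 {
	if len(answers) == 0 {
		return 0
	}
	errs := make([]float64, len(answers))
	for i, a := range answers {
		errs[i] = math.Abs(a - expected)
	}
	sort.Float64s(errs)
	mid := len(errs) / 2
	if len(errs)%2 == 1 {
		return errs[mid]
	}
	// Even length: average the two middle values.
	return (errs[mid-1] + errs[mid]) / 2
}

func main() {
	// Quoted TOON answers from Haiku 4.5, GPT-5.4, and GPT-5.4-mini.
	toon := []float64{100, 200, 214, 169, 229, 200, 26, 28}
	fmt.Println(medianAbsError(toon, 166)) // prints 55.5
}
```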

GCF failures: precision errors

GCF fails on precision (off by 1-7). The structure is understood; the count is slightly misread. 36 total failures across 23 runs.

| Type | Count | Models | Cause |
|---|---|---|---|
| Off-by-1-2 header misread | 8 | Haiku (1), GPT-5.4 (3), mini (1), Gemini (3) | Header says [167], model reads 166. Tokenization artifact. |
| Column scan miscount | 11 | GPT-5.4 (5), mini (1), Gemini (5) | Must scan fn kind across rows. function_count=84 deterministically on GPT-5.4. |
| Field confusion | 2 | GPT-5.4 (1), mini (1) | Read symbol count instead of edge count. |
| Miscellaneous miscount | 5 | GPT-5.4 (2), Gemini (3) | edge_count, calls_edge_count off by larger margins. |
| Empty response | 10 | GPT-5.5 (10) | Context overwhelm at 53k+ input tokens. |

TOON failures: comprehension errors

TOON fails on comprehension (wrong by 50-140). The model cannot filter a flat list by column value at scale. 94 total failures across 23 runs.

| Type | Count | Models | Cause |
|---|---|---|---|
| Distance grouping failure | 45 | Opus/Sonnet (3), Haiku (6), GPT-5.4 (11), mini (5), Gemini (20) | Must scan 500 rows and filter by distance column. Wildly inconsistent answers. |
| Column scan miscount | 10 | Haiku (1), GPT-5.4 (4), mini (4), Gemini (1) | function_count wrong. Must scan all 500 rows by kind. |
| Attention decay (last row) | 7 | Opus/Sonnet (1), Haiku (3), GPT-5.4 (3) | last_symbol_kind wrong. Loses track at row 500. |
| Calls edge miscount | 10 | Opus/Sonnet (1), GPT-5.4 (4), mini (2), Gemini (3) | calls_edge_count wrong. Must scan edges and filter by type. |
| Symbol count wrong | 2 | Gemini (2) | Undercounts total symbols (250, 400 vs 500). |
| Empty response | 20 | GPT-5.5 (20) | Context overwhelm. Same as JSON. |

JSON failures: structural overwhelm

JSON fails on structure (empty responses, massive undercounts, chain-of-thought enumeration). The format itself prevents comprehension at scale. 131 total failures across 23 runs.

| Type | Count | Models | Cause |
|---|---|---|---|
| Empty string response | 33 | GPT-5.5 (33) | 53k tokens of repeated {"qualifiedName":...} overwhelms attention. |
| Massive undercount | 14 | Opus/Sonnet (2), Haiku (2), GPT-5.4 (4), mini (1), Gemini (5) | Field-name repetition dilutes signal. |
| Distance filter failure | 44 | Opus/Sonnet (7), Haiku (6), GPT-5.4 (11), mini (5), Gemini (15) | Must parse JSON objects AND filter by field value. |
| Column scan miscount | 37 | Opus/Sonnet (4), Haiku (3), GPT-5.4 (8), mini (4), Gemini (18) | edge_count, function_count, calls_edge_count wrong. |
| Attention decay (last row) | 3 | GPT-5.4 (2), Gemini (1) | last_symbol_kind reads edge type instead of kind. |

Failure distribution by format

JSON accounts for the most failures overall, driven by GPT-5.5's complete inability to respond (33 empty strings) and universal distance-filtering failures. TOON's failures concentrate on distance grouping and round-number guessing. GCF's failures are sparse and small.

[Figure: Failure Types (Pie)]

Failures by model tier

Each model tier has a distinct failure signature. Opus/Sonnet never fail on GCF. GPT-5.5 fails on all formats due to context overwhelm at 53k tokens. GPT-5.4's GCF errors are deterministic (same wrong number every run), suggesting a tokenizer-level parsing difference rather than a comprehension problem.

[Figure: Failure Types by Model]

| Model | GCF failure mode | TOON failure mode | JSON failure mode |
|---|---|---|---|
| Opus/Sonnet | None | Off-by-2 extended_count; last_symbol_kind wrong (attention decay at row 500) | Undercounts (356 vs 500); 143-line chain-of-thought enumeration, still wrong answer |
| Haiku 4.5 | Off-by-1 (1 of 2 runs) | Distance grouping (100, 200, 214 vs 166); last_symbol_kind wrong | Undercounts; distance filter failures |
| GPT-5.5 | Empty strings (context overwhelm at 53k input tokens) | Empty strings; distance grouping failures | Returns nothing on most questions (53k tokens of repeated field names overwhelms attention) |
| GPT-5.4 | Deterministic: edge_count=198, function_count=84 every run | Distance grouping wildly inconsistent (169, 229, 200 vs 166); round-number guessing | symbol_count 326-404; massive undercounts everywhere |
| GPT-5.4-mini | Same as 5.4 (198, 84) plus larger misses (250, 100) | Worst distance grouping (26, 28 vs 166); defaults to round-number guessing | 300 vs 500 symbol_count; consistent failure across all question types |

GCF failures on Claude are near-zero. On OpenAI models they are repeatable: GPT-5.4 and GPT-5.4-mini return the same wrong values (edge_count=198, function_count=84) run after run.

Artifacts: What JSON failure looks like in practice

When asked "how many symbols have distance 1 (related)?", the answer is 167. GCF encodes it in the section header: ## related [167]. The model reads it directly.

JSON has no structural grouping. The model must scan 500 JSON objects, filter by "Distance": 1, and count. On two separate runs, Claude Opus, a frontier model, responded by manually enumerating every symbol:

Run 1 (full artifact):

"Let me count precisely by going through the list:
1. handler.Response.Notify
2. model.SubscribeConfig
3. service.PublishOptions
...
143. store.DispatchConfig

So: 143."

143 lines of output tokens. Wrong answer (expected 167).

Run 2 (full artifact):

"Let me count systematically. The symbols list transitions from Distance 0 to Distance 1 at handler.Response.Notify...
1. handler.Response.Notify
2. model.SubscribeConfig
...
119. store.DispatchConfig
120. cache.ExecuteOptions -- wait, this is Distance 2.

So: 119."

119 lines. Wrong again (expected 167). Different random payload, same failure mode. The model even caught itself mid-count ("wait, this is Distance 2") and still got it wrong.

This is JSON's structural problem: it forces LLMs to perform manual enumeration at scale, burning output tokens on a task the format should have answered structurally. GCF answers the same question by reading a three-digit count from a section header.
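The same asymmetry shows up for a program reading either format. The sketch below contrasts the two paths under stated assumptions: the only GCF syntax used is the `## related [167]` header shape quoted on this page (the rest of the grammar is not assumed, and this is not the `gcf-go` decoder), the JSON field names are the two quoted on this page, and `countFromHeader` and `countByScan` are illustrative names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// headerRe matches a section header of the shape "## related [167]".
var headerRe = regexp.MustCompile(`^## (\w+) \[(\d+)\]$`)

// countFromHeader returns the bracketed count for the named section.
func countFromHeader(doc, section string) (int, bool) {
	for _, line := range strings.Split(doc, "\n") {
		m := headerRe.FindStringSubmatch(strings.TrimSpace(line))
		if m != nil && m[1] == section {
			n, err := strconv.Atoi(m[2])
			return n, err == nil
		}
	}
	return 0, false
}

// symbol mirrors the two JSON fields quoted on this page.
type symbol struct {
	QualifiedName string `json:"qualifiedName"`
	Distance      int    `json:"Distance"`
}

// countByScan decodes every object and filters by distance, which is the
// work a flat JSON payload pushes onto the reader.
func countByScan(raw []byte, distance int) (int, error) {
	var syms []symbol
	if err := json.Unmarshal(raw, &syms); err != nil {
		return 0, err
	}
	n := 0
	for _, s := range syms {
		if s.Distance == distance {
			n++
		}
	}
	return n, nil
}

func main() {
	n, ok := countFromHeader("## related [167]", "related")
	fmt.Println(n, ok)

	// Two-object stand-in for the 500-object payload.
	raw := []byte(`[{"qualifiedName":"handler.Response.Notify","Distance":1},
		{"qualifiedName":"cache.ExecuteOptions","Distance":2}]`)
	c, err := countByScan(raw, 1)
	fmt.Println(c, err)
}
```

The header path touches one line regardless of payload size; the scan path is linear in the number of objects, which is the part the models above attempted by hand.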


Generation: All Runs

Comprehension measures whether a model can read a format. Generation measures whether it can write one. A format that's readable but not writable (or vice versa) is only half useful. Agent-to-agent communication requires both directions.

GCF validity across all models

GCF achieves 5/5 valid output on seven of the nine models in the table below and 4-5/5 on the other two (GPT-5.5 and Gemini 3.1 Flash Lite), with zero prior training. The format didn't exist before we built it, yet models produce decoder-parseable output on first exposure with a 3-line primer.

[Figure: Generation Validity]

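| Model | 5 sym | 10 sym | 20 sym | 50 sym | 100 sym | Score | Runs |
|---|---|---|---|---|---|---|---|
| Claude Opus 4.6 | YES | YES | YES | YES | YES | 5/5 | 2 (zero variance) |
| Claude Sonnet 4.6 | YES | YES | YES | YES | YES | 5/5 | 2 |
| Claude Haiku 4.5 | YES | YES | YES | YES | YES | 5/5 | 2 |
| GPT-5.5 | YES | YES | YES | YES | 4-5/5 | 4-5/5 | 2 |
| GPT-5.4 | YES | YES | YES | YES | YES | 5/5 | 1 |
| GPT-5.4-mini | YES | YES | YES | YES | YES | 5/5 | 2 (zero variance) |
| Gemini 2.5 Pro | YES | YES | YES | YES | YES | 5/5 | 2 (zero variance) |
| Gemini 3.1 Pro | YES | YES | YES | YES | YES | 5/5 | 1 |
| Gemini 3.1 Flash Lite | YES | YES | YES | YES | 4-5/5 | 4-5/5 | 3 |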

Three-way comparison

Same data, same prompt structure per format. GCF and JSON use natural-language descriptions ("this symbol is a target"). TOON uses the same natural descriptions, not pre-encoded integers. This is the fair comparison: what happens when you give a model real-world input and ask it to produce structured output?

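| Model | GCF | TOON (natural) | JSON | Runs |
|---|---|---|---|---|
| Claude Opus 4.6 | 5/5 | 0/5 | 5/5 | 2 (zero variance) |
| Claude Sonnet 4.6 | 5/5 | 2-3/5 | 5/5 | 2 |
| Claude Haiku 4.5 | 5/5 | 1-3/5 | 5/5 | 2 |
| GPT-5.5 | 4-5/5 | 1-2/5 | 5/5 | 2 |
| GPT-5.4 | 5/5 | 0/5 | 5/5 | 1 |
| GPT-5.4-mini | 5/5 | 0/5 | 5/5 | 2 (zero variance) |
| Gemini 2.5 Pro | 5/5 | 1/5 | 5/5 | 2 (zero variance) |
| Gemini 3.1 Pro | 5/5 | 0/5 | 5/5 | 1 |
| Gemini 3.5 Flash | 3/5 | 1/5 | 3/5 | 1 |
| Gemini 3.1 Flash Lite | 4-5/5 | 0/5 | 4-5/5 | 3 |
| Gemini 2.5 Flash | 2-3/5 | 0-4/5 | 0-3/5 | 3 (output truncation) |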

Why TOON fails generation

Every TOON generation failure produces the same error: toon: cannot assign string to int. The model writes target in the distance column. TOON expects 0. The model would need to know, unprompted, that "target" maps to 0, "related" maps to 1, "extended" maps to 2. No model does this because the format gives no structural cue for when a column requires an integer vs a string.

[Figure: The Distance Label Problem]

TOON generation heatmap

TOON failures concentrate at larger payload sizes. Models that pass at 5 symbols often fail at 10 or more, because every additional symbol is another chance to write a distance label where TOON expects an integer. The heatmap shows which models fail at which sizes.

[Figure: TOON Heatmap]

TOON is a fundamentally fragile format

TOON requires special handling by the caller to produce valid results. When given the same natural-language description that GCF and JSON handle without issue, TOON's official decoder rejects the output on 7 of 9 models. The format's flat tabular design encodes semantic categories as integers, forcing an encoding step that no model performs unprompted. This isn't a prompt engineering problem; it's a structural design flaw.

When we explicitly pre-encode distances as integers in the prompt (hand-holding the model through TOON's internal mapping), performance improves on some models but remains inconsistent. Even in the best case, TOON output is 39% larger than GCF (8,336 B vs 5,984 B at 100 symbols).

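| Format | Prompt | Valid | 100 sym output | vs JSON |
|---|---|---|---|---|
| GCF | natural labels | 5/5 | 5,984 B | 78% fewer |
| TOON | hand-held (integers) | 5/5 | 8,336 B | 69% fewer |
| TOON | natural labels | 0/5 | - | - |
| JSON | natural labels | 5/5 | 16,121 B | baseline |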

GCF is robust. It works with natural-language descriptions, pre-encoded values, and everything in between. The format aligns with how models naturally express grouped data. TOON requires the caller to know its internal encoding and pre-process every categorical field before the model can write valid output. Any time a column encodes a semantic category as an integer, TOON is one prompt change away from producing invalid data.

Output size at scale

Generation cost compounds over a session. Every tool response an agent produces costs output tokens. At 100 symbols, GCF output is 5,984 bytes vs 16,121 for JSON (63% fewer) and 8,336 for TOON with hand-holding (28% fewer). Over a 10-call agent session, this adds up.
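The byte percentages in the paragraph above follow from the three reported sizes. The sketch below recomputes them and extends them to a session, assuming (hypothetically) that every call in a 10-call session returns a 100-symbol payload. These are byte ratios only; the "vs JSON" column in the earlier table may be measured in another unit, such as tokens, and is not reproduced here. `pctFewer` and `sessionSavings` are illustrative names.

```go
package main

import "fmt"

// Output sizes at 100 symbols, in bytes, as reported on this page.
const (
	gcfBytes  = 5984
	toonBytes = 8336 // TOON with integers pre-encoded in the prompt
	jsonBytes = 16121
)

// pctFewer returns how much smaller a is than b, as a percentage of b.
func pctFewer(a, b int) float64 {
	return 100 * float64(b-a) / float64(b)
}

// sessionSavings returns the bytes saved over a number of calls by
// emitting a bytes per call instead of b bytes per call.
func sessionSavings(a, b, calls int) int {
	return (b - a) * calls
}

func main() {
	fmt.Printf("GCF vs JSON: %.0f%% fewer bytes\n", pctFewer(gcfBytes, jsonBytes))
	fmt.Printf("GCF vs TOON: %.0f%% fewer bytes\n", pctFewer(gcfBytes, toonBytes))
	fmt.Printf("10 calls, GCF instead of JSON: %d bytes saved\n",
		sessionSavings(gcfBytes, jsonBytes, 10))
}
```

Under that assumption the session-level gap between GCF and JSON is 101,370 bytes of output.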

[Figure: Output Cost at Scale]


Methodology

The eval was designed to be deterministic, reproducible, and resistant to gaming.

  • Scale: 500 symbols, 200 edges for comprehension; 5-100 symbols for generation. 500 records is the threshold where format differences become visible. At 8 records, everything works.
  • Ground truth: 13 extraction questions with deterministic answers computed from the payload. No LLM judge. The correct answer to "how many symbols?" is always exactly the number generated.
  • Randomization: Each run generates a fresh random payload with different symbol names and edge distributions. Scores reflect comprehension of the format, not memorization of a fixed dataset.
  • Temperature: OpenAI runs used default temperature (non-zero) to reflect real-world usage. EVAL_TEMPERATURE=0 is available for deterministic runs.
  • Backends: Claude evals via claude -p CLI with --model flag. OpenAI evals via chat completions API with exponential backoff on 429s. Google evals via generativelanguage API with retry logic.
  • Validation: GCF validated through gcf.Decode(). TOON validated through the official toon-go library. JSON validated through json.Unmarshal(). All decoders are real implementations, not regex checks.
  • Logs: All raw logs in eval/results. Every run is committed.
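The ground-truth and randomization bullets above can be pictured with a small sketch. It is not the eval's code: the `symbol` type, the kind values other than `fn`, the seed handling, and the function names are all assumptions made for illustration. The question names (`symbol_count`, `function_count`, `last_symbol_kind`) are taken from this page. The point is that expected answers are computed from the generated payload and compared by exact match, with no LLM judge in the loop.

```go
package main

import (
	"fmt"
	"math/rand"
	"strconv"
)

// symbol is a hypothetical payload record.
type symbol struct {
	name string
	kind string // "fn" appears on this page; the other kinds are made up
}

// generate builds n symbols from a seeded source so a run can be replayed.
func generate(seed int64, n int) []symbol {
	rng := rand.New(rand.NewSource(seed))
	kinds := []string{"fn", "type", "const"}
	syms := make([]symbol, n)
	for i := range syms {
		syms[i] = symbol{
			name: "sym" + strconv.Itoa(rng.Intn(1_000_000)),
			kind: kinds[rng.Intn(len(kinds))],
		}
	}
	return syms
}

// groundTruth computes the expected answers directly from the payload.
func groundTruth(syms []symbol) map[string]string {
	fns, last := 0, ""
	for _, s := range syms {
		if s.kind == "fn" {
			fns++
		}
		last = s.kind
	}
	return map[string]string{
		"symbol_count":     strconv.Itoa(len(syms)),
		"function_count":   strconv.Itoa(fns),
		"last_symbol_kind": last,
	}
}

// score counts how many answers exactly match the ground truth.
func score(truth, answers map[string]string) (correct, total int) {
	for q, want := range truth {
		total++
		if answers[q] == want {
			correct++
		}
	}
	return correct, total
}

func main() {
	truth := groundTruth(generate(42, 500))
	answers := map[string]string{
		"symbol_count":     "400", // a deliberately wrong answer
		"function_count":   truth["function_count"],
		"last_symbol_kind": truth["last_symbol_kind"],
	}
	correct, total := score(truth, answers)
	fmt.Printf("%d/%d correct\n", correct, total)
}
```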

Reproduce

```bash
git clone https://github.com/blackwell-systems/gcf-go
cd gcf-go/eval

# Comprehension
GOWORK=off go test -run TestComprehension -v -timeout 0
EVAL_BACKEND=openai OPENAI_API_KEY=... EVAL_MODEL=gpt-5.5 GOWORK=off go test -run TestComprehension -v -timeout 0
EVAL_BACKEND=google GOOGLE_API_KEY=... EVAL_MODEL=gemini-2.5-flash GOWORK=off go test -run TestComprehension -v -timeout 0

# Generation (all three formats)
GOWORK=off go test -run "TestGeneration$|TestGenerationTOON|TestGenerationJSON" -v -timeout 0

# Token efficiency (TOON's benchmark)
git clone https://github.com/blackwell-systems/toon.git
cd toon && git checkout gcf-comparison
cd benchmarks && pnpm install && pnpm benchmark:tokens
```