Benchmarks

1,300+ LLM evaluations across 10 models, 3 providers, and 51 independent test runs.

No model has ever been trained on GCF.

Every model tested reads it more accurately than TOON or JSON, and writes it about as reliably as JSON while using fewer output tokens.

| | GCF | TOON | JSON |
|---|---|---|---|
| Comprehension (23 runs, 10 models) | 90.5% | 68.5% | 53.6% |
| Generation (28 runs, 9 models) | 5/5 | 1.0/5 | 5.0/5 |
| Input tokens (500 symbols) | 11,090 | 16,378 | 53,341 |
| Output tokens (100 symbols) | 5,976 | 8,937 | 16,121 |

Three benchmark suites, three providers (Anthropic, OpenAI, Google), zero training:

  1. Comprehension eval: Can models extract information from a format? 500 symbols, 13 questions, 23 runs across 10 models.
  2. Generation eval: Can models produce valid output in a format? 3-line primer, 28 runs across 9 models.
  3. Token efficiency (TOON's benchmark): How many tokens does each format cost? This is TOON's own benchmark, forked with no changes except GCF added as one additional formatter.

All results reproducible.

[Figure: Comprehension and Generation]


Comprehension: Can LLMs Read It?

500 symbols, 200 edges, zero format instructions. Each run generates a fresh random payload. The model receives the payload in each format and answers 13 extraction questions:

| # | Category | Question |
|---|---|---|
| 1 | Counting | How many symbols? |
| 2 | Counting | How many edges? |
| 3 | Counting | How many targets (distance 0)? |
| 4 | Counting | How many related (distance 1)? |
| 5 | Counting | How many extended (distance 2)? |
| 6 | Counting | How many functions? |
| 7 | Counting | How many 'calls' edges? |
| 8 | Extraction | Highest-scored symbol name? |
| 9 | Extraction | Kind of highest-scored symbol? |
| 10 | Extraction | Kind of last symbol? |
| 11 | Extraction | All unique edge types? |
| 12 | Structure | Does it have an edges section? |
| 13 | Structure | What is the tool name? |

All answers are deterministic (computed from the payload). No LLM judge.
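As a rough illustration of what "computed from the payload" can mean, the sketch below derives two expected answers from an in-memory payload and grades a reply by exact match. The `Symbol` type and all function names are hypothetical and chosen for this example; the eval's real types and grading rules live in gcf-go/eval and may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Symbol is a hypothetical stand-in for one payload entry.
type Symbol struct {
	Name     string
	Kind     string
	Score    float64
	Distance int // 0 = target, 1 = related, 2 = extended
}

// countAtDistance computes the expected answer to a counting question
// directly from the payload.
func countAtDistance(symbols []Symbol, distance int) int {
	n := 0
	for _, s := range symbols {
		if s.Distance == distance {
			n++
		}
	}
	return n
}

// highestScored returns the name of the highest-scored symbol
// (the first one wins on ties).
func highestScored(symbols []Symbol) string {
	best, bestScore := "", -1.0
	for _, s := range symbols {
		if s.Score > bestScore {
			best, bestScore = s.Name, s.Score
		}
	}
	return best
}

// gradeCount accepts a reply only if it parses to exactly the expected integer.
func gradeCount(reply string, want int) bool {
	got, err := strconv.Atoi(strings.TrimSpace(reply))
	return err == nil && got == want
}

func main() {
	payload := []Symbol{
		{"handler.Response.Notify", "fn", 0.82, 1},
		{"model.SubscribeConfig", "type", 0.81, 1},
		{"service.ProductManager", "class", 1.0, 0},
	}
	want := countAtDistance(payload, 1)
	fmt.Println(want, gradeCount(" 2 ", want), highestScored(payload))
}
```

Because the expected value is computed rather than judged, the same payload always yields the same grade.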

When an agent receives data in JSON at this scale, it gets the wrong answer 46% of the time. With TOON, 32% of the time. With GCF, 10%.

Comprehension Accuracy by Model

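| Model | Runs | GCF avg | TOON avg | JSON avg |
|---|---|---|---|---|
| Claude Opus 4.6 | 2 | 96.2% | 84.6% | 73.1% |
| Claude Sonnet 4.6 | 2 | 100% | 73.1% | 53.8% |
| Claude Haiku 4.5 | 2 | 96.2% | 69.2% | 57.7% |
| GPT-5.5 | 5 | 84.1% | 67.7% | 45.8% |
| GPT-5.4 | 4 | 76.4% | 56.0% | 44.1% |
| GPT-5.4-mini | 2 | 71.8% | 64.1% | 54.2% |
| Gemini 2.5 Pro | 1 | 100% | 76.9% | 58.3% |
| Gemini 3.1 Pro | 1 | 100% | 76.9% | 46.2% |
| Gemini 3.5 Flash | 1 | 100% | 61.5% | 46.2% |
| Gemini 2.5 Flash | 3 | 80.6% | 54.6% | 57.0% |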

GCF wins on every model. TOON beats JSON on 9 of 10 models; the exception is Gemini 2.5 Flash, where JSON (57.0%) edges out TOON (54.6%).

Take question 4, "How many related (distance 1)?", on a run where the correct answer is 167. Here's what each format gives the model:

GCF: The answer is in the section header.

```
## related [167]{qualifiedName|kind|score}
@0 handler.Response.Notify|fn|0.82
@1 model.SubscribeConfig|type|0.81
...
```

The model reads 167. Done.
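A program can read the count the same way. The sketch below parses a header of the shape shown in the excerpt and returns the section name, count, and field list. The regular expression is inferred from that single example line, not from the GCF specification, and `SectionHeader` and `parseSectionHeader` are hypothetical names, not part of gcf-go.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// headerRe matches a header shaped like the excerpt's
// "## related [167]{qualifiedName|kind|score}".
// Assumption: inferred from that one line only.
var headerRe = regexp.MustCompile(`^## (\w+) \[(\d+)\]\{([^}]*)\}$`)

// SectionHeader is a hypothetical struct for this sketch.
type SectionHeader struct {
	Name   string
	Count  int
	Fields []string
}

// parseSectionHeader extracts the section name, declared row count,
// and pipe-separated field names from one header line.
func parseSectionHeader(line string) (SectionHeader, error) {
	m := headerRe.FindStringSubmatch(strings.TrimSpace(line))
	if m == nil {
		return SectionHeader{}, fmt.Errorf("not a section header: %q", line)
	}
	n, err := strconv.Atoi(m[2])
	if err != nil {
		return SectionHeader{}, err
	}
	return SectionHeader{Name: m[1], Count: n, Fields: strings.Split(m[3], "|")}, nil
}

func main() {
	h, err := parseSectionHeader("## related [167]{qualifiedName|kind|score}")
	if err != nil {
		panic(err)
	}
	fmt.Println(h.Name, h.Count, h.Fields)
}
```

Answering "How many related?" then takes one header lookup instead of a scan over every row.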

TOON: All 500 symbols in one flat table. The model must scan every row, filter by the distance column, and count.

```
symbols[500]{name,kind,score,distance}:
  handler.Response.Notify,function,0.82,1
  model.SubscribeConfig,type,0.81,1
  ...
```

Answers across runs: 100, 115, 165, 172, 190, 214. Wildly inconsistent.

JSON: Same structure as TOON, but the payload is about 53,000 tokens, inflated by repeated field names. Claude Opus responded by enumerating symbols one by one:

"Let me count precisely by going through the list:
1. handler.Response.Notify
2. model.SubscribeConfig
3. service.PublishOptions
...
143. store.DispatchConfig

So: 143."

143 lines of output. Wrong answer. The correct answer was 167. (Full artifact)

Error magnitude

When GCF gets an answer wrong, the miss is small (median error: 4). When TOON and JSON get answers wrong, they are off by 50-140 (median error: 53 and 56, respectively). GCF fails on precision; TOON and JSON fail on comprehension. GCF also fails cheaper: the model returns a short wrong number, whereas in the JSON failure quoted above the model spent 143 lines of output tokens on a manual enumeration and still got it wrong.
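For readers who want to recompute an error magnitude from raw logs, here is a minimal sketch of a median absolute error over wrong answers only. The `Attempt` type is hypothetical, and apart from the 143-versus-167 case quoted earlier, the sample values are illustrative and not taken from the eval results.

```go
package main

import (
	"fmt"
	"sort"
)

// Attempt pairs a model's numeric answer with the value computed from the payload.
type Attempt struct {
	Got, Want int
}

// medianAbsError returns the median of |Got-Want| over wrong answers only.
// The second result is false when every answer was correct.
func medianAbsError(attempts []Attempt) (float64, bool) {
	var errs []int
	for _, a := range attempts {
		d := a.Got - a.Want
		if d < 0 {
			d = -d
		}
		if d != 0 {
			errs = append(errs, d)
		}
	}
	if len(errs) == 0 {
		return 0, false
	}
	sort.Ints(errs)
	mid := len(errs) / 2
	if len(errs)%2 == 1 {
		return float64(errs[mid]), true
	}
	return float64(errs[mid-1]+errs[mid]) / 2, true
}

func main() {
	attempts := []Attempt{
		{Got: 143, Want: 167}, // the JSON enumeration failure quoted earlier
		{Got: 165, Want: 167}, // illustrative near miss, not from the logs
		{Got: 167, Want: 167}, // correct answers are excluded from the median
	}
	m, ok := medianAbsError(attempts)
	fmt.Println(m, ok)
}
```

Whether the published medians exclude correct answers in the same way is an assumption here; the failure taxonomy is the authority on that.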

[Figure: Error Magnitude]

See the full failure taxonomy for the complete analysis.


Generation: Can LLMs Write It?

The model is given a natural-language description of symbols and edges (e.g., "ProductManager, class, score 1.0, target") and a 3-line format primer. It must produce valid, decoder-parseable output. Tested at 5, 10, 20, 50, and 100 symbols.

Output validated through real decoders: the official toon-go library for TOON, gcf.Decode() for GCF, json.Unmarshal() for JSON. Same data, same descriptions, same prompt structure across all three formats.
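The JSON leg of that validation can be sketched with the standard library alone. The `Payload` struct below is a hypothetical schema invented for this example, and the symbol-count check is an added assumption about what "valid" might include. The signatures of `gcf.Decode()` and the toon-go decoder are not shown on this page, so they are left out rather than guessed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Payload is a hypothetical schema; the eval's real types are not shown here.
type Payload struct {
	Symbols []struct {
		Name     string  `json:"name"`
		Kind     string  `json:"kind"`
		Score    float64 `json:"score"`
		Distance int     `json:"distance"`
	} `json:"symbols"`
}

// validJSON returns nil when the model output decodes into Payload and
// contains the expected number of symbols.
func validJSON(output string, wantSymbols int) error {
	var p Payload
	if err := json.Unmarshal([]byte(output), &p); err != nil {
		return fmt.Errorf("decode: %w", err)
	}
	if len(p.Symbols) != wantSymbols {
		return fmt.Errorf("got %d symbols, want %d", len(p.Symbols), wantSymbols)
	}
	return nil
}

func main() {
	good := `{"symbols":[{"name":"ProductManager","kind":"class","score":1.0,"distance":0}]}`
	// A label where an integer is expected is a type error for the decoder.
	bad := `{"symbols":[{"name":"ProductManager","kind":"class","score":1.0,"distance":"target"}]}`
	fmt.Println(validJSON(good, 1))
	fmt.Println(validJSON(bad, 1))
}
```

A real decoder is a stricter gate than a regex or an LLM judge: output either parses into the expected types or it does not.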

Generation Validity by Model

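| Model | GCF | TOON | JSON |
|---|---|---|---|
| Claude Opus 4.6 | 5/5 | 0/5 | 5/5 |
| Claude Sonnet 4.6 | 5/5 | 2-3/5 | 5/5 |
| Claude Haiku 4.5 | 5/5 | 1-3/5 | 5/5 |
| GPT-5.5 | 4-5/5 | 1-2/5 | 5/5 |
| GPT-5.4 | 5/5 | 0/5 | 5/5 |
| GPT-5.4-mini | 5/5 | 0/5 | 5/5 |
| Gemini 2.5 Pro | 5/5 | 1/5 | 5/5 |
| Gemini 3.1 Pro | 5/5 | 0/5 | 5/5 |
| Gemini 3.1 Flash Lite | 4-5/5 | 0/5 | 4-5/5 |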

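Every model tested produces valid GCF on at least 4 of 5 attempts, matching JSON despite never having been trained on GCF. TOON's official decoder rejects at least one output from every model, and every output from 5 of the 9.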

Why TOON fails

[Figure: The Distance Label Problem]

TOON's flat columns require the model to encode semantic categories as integers. When told "this symbol is a target," the model writes target in the distance column. TOON's decoder expects 0. No model tested performs this mapping reliably unprompted.

GCF expresses distance through section placement: targets go in ## targets, related symbols go in ## related. No mapping required. The format aligns with how models naturally express grouped data.
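To make section placement concrete, the sketch below renders one group of symbols in the shape of the GCF excerpt shown earlier: a header carrying the count, then one indexed row per symbol. It mirrors that excerpt only. The `Symbol` type and `renderSection` are hypothetical, this is not the gcf-go encoder, and the real format may have rules (escaping, other sections) that this ignores.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Symbol is a hypothetical type; distance is implied by the section it is placed in.
type Symbol struct {
	Name  string
	Kind  string
	Score float64
}

// renderSection writes a "## name [count]{fields}" header followed by one
// "@index qualifiedName|kind|score" row per symbol, as in the excerpt.
func renderSection(name string, symbols []Symbol) string {
	var b strings.Builder
	fmt.Fprintf(&b, "## %s [%d]{qualifiedName|kind|score}\n", name, len(symbols))
	for i, s := range symbols {
		score := strconv.FormatFloat(s.Score, 'f', -1, 64)
		fmt.Fprintf(&b, "@%d %s|%s|%s\n", i, s.Name, s.Kind, score)
	}
	return b.String()
}

func main() {
	related := []Symbol{
		{"handler.Response.Notify", "fn", 0.82},
		{"model.SubscribeConfig", "type", 0.81},
	}
	fmt.Print(renderSection("related", related))
}
```

Note that no distance value is ever written: choosing the section is the encoding.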

When TOON is given pre-encoded integers (hand-holding the model through the mapping it can't do on its own), it passes 5/5 but still produces 28% more output than GCF.
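The pre-encoding step amounts to a small lookup applied before the model ever sees the data. A minimal sketch, assuming the label-to-integer mapping implied by the comprehension questions (target 0, related 1, extended 2) and the column order of the TOON excerpt; `encodeDistance` and `toonRow` are hypothetical helpers, and the row writer does no quoting or escaping.

```go
package main

import (
	"fmt"
	"strings"
)

// distanceByLabel follows the categories named in the comprehension questions.
var distanceByLabel = map[string]int{
	"target":   0,
	"related":  1,
	"extended": 2,
}

// encodeDistance converts a label such as "target" to its integer distance.
func encodeDistance(label string) (int, error) {
	d, ok := distanceByLabel[strings.ToLower(strings.TrimSpace(label))]
	if !ok {
		return 0, fmt.Errorf("unknown distance label %q", label)
	}
	return d, nil
}

// toonRow renders one row in the column order name,kind,score,distance.
func toonRow(name, kind string, score float64, label string) (string, error) {
	d, err := encodeDistance(label)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%s,%s,%.2f,%d", name, kind, score, d), nil
}

func main() {
	row, err := toonRow("handler.Response.Notify", "function", 0.82, "related")
	if err != nil {
		panic(err)
	}
	fmt.Println(row)
}
```

This is trivial in code, which is the point: the failure is in asking the model to apply the mapping implicitly.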

Output size

GCF output is 63% smaller than JSON and 33% smaller than TOON at 100 symbols. Every output token costs money. At scale, this compounds.

[Figure: Output Size at Scale]


TOON's Benchmark: Token Efficiency

This is not our benchmark. This is TOON's benchmark, forked unmodified. Their datasets, their tokenizer (gpt-tokenizer, o200k_base), their methodology. We added one line: a GCF formatter. Everything else is TOON's code measuring TOON's chosen datasets.

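| Dataset | Structure | GCF | TOON | JSON |
|---|---|---|---|---|
| Event logs | Semi-uniform | 108,158 | 154,032 | 181,141 |
| E-commerce | Nested | 61,593 | 73,246 | 109,574 |
| Nested config | Deep | 616 | 618 | 905 |
| Employees | Flat | 49,054 | 49,966 | 127,050 |
| Analytics | Flat | 8,397 | 9,127 | 22,257 |
| GitHub repos | Flat | 8,575 | 8,744 | 15,144 |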

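GCF wins all 6 datasets: it is 30% smaller than TOON on semi-uniform data (equivalently, TOON is 42% larger than GCF), 16% smaller on nested e-commerce data, and 2-8% smaller on flat data. On the deeply nested config the two are effectively tied (616 vs. 618 tokens).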
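The relative savings can be recomputed from the table with a few lines of arithmetic. The sketch below uses the token counts exactly as listed above; `Row` and `percentSmaller` are names made up for this example.

```go
package main

import "fmt"

// Row holds token counts copied from the table above.
type Row struct {
	Dataset         string
	GCF, TOON, JSON int
}

// percentSmaller returns how much smaller a is than b, as a percentage of b.
func percentSmaller(a, b int) float64 {
	return (1 - float64(a)/float64(b)) * 100
}

func main() {
	rows := []Row{
		{"Event logs", 108158, 154032, 181141},
		{"E-commerce", 61593, 73246, 109574},
		{"Nested config", 616, 618, 905},
		{"Employees", 49054, 49966, 127050},
		{"Analytics", 8397, 9127, 22257},
		{"GitHub repos", 8575, 8744, 15144},
	}
	for _, r := range rows {
		fmt.Printf("%-14s vs TOON %5.1f%%  vs JSON %5.1f%%\n",
			r.Dataset, percentSmaller(r.GCF, r.TOON), percentSmaller(r.GCF, r.JSON))
	}
}
```

The direction of the ratio matters: for Event logs, GCF is about 29.8% smaller than TOON, while TOON is about 42% larger than GCF.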


Reproduce

All evals are in gcf-go/eval. All raw logs are in eval/results.

```bash
git clone https://github.com/blackwell-systems/gcf-go
cd gcf-go/eval

# Comprehension (any backend)
GOWORK=off go test -run TestComprehension -v -timeout 0
EVAL_BACKEND=openai OPENAI_API_KEY=... EVAL_MODEL=gpt-5.5 GOWORK=off go test -run TestComprehension -v -timeout 0
EVAL_BACKEND=google GOOGLE_API_KEY=... EVAL_MODEL=gemini-2.5-flash GOWORK=off go test -run TestComprehension -v -timeout 0

# Generation (all three formats)
GOWORK=off go test -run "TestGeneration$|TestGenerationTOON|TestGenerationJSON" -v -timeout 0

# Token efficiency
git clone https://github.com/blackwell-systems/toon.git
cd toon && git checkout gcf-comparison
cd benchmarks && pnpm install && pnpm benchmark:tokens
```

For detailed failure analysis, error taxonomy, and per-run data, see the full eval results.