Benchmarks
1,300+ LLM evaluations across 10 models, 3 providers, and 51 independent test runs.
No model has ever been trained on GCF.
Yet every model tested reads it more accurately than the formats it was trained on, and writes it about as reliably as JSON and far more reliably than TOON.
| | GCF | TOON | JSON |
|---|---|---|---|
| Comprehension (23 runs, 10 models) | 90.5% | 68.5% | 53.6% |
| Generation (28 runs, 9 models) | 5/5 | 1.0/5 | 5.0/5 |
| Input tokens (500 symbols) | 11,090 | 16,378 | 53,341 |
| Output tokens (100 symbols) | 5,976 | 8,937 | 16,121 |
Three benchmark suites, three providers (Anthropic, OpenAI, Google), zero training:
- Comprehension eval: Can models extract information from a format? 500 symbols, 13 questions, 23 runs across 10 models.
- Generation eval: Can models produce valid output in a format? 3-line primer, 28 runs across 9 models.
- Token efficiency (TOON's benchmark): How many tokens does each format cost? This is TOON's own benchmark, forked unmodified, with GCF added as one additional formatter.
All results reproducible.

Comprehension: Can LLMs Read It?
500 symbols, 200 edges, zero format instructions. Each run generates a fresh random payload. The model receives the payload in each format and answers 13 extraction questions:
| # | Category | Question |
|---|---|---|
| 1 | Counting | How many symbols? |
| 2 | Counting | How many edges? |
| 3 | Counting | How many targets (distance 0)? |
| 4 | Counting | How many related (distance 1)? |
| 5 | Counting | How many extended (distance 2)? |
| 6 | Counting | How many functions? |
| 7 | Counting | How many 'calls' edges? |
| 8 | Extraction | Highest-scored symbol name? |
| 9 | Extraction | Kind of highest-scored symbol? |
| 10 | Extraction | Kind of last symbol? |
| 11 | Extraction | All unique edge types? |
| 12 | Structure | Does it have an edges section? |
| 13 | Structure | What is the tool name? |
All answers are deterministic (computed from the payload). No LLM judge.
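As a rough illustration of what "computed from the payload" can mean, the sketch below derives two of the expected answers directly from the data and grades a model's reply by exact match. The `Symbol` type, the helper names, and the exact-match grading rule are assumptions made for this example; the eval's real types and comparison logic may differ.

```go
package main

import "fmt"

// Symbol is a hypothetical stand-in for one payload entry.
type Symbol struct {
	Name     string
	Kind     string
	Score    float64
	Distance int // 0 = target, 1 = related, 2 = extended
}

// countByDistance returns how many symbols sit at the given distance.
func countByDistance(symbols []Symbol, distance int) int {
	n := 0
	for _, s := range symbols {
		if s.Distance == distance {
			n++
		}
	}
	return n
}

// highestScored returns the symbol with the highest score.
// The first symbol wins on ties; the slice must be non-empty.
func highestScored(symbols []Symbol) Symbol {
	best := symbols[0]
	for _, s := range symbols[1:] {
		if s.Score > best.Score {
			best = s
		}
	}
	return best
}

// grade compares a model's answer with the computed ground truth (assumed exact match).
func grade(expected, got string) bool {
	return expected == got
}

func main() {
	payload := []Symbol{
		{"ProductManager", "class", 1.0, 0},
		{"handler.Response.Notify", "fn", 0.82, 1},
		{"model.SubscribeConfig", "type", 0.81, 1},
	}
	expected := fmt.Sprint(countByDistance(payload, 1))
	fmt.Println(grade(expected, "2"))        // true
	fmt.Println(highestScored(payload).Name) // ProductManager
}
```

Because every expected value is a pure function of the payload, the same run can be re-graded later without calling a model.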
When an agent receives data in JSON at this scale, it gets the wrong answer 46% of the time. With TOON, 32% of the time. With GCF, 10%.

| Model | Runs | GCF avg | TOON avg | JSON avg |
|---|---|---|---|---|
| Claude Opus 4.6 | 2 | 96.2% | 84.6% | 73.1% |
| Claude Sonnet 4.6 | 2 | 100% | 73.1% | 53.8% |
| Claude Haiku 4.5 | 2 | 96.2% | 69.2% | 57.7% |
| GPT-5.5 | 5 | 84.1% | 67.7% | 45.8% |
| GPT-5.4 | 4 | 76.4% | 56.0% | 44.1% |
| GPT-5.4-mini | 2 | 71.8% | 64.1% | 54.2% |
| Gemini 2.5 Pro | 1 | 100% | 76.9% | 58.3% |
| Gemini 3.1 Pro | 1 | 100% | 76.9% | 46.2% |
| Gemini 3.5 Flash | 1 | 100% | 61.5% | 46.2% |
| Gemini 2.5 Flash | 3 | 80.6% | 54.6% | 57.0% |
GCF wins on every model. TOON also beats JSON on every model except Gemini 2.5 Flash, where JSON edges ahead (57.0% vs 54.6%).
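The headline comprehension figures appear to be the unweighted mean of the per-model averages in the table above (each model counts once, regardless of its run count). The sketch below recomputes them from the table and lists any model where the strict ordering GCF > TOON > JSON does not hold. The type and function names are illustrative only.

```go
package main

import "fmt"

// row holds one model's average accuracy per format, copied from the table.
type row struct {
	model           string
	gcf, toon, json float64
}

var rows = []row{
	{"Claude Opus 4.6", 96.2, 84.6, 73.1},
	{"Claude Sonnet 4.6", 100, 73.1, 53.8},
	{"Claude Haiku 4.5", 96.2, 69.2, 57.7},
	{"GPT-5.5", 84.1, 67.7, 45.8},
	{"GPT-5.4", 76.4, 56.0, 44.1},
	{"GPT-5.4-mini", 71.8, 64.1, 54.2},
	{"Gemini 2.5 Pro", 100, 76.9, 58.3},
	{"Gemini 3.1 Pro", 100, 76.9, 46.2},
	{"Gemini 3.5 Flash", 100, 61.5, 46.2},
	{"Gemini 2.5 Flash", 80.6, 54.6, 57.0},
}

// means returns the unweighted mean accuracy per format across models.
func means(rs []row) (gcf, toon, json float64) {
	for _, r := range rs {
		gcf += r.gcf
		toon += r.toon
		json += r.json
	}
	n := float64(len(rs))
	return gcf / n, toon / n, json / n
}

// orderingExceptions lists models where GCF > TOON > JSON does not hold.
func orderingExceptions(rs []row) []string {
	var out []string
	for _, r := range rs {
		if !(r.gcf > r.toon && r.toon > r.json) {
			out = append(out, r.model)
		}
	}
	return out
}

func main() {
	g, t, j := means(rows)
	fmt.Printf("GCF %.1f%%  TOON %.1f%%  JSON %.1f%%\n", g, t, j) // GCF 90.5%  TOON 68.5%  JSON 53.6%
	fmt.Println(orderingExceptions(rows))                         // [Gemini 2.5 Flash]
}
```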
Example: "How many related symbols?"
The answer is 167. Here's what each format gives the model:
GCF: The answer is in the section header.

    ## related [167]{qualifiedName|kind|score}
    @0 handler.Response.Notify|fn|0.82
    @1 model.SubscribeConfig|type|0.81
    ...

The model reads 167. Done.
TOON: All 500 symbols in one flat table. The model must scan every row, filter by the distance column, and count.

    symbols[500]{name,kind,score,distance}:
    handler.Response.Notify,function,0.82,1
    model.SubscribeConfig,type,0.81,1
    ...

Answers across runs: 100, 115, 165, 172, 190, 214. Wildly inconsistent.
JSON: Same task as TOON, but inside a payload of roughly 53,000 tokens, much of it repeated field names. Claude Opus, the strongest JSON reader in these runs (73.1%), responded by enumerating symbols one by one:

    "Let me count precisely by going through the list:
    1. handler.Response.Notify
    2. model.SubscribeConfig
    3. service.PublishOptions
    ...
    143. store.DispatchConfig
    So: 143."

143 lines of output. Wrong answer. The correct answer was 167. (Full artifact)
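The difference between the two reading strategies can be sketched in code: one lookup against a section header versus a filter over every row. The header pattern below is inferred only from the single GCF line shown in this example and is not taken from the GCF specification; the function names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// headerCount matches a section header such as "## related [167]{...}".
// Assumption: this syntax is inferred from one example, not from the GCF spec.
var headerCount = regexp.MustCompile(`^## (\w+) \[(\d+)\]`)

// countFromHeader reads a section's declared row count straight from its header.
func countFromHeader(doc, section string) (int, bool) {
	for _, line := range strings.Split(doc, "\n") {
		m := headerCount.FindStringSubmatch(line)
		if m != nil && m[1] == section {
			n, err := strconv.Atoi(m[2])
			return n, err == nil
		}
	}
	return 0, false
}

// countByScan counts rows of a flat comma-separated table whose last column equals distance.
func countByScan(rows []string, distance string) int {
	n := 0
	for _, r := range rows {
		cols := strings.Split(r, ",")
		if cols[len(cols)-1] == distance {
			n++
		}
	}
	return n
}

func main() {
	gcf := "## related [167]{qualifiedName|kind|score}\n@0 handler.Response.Notify|fn|0.82"
	n, ok := countFromHeader(gcf, "related")
	fmt.Println(n, ok) // 167 true

	flatRows := []string{
		"handler.Response.Notify,function,0.82,1",
		"model.SubscribeConfig,type,0.81,1",
	}
	fmt.Println(countByScan(flatRows, "1")) // 2
}
```

A program performs both reliably; the point of the example is that the second strategy requires touching all 500 rows, which is where the models in these runs lose count.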
Error magnitude
When GCF gets an answer wrong, the miss is small, often only 1-2 (median error: 4). When TOON and JSON get answers wrong, they're off by 50-140 (median errors: 53 and 56, respectively). GCF fails on precision; TOON and JSON fail on comprehension. The failures also cost differently: a GCF miss is a short wrong number, while the JSON miss shown above burned 143 lines of output tokens on a manual enumeration and still got it wrong. GCF fails cheaper.
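For readers who want to recompute an error magnitude from the raw logs, a median absolute error can be sketched as follows. The sample answers in `main` are hypothetical values chosen for illustration, not results from the eval, and how the eval aggregates errors across questions is not specified here.

```go
package main

import (
	"fmt"
	"sort"
)

// medianAbsError returns the median of |answer - truth| over a set of answers.
// It returns 0 for an empty input.
func medianAbsError(answers []int, truth int) float64 {
	if len(answers) == 0 {
		return 0
	}
	errs := make([]int, len(answers))
	for i, a := range answers {
		d := a - truth
		if d < 0 {
			d = -d
		}
		errs[i] = d
	}
	sort.Ints(errs)
	mid := len(errs) / 2
	if len(errs)%2 == 1 {
		return float64(errs[mid])
	}
	// Even count: average the two middle values.
	return float64(errs[mid-1]+errs[mid]) / 2
}

func main() {
	// Hypothetical wrong answers to a question whose true answer is 167.
	fmt.Println(medianAbsError([]int{165, 168, 171}, 167)) // 2
}
```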

See the full failure taxonomy for the complete analysis.
Generation: Can LLMs Write It?
The model is given a natural-language description of symbols and edges (e.g., "ProductManager, class, score 1.0, target") and a 3-line format primer. It must produce valid, decoder-parseable output. Tested at 5, 10, 20, 50, and 100 symbols.
Output validated through real decoders: the official toon-go library for TOON, gcf.Decode() for GCF, json.Unmarshal() for JSON. Same data, same descriptions, same prompt structure across all three formats.
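The JSON leg of that validation can be approximated with the standard library alone. This is a minimal sketch: the `payload` schema and the symbol-count check are assumptions made for illustration, and the GCF and TOON decoders are not shown because their exact signatures are not given here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// payload is a hypothetical schema; the eval's actual JSON shape may differ.
type payload struct {
	Symbols []struct {
		Name  string  `json:"name"`
		Kind  string  `json:"kind"`
		Score float64 `json:"score"`
	} `json:"symbols"`
}

// validateJSON reports whether model output parses and holds the expected number of symbols.
func validateJSON(output string, wantSymbols int) error {
	var p payload
	if err := json.Unmarshal([]byte(output), &p); err != nil {
		return fmt.Errorf("decode failed: %w", err)
	}
	if len(p.Symbols) != wantSymbols {
		return fmt.Errorf("got %d symbols, want %d", len(p.Symbols), wantSymbols)
	}
	return nil
}

func main() {
	good := `{"symbols":[{"name":"ProductManager","kind":"class","score":1.0}]}`
	truncated := `{"symbols":[{"name":"ProductManager","kind":"class","score":1.0}`
	fmt.Println(validateJSON(good, 1) == nil)      // true
	fmt.Println(validateJSON(truncated, 1) == nil) // false
}
```

Using a real decoder as the judge means a pass or fail is binary and does not depend on anyone's reading of the output.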

| Model | GCF | TOON | JSON |
|---|---|---|---|
| Claude Opus 4.6 | 5/5 | 0/5 | 5/5 |
| Claude Sonnet 4.6 | 5/5 | 2-3/5 | 5/5 |
| Claude Haiku 4.5 | 5/5 | 1-3/5 | 5/5 |
| GPT-5.5 | 4-5/5 | 1-2/5 | 5/5 |
| GPT-5.4 | 5/5 | 0/5 | 5/5 |
| GPT-5.4-mini | 5/5 | 0/5 | 5/5 |
| Gemini 2.5 Pro | 5/5 | 1/5 | 5/5 |
| Gemini 3.1 Pro | 5/5 | 0/5 | 5/5 |
| Gemini 3.1 Flash Lite | 4-5/5 | 0/5 | 4-5/5 |
Every model tested can produce GCF, matching JSON's reliability without any training exposure. TOON's official decoder rejects all five outputs on 5 of the 9 models, and no model passes TOON at every size.
Why TOON fails

TOON's flat columns require the model to encode semantic categories as integers. When told "this symbol is a target," the model writes target in the distance column, but TOON's decoder expects 0. No model tested performs this mapping reliably unprompted.
GCF expresses distance through section placement: targets go in ## targets, related symbols go in ## related. No mapping required. The format aligns with how models naturally express grouped data.
When TOON is given pre-encoded integers (hand-holding the model through the mapping it can't do on its own), it passes 5/5 but still produces 28% more output than GCF.
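The two encoding burdens can be contrasted in a short sketch. Writing a flat row needs a lookup from role label to integer (0, 1, 2, following the distances listed in the comprehension questions), while grouping by section only needs the label itself. The types, function names, and row layout are hypothetical and simplified; neither function emits real TOON or GCF syntax.

```go
package main

import "fmt"

// distanceOf maps role labels to the integers a flat distance column expects.
var distanceOf = map[string]int{"target": 0, "related": 1, "extended": 2}

// symbol is a hypothetical, simplified description of one entry.
type symbol struct{ name, role string }

// flatRow encodes a symbol for a flat table: the role must be translated to an integer.
func flatRow(s symbol) (string, error) {
	d, ok := distanceOf[s.role]
	if !ok {
		return "", fmt.Errorf("unknown role %q", s.role)
	}
	return fmt.Sprintf("%s,%d", s.name, d), nil
}

// groupBySection files each symbol under its role label; no integer translation is involved.
func groupBySection(symbols []symbol) map[string][]string {
	out := map[string][]string{}
	for _, s := range symbols {
		out[s.role] = append(out[s.role], s.name)
	}
	return out
}

func main() {
	syms := []symbol{
		{"ProductManager", "target"},
		{"handler.Response.Notify", "related"},
	}
	row, _ := flatRow(syms[0])
	fmt.Println(row)                             // ProductManager,0
	fmt.Println(groupBySection(syms)["related"]) // [handler.Response.Notify]
}
```

The failure described above corresponds to skipping the `distanceOf` lookup and writing the label where the integer belongs.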
Output size
GCF output is 63% smaller than JSON and 33% smaller than TOON at 100 symbols. Every output token costs money. At scale, this compounds.

TOON's Benchmark: Token Efficiency
This is not our benchmark. This is TOON's benchmark, forked unmodified. Their datasets, their tokenizer (gpt-tokenizer, o200k_base), their methodology. We added one line: a GCF formatter. Everything else is TOON's code measuring TOON's chosen datasets.
| Dataset | Structure | GCF | TOON | JSON |
|---|---|---|---|---|
| Event logs | Semi-uniform | 108,158 | 154,032 | 181,141 |
| E-commerce | Nested | 61,593 | 73,246 | 109,574 |
| Nested config | Deep | 616 | 618 | 905 |
| Employees | Flat | 49,054 | 49,966 | 127,050 |
| Analytics | Flat | 8,397 | 9,127 | 22,257 |
| GitHub repos | Flat | 8,575 | 8,744 | 15,144 |
GCF wins all 6 datasets. It is 30% smaller than TOON on semi-uniform data (equivalently, TOON is 42% larger), 16% smaller on nested e-commerce data, and 2-8% smaller on flat data.
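Percentage comparisons depend on which format is used as the baseline, so it helps to compute both directions from the table. The sketch below does this for the event-logs row; the helper names are illustrative.

```go
package main

import "fmt"

// percentSmaller returns how much smaller a is than b, as a percentage of b.
func percentSmaller(a, b float64) float64 { return (1 - a/b) * 100 }

// percentLarger returns how much larger b is than a, as a percentage of a.
func percentLarger(a, b float64) float64 { return (b/a - 1) * 100 }

func main() {
	// Event logs row: GCF 108,158 tokens, TOON 154,032 tokens.
	const gcf, toon = 108158.0, 154032.0
	fmt.Printf("GCF is %.1f%% smaller than TOON\n", percentSmaller(gcf, toon)) // 29.8%
	fmt.Printf("TOON is %.1f%% larger than GCF\n", percentLarger(gcf, toon))   // 42.4%
}
```

The same two helpers applied to the flat datasets give roughly 2% (Employees, GitHub repos) to 8% (Analytics) smaller for GCF.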
Reproduce
All evals are in gcf-go/eval. All raw logs are in eval/results.

    git clone https://github.com/blackwell-systems/gcf-go
    cd gcf-go/eval

    # Comprehension (any backend)
    GOWORK=off go test -run TestComprehension -v -timeout 0
    EVAL_BACKEND=openai OPENAI_API_KEY=... EVAL_MODEL=gpt-5.5 GOWORK=off go test -run TestComprehension -v -timeout 0
    EVAL_BACKEND=google GOOGLE_API_KEY=... EVAL_MODEL=gemini-2.5-flash GOWORK=off go test -run TestComprehension -v -timeout 0

    # Generation (all three formats)
    GOWORK=off go test -run "TestGeneration$|TestGenerationTOON|TestGenerationJSON" -v -timeout 0

    # Token efficiency
    git clone https://github.com/blackwell-systems/toon.git
    cd toon && git checkout gcf-comparison
    cd benchmarks && pnpm install && pnpm benchmark:tokens

For detailed failure analysis, error taxonomy, and per-run data, see the full eval results.