Benchmarks measuring TOON's token efficiency and retrieval accuracy compared to JSON, XML, YAML, and CSV.
> **Note**
> Results are automatically embedded in the main README. This guide focuses on running the benchmarks locally.
```sh
# Run token efficiency benchmark
pnpm benchmark:tokens

# Run retrieval accuracy benchmark (requires API keys)
pnpm benchmark:accuracy
```

Measures token count reduction across JSON, XML, YAML, CSV, and TOON:
- Generate datasets (GitHub repos, analytics, orders)
- Convert to all formats (TOON, JSON, XML, YAML, CSV)
- Tokenize using `gpt-tokenizer` (`o200k_base` encoding)
- Calculate savings and generate report

```sh
pnpm benchmark:tokens
```

Results are saved to `results/token-efficiency.md`.
Tests how well LLMs can answer questions about data in different formats (TOON, JSON, JSON compact, XML, YAML, CSV):
- Generate 209 questions across 11 datasets (6 primary + 5 structural validation; CSV only included for datasets with flat/tabular structure)
- Convert each dataset to all supported formats
- Query each LLM with formatted data + question
- Validate answers deterministically using type-aware comparison (no LLM judge needed)
- Aggregate metrics and generate report
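The deterministic validation step can be pictured as a small type-aware comparison: coerce both the expected and the model's answer to a canonical type before comparing, so `"42"` and `" 42.0 "` count as a match. The helper names below are illustrative, not the benchmark's actual API (the real logic lives in `src/normalize.ts`):

```typescript
// Sketch (assumed, simplified): type-aware answer checking without an LLM judge.
function normalize(value: string): string | number | boolean {
  const trimmed = value.trim().toLowerCase()
  // Booleans first, since Number('true') is NaN anyway.
  if (trimmed === 'true' || trimmed === 'false') return trimmed === 'true'
  const num = Number(trimmed)
  if (trimmed !== '' && !Number.isNaN(num)) return num
  return trimmed
}

function answersMatch(expected: string, actual: string): boolean {
  return normalize(expected) === normalize(actual)
}

console.log(answersMatch('42', ' 42.0 '))  // → true (numeric comparison)
console.log(answersMatch('True', 'true'))  // → true (boolean comparison)
```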
- Edit `src/evaluate.ts` and add models to the exported `models` array:

  ```ts
  export const models: LanguageModelV3[] = [
    openai('gpt-5-nano'),
    anthropic('claude-haiku-4-5-20251001'),
    google('gemini-3-flash-preview'),
    xai('grok-4-1-fast-non-reasoning'),
    // Add your models here
  ]
  ```
- Duplicate `.env.example` to `.env` and add your API keys:

  ```sh
  cp .env.example .env
  ```
```sh
# Full benchmark
pnpm benchmark:accuracy

# Dry run (10 questions only, for testing setup)
DRY_RUN=true pnpm benchmark:accuracy
```

Running the script will:
- Prompt you to select which models to test.
- Skip models with existing results (rerun to overwrite).
- Show progress with rate limiting.
- Save results to `results/accuracy/models/{model-id}.json`.
- Generate report at `results/retrieval-accuracy.md`.
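The "skip models with existing results" behavior amounts to a cache check on the per-model results file. A minimal sketch, with a hypothetical helper name:

```typescript
// Sketch (illustrative, not the benchmark's actual API): a model is skipped
// when its cached results file already exists under results/accuracy/models/.
import { existsSync } from 'node:fs'

function hasCachedResults(modelId: string): boolean {
  return existsSync(`results/accuracy/models/${modelId}.json`)
}

if (hasCachedResults('gpt-5-nano')) {
  console.log('Skipping gpt-5-nano (rerun to overwrite existing results)')
}
```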
Edit `src/constants.ts` to adjust:

- `MODEL_RPM_LIMITS` – Rate limits per model
- `DEFAULT_CONCURRENCY` – Parallel tasks (default: 10)
- `DRY_RUN_LIMITS` – Questions per dry run (default: 10)
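One way to picture how per-model RPM limits could gate requests is a sliding-window limiter. This is an assumed, simplified sketch, not the benchmark's implementation; the class and constant shapes are illustrative:

```typescript
// Sketch: a requests-per-minute gate fed by MODEL_RPM_LIMITS-style settings.
const MODEL_RPM_LIMITS: Record<string, number> = { 'gpt-5-nano': 500 } // hypothetical values
const DEFAULT_RPM = 60

class RpmLimiter {
  private timestamps: number[] = []
  constructor(private rpm: number) {}

  // Resolves immediately while under the limit; otherwise waits until the
  // oldest request in the 60s window expires.
  async acquire(): Promise<void> {
    const now = Date.now()
    this.timestamps = this.timestamps.filter((t) => now - t < 60_000)
    if (this.timestamps.length >= this.rpm) {
      const wait = 60_000 - (now - this.timestamps[0])
      await new Promise((resolve) => setTimeout(resolve, wait))
    }
    this.timestamps.push(Date.now())
  }
}

const limiter = new RpmLimiter(MODEL_RPM_LIMITS['gpt-5-nano'] ?? DEFAULT_RPM)
```

Each LLM call would then be preceded by `await limiter.acquire()` before dispatching the request.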
```
scripts/
├── accuracy-benchmark.ts         # Retrieval accuracy benchmark
├── token-efficiency-benchmark.ts # Token counting benchmark
└── fetch-github-repos.ts         # Update GitHub dataset
src/
├── constants.ts                  # Configuration
├── datasets.ts                   # Test data generators
├── evaluate.ts                   # LLM evaluation
├── formatters.ts                 # Format converters
├── normalize.ts                  # Answer normalization
├── report.ts                     # Markdown reports
├── storage.ts                    # Result caching
├── types.ts                      # Type definitions
├── utils.ts                      # Helpers
└── questions/                    # Question generators
    ├── analytics.ts
    ├── event-logs.ts
    ├── github.ts
    ├── index.ts
    ├── nested-config.ts
    ├── nested.ts
    ├── structural-validation.ts
    ├── structure.ts
    ├── tabular.ts
    └── utils.ts
data/
└── github-repos.json             # Top 100 GitHub repos
results/
├── token-efficiency.md           # Token savings report
├── retrieval-accuracy.md         # Accuracy report
└── accuracy/models/              # Per-model results (JSON)
```