A benchmark for Small Language Models (SLMs) on SQL generation tasks. It uses a well-defined semantic model with verified SQL queries and sample values to evaluate how accurately local LLMs can generate SQL from natural language questions.
- **Semantic Model** — A detailed schema definition (`semantic_model.txt`) describing a TPC-DS star schema (`store_sales`, `store_returns`, `date_dim`, `store`, `customer`, `item`) with measures, dimensions, anti-patterns, and verified example queries
- **20 Test Questions** — Ranging from simple aggregations to complex multi-year comparisons with return rates (`questions.json`)
- **Local LLM Inference** — Models run locally via a llama.cpp server, quantized to Q4_K_M (4-bit)
- **Automated Judging** — Claude Opus acts as an invisible judge, comparing model outputs against a verified baseline using column-value matching with numeric tolerance
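The column-value matching used in judging can be sketched roughly as follows. The helper names and exact string normalization here are assumptions, but the 0.5% relative tolerance mirrors the benchmark's own description:

```python
def values_match(expected, actual, rel_tol=0.005):
    """Cell-level comparison: numbers match within a 0.5% relative tolerance;
    anything non-numeric falls back to a normalized string comparison."""
    try:
        e, a = float(expected), float(actual)
        if e == 0:
            return a == 0
        return abs(a - e) / abs(e) <= rel_tol
    except (TypeError, ValueError):
        return str(expected).strip().lower() == str(actual).strip().lower()

def column_found(baseline_col, candidate_cols, rel_tol=0.005):
    """A baseline column counts as matched if ANY column of the model output
    agrees with it row by row; column names and extra columns are ignored."""
    return any(
        len(col) == len(baseline_col)
        and all(values_match(e, a, rel_tol) for e, a in zip(baseline_col, col))
        for col in candidate_cols
    )
```

Matching by value rather than by name is what lets a model answer survive cosmetic differences such as renamed or additional columns.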
- GPU: NVIDIA RTX 2000 Mobile (4GB VRAM)
- RAM: 32GB
- Inference: llama.cpp b8157
| Model | Parameters | Type | Quantization |
|---|---|---|---|
| Qwen3.5-35B-A3B | 35B (3B active) | MoE | Q4_K_M |
| Qwen3-30B-A3B | 30B (3B active) | MoE | Q4_K_M |
| Qwen3-4B | 4B | Dense | Q4_K_M |
| GLM-4.7B Flash | 4.7B | Dense | Q4_K_M |
- Python with `duckdb`, `pandas`, `matplotlib`, `requests`, `psutil`
- llama.cpp server binary
- GGUF model files
- Configure model paths in `MODEL_CONFIGS` inside `SLM_SQL_test.ipynb`
- Run the notebook — the `run_test()` function handles everything:
  - Starts llama-server with the selected model
  - Runs all 20 SQL questions with a feedback loop
  - Validates results against the baseline
  - Plots the speed vs. accuracy chart
  - Stops the server
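For each question, the notebook talks to the local server over HTTP. A minimal sketch of a single call, assuming llama-server's OpenAI-compatible `/v1/chat/completions` endpoint on the default port 8080; the function names are illustrative, not the notebook's actual API:

```python
import requests

def build_payload(question: str, system_prompt: str) -> dict:
    """Chat-completion request body: semantic model as the system prompt,
    the natural-language question as the user turn."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        "temperature": 0.0,  # deterministic decoding helps SQL reproducibility
    }

def generate_sql(question: str, system_prompt: str,
                 base_url: str = "http://127.0.0.1:8080") -> str:
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json=build_payload(question, system_prompt),
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```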
```python
run_test('Qwen3.5-35B-A3B')
```

```
├── SLM_SQL_test.ipynb    # Main benchmark notebook
├── semantic_model.txt    # Semantic model definition (system prompt)
├── questions.json        # 20 test questions
├── log/                  # Test results (JSON logs per model run)
└── LICENSE               # MIT License
```
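The per-run JSON logs in `log/` feed the speed vs. accuracy chart. A sketch of aggregating them with pandas; note that the field names (`model`, `accuracy`, `tokens_per_sec`) are assumptions about the log schema, not the actual format:

```python
import glob
import json
import pandas as pd

def load_runs(log_dir: str = "log") -> pd.DataFrame:
    """Collect one row per model run from the JSON logs.

    The field names below are assumptions about the log schema.
    """
    rows = []
    for path in sorted(glob.glob(f"{log_dir}/*.json")):
        with open(path) as f:
            run = json.load(f)
        rows.append({
            "model": run.get("model"),
            "accuracy": run.get("accuracy"),
            "tokens_per_sec": run.get("tokens_per_sec"),
        })
    return pd.DataFrame(rows)
```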
- **Never join fact tables directly** — Uses a CTE pattern with `FULL OUTER JOIN` to combine sales and returns
- **Feedback loop** — Models get a chance to self-correct their SQL after seeing the initial results
- **Column-value matching** — Judging ignores column names and matches by value similarity with a numeric tolerance (0.5%), handling cases where models return extra columns
- **Invisible baseline** — The reference answers are excluded from the charts; only evaluated models are shown
MIT
