A benchmark for Small Language Models (SLMs) on SQL generation tasks. It uses a well-defined semantic model with verified SQL queries and sample values to evaluate how accurately local LLMs can generate SQL from natural language questions.
- **Semantic Model** — A detailed schema definition (`semantic_model.txt`) describing a TPC-DS star schema (`store_sales`, `store_returns`, `date_dim`, `store`, `customer`, `item`) with measures, dimensions, anti-patterns, and verified example queries
- **20 Test Questions** — Ranging from simple aggregations to complex multi-year comparisons with return rates (`questions.json`)
- **Local LLM Inference** — Models run locally via a llama.cpp server, quantized to Q4_K_M (4-bit)
- **Automated Judging** — Claude Opus acts as an invisible judge, comparing model outputs against a verified baseline using column-value matching with numeric tolerance
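The column-value matching used in judging can be sketched roughly as follows. The helper names and exact string normalization here are assumptions, but the 0.5% relative tolerance mirrors the benchmark's own description:

```python
def values_match(expected, actual, rel_tol=0.005):
    """Cell-level comparison: numbers match within a 0.5% relative tolerance;
    anything non-numeric falls back to a normalized string comparison."""
    try:
        e, a = float(expected), float(actual)
        if e == 0:
            return a == 0
        return abs(a - e) / abs(e) <= rel_tol
    except (TypeError, ValueError):
        return str(expected).strip().lower() == str(actual).strip().lower()

def column_found(baseline_col, candidate_cols, rel_tol=0.005):
    """A baseline column counts as matched if ANY column of the model output
    agrees with it row by row; column names and extra columns are ignored."""
    return any(
        len(col) == len(baseline_col)
        and all(values_match(e, a, rel_tol) for e, a in zip(baseline_col, col))
        for col in candidate_cols
    )
```

Matching by value rather than by name is what lets a model answer survive cosmetic differences such as renamed or additional columns.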
- GPU: NVIDIA RTX 2000 Mobile (4GB VRAM)
- RAM: 32GB
- Inference: llama.cpp b8157
| Model | Parameters | Type | Quantization |
|---|---|---|---|
| Qwen3.5-35B-A3B | 35B (3B active) | MoE | Q4_K_M |
| Qwen3-30B-A3B | 30B (3B active) | MoE | Q4_K_M |
| Qwen3-4B | 4B | Dense | Q4_K_M |
| GLM-4.7B Flash | 4.7B | Dense | Q4_K_M |
- Python with `duckdb`, `pandas`, `matplotlib`, `requests`, `psutil`
- llama.cpp server binary
- GGUF model files
- Configure model paths in `MODEL_CONFIGS` inside `SLM_SQL_test.ipynb`
- Run the notebook — the `run_test()` function handles everything:
  - Starts llama-server with the selected model
  - Runs all 20 SQL questions with a feedback loop
  - Validates results against the baseline
  - Plots the speed vs. accuracy chart
  - Stops the server
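For each question, the notebook talks to the local server over HTTP. A minimal sketch of a single call, assuming llama-server's OpenAI-compatible `/v1/chat/completions` endpoint on the default port 8080; the function names are illustrative, not the notebook's actual API:

```python
import requests

def build_payload(question: str, system_prompt: str) -> dict:
    """Chat-completion request body: semantic model as the system prompt,
    the natural-language question as the user turn."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        "temperature": 0.0,  # deterministic decoding helps SQL reproducibility
    }

def generate_sql(question: str, system_prompt: str,
                 base_url: str = "http://127.0.0.1:8080") -> str:
    resp = requests.post(
        f"{base_url}/v1/chat/completions",
        json=build_payload(question, system_prompt),
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```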
```python
run_test('Qwen3.5-35B-A3B')
```

```
├── SLM_SQL_test.ipynb    # Main benchmark notebook
├── semantic_model.txt    # Semantic model definition (system prompt)
├── questions.json        # 20 test questions
├── log/                  # Test results (JSON logs per model run)
└── LICENSE               # MIT License
```
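The per-run JSON logs in `log/` feed the speed vs. accuracy chart. A sketch of aggregating them with pandas; note that the field names (`model`, `accuracy`, `tokens_per_sec`) are assumptions about the log schema, not the actual format:

```python
import glob
import json
import pandas as pd

def load_runs(log_dir: str = "log") -> pd.DataFrame:
    """Collect one row per model run from the JSON logs.

    The field names below are assumptions about the log schema.
    """
    rows = []
    for path in sorted(glob.glob(f"{log_dir}/*.json")):
        with open(path) as f:
            run = json.load(f)
        rows.append({
            "model": run.get("model"),
            "accuracy": run.get("accuracy"),
            "tokens_per_sec": run.get("tokens_per_sec"),
        })
    return pd.DataFrame(rows)
```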
- **Never join fact tables directly** — Uses a CTE pattern with `FULL OUTER JOIN` to combine sales and returns
- **Feedback loop** — Models get a chance to self-correct their SQL after seeing the initial results
- **Column-value matching** — Judging ignores column names and matches by value similarity with a numeric tolerance (0.5%), handling cases where models return extra columns
- **Invisible baseline** — The reference answers are excluded from the charts; only evaluated models are shown
MIT
