Math
arXiv:2504.04823Detects arithmetic accuracy drift in complex reasoning chains.
Scientific degradation detection. 17 research-backed probes.
One CLI command.
"General benchmarks miss silent degradation. Models get quantized, fine-tuned, or simply drift. We detect the rot."
Organized into three tiers based on signal-to-noise ratio and cost.
Detects arithmetic accuracy drift in complex reasoning chains.
Measures vocabulary collapse and TTR (Type-Token Ratio).
Tracks time-to-first-token (TTFT) and generation latency.
Syntax validity and logic checks for Python/Rust generation.
Measures hallucination rate on verifiable world knowledge.
Structured output adherence and schema validation failures.
Self-consistency across multiple generations of same prompt.
Identifies specific quantization artifacts and model signatures.
Needle-in-haystack retrieval and context window adherence.
Tool selection and classification accuracy.
First-order logic and syllogistic reasoning capabilities.
Advanced zero-shot capability fingerprinting.
Cross-lingual reasoning and translation fidelity.
Try the interactive demo on HuggingFace Spaces.
🤗 Open Space