A benchmark for LLMs and related tools that reason about and answer questions about software logic. We present results comparing the performance of LLM-only answers with those of LLM + CodeLogician (ImandraX).
Given a Python function and questions about its behavior (e.g., "How many distinct output scenarios exist?"), we compare:
- LLM-only: Answers generated by prompting LLMs with only the source code
- LLM + Automated Reasoning: Answers generated by CodeLogician, a neurosymbolic agentic governance framework
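For illustration, here is a hypothetical example of the kind of Python function and question a benchmark item contains. The function `shipping_fee` and the question text are invented for this sketch; they are not taken from the `examples/` directory.

```python
# Hypothetical benchmark-style model (not an actual examples/ entry).

def shipping_fee(weight_kg: float, is_member: bool) -> float:
    """Compute a shipping fee with a heavy-item surcharge and a member discount."""
    fee = 5.0
    if weight_kg > 20.0:
        fee += 10.0   # heavy-item surcharge
    if is_member:
        fee *= 0.5    # 50% member discount
    return fee

# A question in the style of questions.yaml might be:
#   "How many distinct output scenarios (branch combinations) does shipping_fee have?"
# Here there are 4: (heavy? yes/no) x (member? yes/no), with fees 5.0, 15.0, 2.5, 7.5.
```

The repository is organized as follows: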
├── examples/                    # 50 benchmark examples
│   └── <example_name>/
│       ├── model.py             # Python source code
│       ├── model.iml            # IML specification + analysis (region decomp/VG)
│       ├── questions.yaml       # 3 questions about the code
│       ├── answer_CL.yaml       # Answers from CodeLogician
│       ├── answer_LLM.yaml      # Answers from pure LLM
│       ├── metrics.yaml         # Evaluation scores
│       └── cl_analysis/         # Raw CodeLogician outputs (JSON)
├── analysis/                    # Aggregated results and plots
├── generate_CL_answers/         # Prompts for generating CodeLogician answers (see README)
├── generate_llm_answers.py      # Generate pure LLM responses
├── generate_metrics.py          # Evaluate and compare answers
└── aggregate_metrics.py         # Aggregate and visualize results
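Below is a minimal sketch of how the evaluation flow fits together, assuming the directory layout above. The helper names (`load_example`, `score_example`, `aggregate`) are illustrative only and are not the actual APIs of `generate_llm_answers.py`, `generate_metrics.py`, or `aggregate_metrics.py`.

```python
# Sketch of the per-example read / score / aggregate loop (illustrative, not the real scripts).
import pathlib
import yaml  # PyYAML

EXAMPLES_DIR = pathlib.Path("examples")

def load_example(example_dir: pathlib.Path) -> dict:
    """Read the source, questions, and both answer sets for one benchmark example."""
    return {
        "source": (example_dir / "model.py").read_text(),
        "questions": yaml.safe_load((example_dir / "questions.yaml").read_text()),
        "answer_cl": yaml.safe_load((example_dir / "answer_CL.yaml").read_text()),
        "answer_llm": yaml.safe_load((example_dir / "answer_LLM.yaml").read_text()),
    }

def score_example(example: dict) -> dict:
    """Placeholder for the per-question comparison done in generate_metrics.py."""
    # In the real pipeline each answer is judged against the question (and the IML
    # analysis); here we only report how many questions each answer set covers.
    return {
        "n_questions": len(example["questions"]),
        "cl_answered": len(example["answer_cl"]),
        "llm_answered": len(example["answer_llm"]),
    }

def aggregate() -> list[dict]:
    """Rough analogue of aggregate_metrics.py: collect per-example scores."""
    results = []
    for example_dir in sorted(EXAMPLES_DIR.iterdir()):
        if example_dir.is_dir():
            results.append({"example": example_dir.name, **score_example(load_example(example_dir))})
    return results

if __name__ == "__main__":
    for row in aggregate():
        print(row)
```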