imandra-ai/code-logic-bench

code-logic-bench

A benchmark for evaluating how well LLMs and related tools reason about software logic and answer questions about it. It includes results comparing LLM-only performance with LLM+CodeLogician (ImandraX).

Overview

Given a Python function and questions about its behavior (e.g., "How many distinct output scenarios exist?"), we compare:

  • LLM-only: Answers generated by prompting LLMs with only the source code
  • LLM + Automated Reasoning: Answers generated by CodeLogician, a neurosymbolic agentic governance framework
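
To make the example question concrete, here is a minimal sketch of the kind of function and question involved. The function `classify_order` and the helper `distinct_outputs` are hypothetical and are not taken from the benchmark's examples. The brute-force enumeration is only an illustration of what "distinct output scenarios" means over a small sample of inputs; it is not how CodeLogician or an LLM arrives at an answer.

```python
from itertools import product


def classify_order(quantity: int, is_member: bool) -> str:
    """Hypothetical function under analysis (not from the benchmark)."""
    if quantity <= 0:
        return "rejected"
    if quantity > 100:
        return "manual_review"
    if is_member:
        return "discounted"
    return "standard"


def distinct_outputs(quantities, member_flags):
    """Collect every output seen over a small, finite sample of inputs."""
    return {classify_order(q, m) for q, m in product(quantities, member_flags)}


# Sample quantities on both sides of each branch boundary (0 and 100).
scenarios = distinct_outputs(range(-5, 151), [False, True])
print(sorted(scenarios))
print(len(scenarios))
```

Sampling like this only covers the inputs it tries, so it can miss a branch that a chosen range never reaches. That gap is one reason to compare answers from source-only prompting against answers backed by automated reasoning.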

Structure

├── examples/                    # 50 benchmark examples
│   └── <example_name>/
│       ├── model.py             # Python source code
│       ├── model.iml            # IML specification + analysis (region decomp/VG)
│       ├── questions.yaml       # 3 questions about the code
│       ├── answer_CL.yaml       # Answers from CodeLogician
│       ├── answer_LLM.yaml      # Answers from pure LLM
│       ├── metrics.yaml         # Evaluation scores
│       └── cl_analysis/         # Raw CodeLogician outputs (JSON)
├── analysis/                    # Aggregated results and plots
├── generate_CL_answers/         # Prompts for generating CodeLogician answers (see README)
├── generate_llm_answers.py      # Generate pure LLM responses
├── generate_metrics.py          # Evaluate and compare answers
└── aggregate_metrics.py         # Aggregate and visualize results
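
The layout above can be checked mechanically. The following is a minimal sketch, assuming the per-example file names listed in the tree; the helpers `missing_entries` and `check_examples` are hypothetical and are not part of the repository's scripts. It only tests for the presence of files and does not parse them, since the YAML schemas are not described here.

```python
from pathlib import Path

# Names copied from the directory tree; the helpers below are hypothetical.
EXPECTED_FILES = [
    "model.py",
    "model.iml",
    "questions.yaml",
    "answer_CL.yaml",
    "answer_LLM.yaml",
    "metrics.yaml",
]
EXPECTED_DIRS = ["cl_analysis"]


def missing_entries(example_dir: Path) -> list[str]:
    """Return the expected files and directories absent from one example."""
    missing = [n for n in EXPECTED_FILES if not (example_dir / n).is_file()]
    missing += [n + "/" for n in EXPECTED_DIRS if not (example_dir / n).is_dir()]
    return missing


def check_examples(root: Path) -> dict[str, list[str]]:
    """Map each example directory name under root to its missing entries."""
    return {d.name: missing_entries(d) for d in sorted(root.iterdir()) if d.is_dir()}


if __name__ == "__main__":
    root = Path("examples")  # assumes the script runs from the repository root
    if root.is_dir():
        for name, missing in check_examples(root).items():
            print(name, "OK" if not missing else f"missing: {missing}")
```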
