Inspiration

While working with Retrieval-Augmented Generation (RAG), I noticed something uncomfortable:

Even well-built pipelines miss critical information.

Similarity-based retrieval often produces answers that look correct but are incomplete. There is no built-in way to detect what is missing or to verify that a change actually improves the result.

From a software architect's perspective, this is a problem. Traditional systems are predictable and testable. RAG systems can fail silently.

The question became:

Can a system detect its own mistakes and improve within the same execution?


What it does

EvoContext runs the same question twice and improves the result.

Run 1
Retrieves context using similarity, generates an answer, and produces a score (example: 62/100).

Evaluation
Identifies missing facts and produces targeted feedback.

Run 2
Uses that feedback to retrieve better context, generates a new answer, and scores higher (example: 84/100).

The system shows exactly what changed and why.
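In code, the two-run flow could be sketched roughly like this. Everything here is a stand-in: EvoContext uses Qdrant retrieval and an LLM, while this sketch stubs both with keyword matching and a tiny corpus, and the facts and scores are illustrative. Only the feedback mechanics are the point:

```python
# Minimal sketch of the two-run loop. retrieve/generate are stand-ins
# for the real vector search and LLM call; the fact list is illustrative.

EXPECTED_FACTS = ["30 days", "shipping"]  # hypothetical evaluation profile

CORPUS = [
    "Refunds: the refund window is 30 days from delivery.",
    "Shipping fees are excluded from refunds.",
    "We ship worldwide within 5 business days.",
]

def retrieve(query: str) -> list[str]:
    """Stand-in for similarity search: crude word overlap with the query."""
    terms = set(query.lower().split())
    return [d for d in CORPUS if terms & set(d.lower().split())]

def generate(context: list[str]) -> str:
    """Stand-in for the LLM: the answer is just the retrieved context."""
    return " ".join(context)

def evaluate(answer: str) -> tuple[int, list[str]]:
    """Rule-based check: which expected facts are missing from the answer?"""
    missing = [f for f in EXPECTED_FACTS if f not in answer.lower()]
    score = round(100 * (1 - len(missing) / len(EXPECTED_FACTS)))
    return score, missing

# Run 1: similarity alone misses the shipping document.
query = "what is the refund window"
score1, missing = evaluate(generate(retrieve(query)))

# Run 2: feedback-driven query expansion appends the missing facts.
query2 = query + " " + " ".join(missing)
score2, _ = evaluate(generate(retrieve(query2)))
print(score1, score2)  # → 50 100
```

The essential property is that the evaluation output (the `missing` list) is machine-readable, so Run 2 can consume it directly instead of relying on a human to notice the gap.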


How we built it

EvoContext combines:

  • Vector retrieval (Qdrant)
  • Structured chunking with overlap
  • Rule-based evaluation (no LLM judge)
  • Feedback-driven query expansion
  • Full trace logging for every run

The evaluation layer defines expected facts and scoring rules, making the outcome deterministic and reproducible.
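One possible shape for such a deterministic evaluation profile; the field names, patterns, and weights below are illustrative assumptions, not EvoContext's actual schema:

```python
# Illustrative evaluation profile: expected facts, each with a match
# pattern and a weight. Same answer in, same score out: no LLM judge.
import re

PROFILE = [
    {"fact": "refund window",  "pattern": r"\b30\s+days\b", "weight": 60},
    {"fact": "shipping costs", "pattern": r"\bshipping\b",  "weight": 40},
]

def score_answer(answer: str) -> tuple[int, list[str]]:
    text = answer.lower()
    total = sum(rule["weight"] for rule in PROFILE)
    earned, missing = 0, []
    for rule in PROFILE:
        if re.search(rule["pattern"], text):
            earned += rule["weight"]
        else:
            missing.append(rule["fact"])
    return round(100 * earned / total), missing

print(score_answer("Refunds are accepted within 30 days."))
# → (60, ['shipping costs'])
```

Because the rules are plain data, a profile can be versioned, diffed, and reviewed like any other artifact, which is what makes the scores reproducible across runs.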


Challenges we ran into

  • Designing a scenario where Run 1 fails consistently without feeling artificial
  • Controlling LLM variability (temperature 0, strict output structure)
  • Preventing false positives in evaluation (negation handling, rule precision)
  • Making improvement measurable instead of subjective
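To illustrate the negation problem: a naive substring rule would count "we do not refund shipping" as evidence that a shipping fact is covered. A crude guard, purely as a sketch of the idea rather than EvoContext's actual rule engine:

```python
import re

# Illustrative negation cues; a real rule set would be domain-tuned.
NEGATIONS = {"not", "no", "never", "without", "excludes", "excluded"}

def fact_present(answer: str, pattern: str) -> bool:
    """Match a fact pattern, but reject sentences where a negation word
    precedes the match -- a crude guard against false positives."""
    for sentence in re.split(r"[.!?]", answer.lower()):
        m = re.search(pattern, sentence)
        if m and not (NEGATIONS & set(sentence[:m.start()].split())):
            return True
    return False

assert fact_present("Shipping is refunded in full.", r"\bshipping\b")
assert not fact_present("We do not refund shipping.", r"\bshipping\b")
```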

Accomplishments that we're proud of

A demo where improvement is visible, measurable, and reproducible.

The system does not just generate a better answer. It shows:

  • which documents were retrieved
  • which facts were missing
  • how retrieval changed
  • how the score improved

What we learned

  • Similarity alone does not guarantee completeness
  • Evaluation must be explicit and rule-based to be reliable
  • Observability is required to debug RAG systems
  • Feedback loops can turn static pipelines into adaptive systems

What's next for EvoContext

  • Apply the approach to regulated domains (legal, medical, operations)
  • Extend beyond two runs into continuous improvement loops
  • Build tooling for authoring evaluation profiles at scale
  • Integrate as a component inside existing RAG systems

The goal is simple: systems that can measure their own quality and improve it.
