Architecture Document · December 2025

Neuro-Symbolic Architectures for Audit-Grade Compliance AI

How to Use LLMs Without Trusting Them

Executive Summary

LLMs hallucinate 58-88% of the time on verifiable legal questions. RAG reduces this to 17-33%, but doesn't eliminate it.

The core problem isn't model capability—it's architecture. When you ask an LLM to assess compliance from retrieved documents, it must reason across hierarchy, exceptions, and evidence simultaneously. Errors compound. You can't tell which judgment was wrong.

The alternative: decompose complex questions into narrow sub-queries against explicit graph structures that encode regulatory hierarchy and organizational context. The LLM handles bounded interpretation at the leaves. Decomposition and composition stay deterministic and reviewable. This enables continuous re-evaluation as things change, proactive gap detection, and audit-grade traceability.

The paper draws on recent research—mathematical proofs of embedding capacity limits, empirical studies of hallucination in legal AI, and theoretical frameworks for why retrieval degrades at scale—to explain the architecture behind ForgeIQX, a continuous compliance platform for regulated environments.

By "neuro-symbolic," we do not mean rule engines with learned embeddings. We mean deterministic symbolic structure with neural interpretation strictly bounded at the leaves.

1. Why Probabilistic Systems Fail at Compliance

1.1 Hallucination as a Structural Problem

Empirical studies demonstrate that hallucination in legal and compliance contexts is not an edge case—it's inherent to how LLMs generate text. These models predict what plausible output looks like, not what's actually true.

Large Legal Fictions (Dahl et al., 2024) reports hallucination rates of 58–88% when LLMs answer verifiable legal questions (ranging from 58% for GPT-4 to 88% for Llama 2), with incorrect descriptions of court holdings in at least 75% of evaluated cases.

These errors persist even when models are prompted carefully and asked about well-defined factual matters.

The implication is not that LLMs are unusable, but that letting them generate answers freely is incompatible with audit-grade requirements. Incorrect answers cannot be tolerated or averaged away. The architectural question becomes: how do we constrain LLM involvement to tasks where some interpretation error is acceptable, while ensuring that compliance determinations rest on explicit, reviewable logic?

1.2 RAG Doesn't Fix It

Further work from Stanford RegLab (Magesh et al., 2024, "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools") evaluates retrieval-augmented legal research tools and finds that they still hallucinate in 17–33% of responses. The study also identifies a structural failure mode termed contra-factual bias: when a query embeds a mistaken premise, the tool tends to accept the premise and build supporting analysis around it rather than correct it.

Importantly, these failures occur even when relevant source material is retrieved. This indicates that retrieval alone does not enforce correct reasoning—when the LLM still has to figure out which rules apply and how, it gets things wrong.

In compliance settings, where user framing errors are common and consequences are asymmetric, this behavior represents a material risk. The failure mode is not retrieval quality but reasoning scope: an LLM asked to holistically assess compliance from retrieved documents must perform unbounded reasoning across hierarchy, exceptions, and evidence simultaneously. Errors in any dimension compound and become difficult to isolate.

1.3 Embeddings Can't Encode Hierarchy

Regulatory applicability is governed by hierarchy, scope, and exception precedence—not semantic similarity.

Research shows that embedding-based retrieval frequently fails to surface controlling clauses located outside semantically proximate text, and misses exception and precedence rules defined elsewhere in the regulatory structure.

Recent work establishes that these limitations are not implementation failures but fundamental architectural constraints:

Mathematical Capacity Limits. Weller et al. (Google DeepMind, 2025) prove that the number of top-k document subsets retrievable by any query is bounded by embedding dimensionality. This is not an empirical observation but a geometric constraint. Their results indicate that a 512-dimension model mathematically fails beyond approximately 500,000 documents, a 768-dimension model (common in production) fails beyond approximately 1.7 million documents, and a 1024-dimension model fails beyond approximately 4 million documents.

Positional Degradation. Liu et al. (Stanford, 2024) demonstrate that even when relevant information is retrieved, LLMs exhibit significant performance degradation—20-30% accuracy loss—when that information appears in the middle of long contexts rather than at the beginning or end. This "Lost in the Middle" phenomenon compounds retrieval limitations in multi-document compliance scenarios.

Empirical Precision Loss. Industry benchmarks (EyeLevel.ai, 2024) show retrieval precision degrading 10-12% per 100,000 documents in production RAG systems.

These limitations are architectural rather than model-specific. Semantic similarity does not encode legal hierarchy, and retrieval alone cannot reconstruct it reliably. The question is not whether better embeddings will emerge, but whether semantic similarity is the right primitive for compliance reasoning. The evidence suggests it is not.

1.4 Why Retrieval Breaks at Scale

The term "semantic collapse" has emerged in ML literature to describe the loss of discriminative power in high-dimensional embedding spaces at scale.

Wyss (2025) formalizes this phenomenon via the Semantic Characterization Theorem, demonstrating that continuous embedding manifolds collapse into discrete semantic basins—a phase transition where the apparent continuous nature of vector spaces resolves into functionally discrete regions.

In retrieval contexts, semantic collapse manifests as compression of similarity scores (relevant and irrelevant documents converge toward identical cosine similarity), loss of discriminative boundaries between semantic clusters, and increasing retrieval noise as corpus size grows.

This theoretical framework explains why RAG systems that perform well at demonstration scale (thousands of documents) degrade at production scale (hundreds of thousands to millions). The collapse is not a failure of implementation but a geometric property of the embedding space itself.

For compliance systems, semantic collapse represents an architectural constraint: vector-based retrieval cannot reliably surface controlling clauses, exceptions, or hierarchical relationships at the scale required for enterprise regulatory corpora. This constraint motivates the shift from retrieval-centric architectures to graph-structured decomposition, where hierarchy and applicability are encoded explicitly rather than inferred from semantic proximity.
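The score-compression aspect of this phenomenon is easy to illustrate with random vectors standing in for embeddings. This is a toy sketch (not the Wyss construction): as the number of unrelated "documents" grows, the best irrelevant match creeps upward, compressing the margin available to genuinely relevant documents.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_distractor_similarity(dim: int, n_docs: int) -> float:
    """Max cosine similarity between a fixed query and n_docs random 'documents'."""
    q = rng.normal(size=dim)
    q /= np.linalg.norm(q)
    docs = rng.normal(size=(n_docs, dim))
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)
    return float(np.max(docs @ q))

# The best *irrelevant* match rises with corpus size even though every
# document here is pure noise -- similarity scores converge at scale.
for n in (1_000, 10_000, 50_000):
    print(f"{n:>6} docs -> max distractor similarity {max_distractor_similarity(128, n):.3f}")
```

The effect is purely geometric (concentration of measure in high dimensions), which is why it cannot be engineered away by better chunking or prompting.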


2. Separating Policy from Context

2.1 Two Graphs, Not One

Recent research increasingly supports separating Policy Graphs (representations of regulatory obligations, prohibitions, permissions, scope, and exceptions) from Context Graphs (representations of real-world facts, systems, events, and evidence).

GraphCompliance (Xu et al., 2025) formalizes this approach and evaluates it empirically, finding a 4.1–7.2 percentage point F1 improvement over RAG and GraphRAG baselines on GDPR compliance tasks, along with improved interpretability through explicit reasoning paths produced by graph traversal.

The effectiveness of this separation lies in making reasoning failures observable. When policy logic and contextual facts are conflated, errors manifest as plausible narratives. When separated, errors surface as missing edges, violated constraints, or unresolved mappings—enabling deterministic review.

The separation enables a critical architectural property: complex compliance questions decompose into targeted sub-queries against explicit graph structures. Each sub-query receives enriched context from graph traversal, constraining the LLM to narrow interpretive tasks rather than open-ended reasoning. The decomposition and composition logic remains deterministic and reviewable; the LLM handles bounded semantic interpretation at the leaves.

Example: Query Decomposition

Consider a compliance question like "Does Certificate Authority X satisfy Baseline Requirement 4.9.1 given their current CPS?" In an LLM-centric architecture, the model receives retrieved documents and reasons holistically across scope, applicability, exceptions, evidence, and sufficiency simultaneously. In a graph-structured hybrid architecture, this decomposes into:

  1. Is BR 4.9.1 applicable to CA X's certificate types? (graph traversal + bounded LLM interpretation)
  2. Does CA X's CPS make a representation regarding 4.9.1? (graph traversal + bounded LLM interpretation)
  3. Does the representation satisfy the requirement's conditions? (graph traversal + bounded LLM interpretation)
  4. Are there applicable exceptions? (deterministic graph query)
  5. Composition: assemble answers into compliance determination (deterministic logic)

Each sub-query is narrow, context-rich, and independently verifiable. The LLM never performs unbounded compliance reasoning; it interprets well-scoped questions with relevant context already assembled.
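The five-step decomposition above can be written down as data, which is the point: which steps are deterministic graph work and which involve a bounded LLM call is explicit and reviewable. A minimal sketch, with illustrative identifiers:

```python
# Each sub-query is declared with its execution mode; "graph" steps never
# touch an LLM, "graph+llm" steps bound the LLM to one narrow interpretation.
SUB_QUERIES = [
    ("applicable_to_cert_types", "graph+llm"),
    ("cps_makes_representation", "graph+llm"),
    ("representation_satisfies", "graph+llm"),
    ("applicable_exceptions",    "graph"),  # deterministic graph query
]

def compose(answers: dict) -> str:
    """Deterministic composition over sub-answers (no LLM involved)."""
    if answers["applicable_exceptions"]:
        return "exception applies"
    if not answers["applicable_to_cert_types"]:
        return "not applicable"
    if answers["cps_makes_representation"] and answers["representation_satisfies"]:
        return "satisfied"
    return "gap: representation missing or insufficient"

print(compose({
    "applicable_to_cert_types": True,
    "cps_makes_representation": True,
    "representation_satisfies": True,
    "applicable_exceptions": False,
}))  # satisfied
```

Because `compose` is ordinary code rather than generation, an auditor can read the exact rule that turned sub-answers into a determination.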

2.2 LLMs Fail at Rule Application

LegalBench (Guha et al., NeurIPS 2023) evaluates LLM performance across 162 legal reasoning tasks. Results show that pure LLM systems systematically fail tasks requiring rule application, and that hierarchical statutory reasoning is a consistent failure mode.

These findings reinforce the conclusion that compliance reasoning requires explicit structural representations—not just LLM inference. The failure pattern is instructive: LLMs perform well on semantic tasks (does this text discuss topic X?) but poorly on structural tasks (does exception Y override requirement Z given condition W?). This asymmetry suggests an architectural division of labor: use LLMs for the former, explicit structures for the latter.

2.3 Graphs Constrain the LLM

Surveys and empirical studies in the semantic web and NLP literature converge on the conclusion that knowledge graphs mitigate hallucination by constraining the generation space, reusing validated facts without retraining, and improving consistency across repeated queries.

These benefits arise from structure, not from increased model capacity. Where embedding-based retrieval asks "what is semantically similar?", graph-based retrieval asks "what is structurally connected?"—a fundamentally different operation that preserves hierarchy, exceptions, and applicability logic.

But the deeper architectural benefit is query decomposition. A compliance determination that might require an LLM to reason across hierarchy, exceptions, and evidence simultaneously instead becomes a series of narrow, context-rich sub-queries: Does this exception apply? Does this evidence satisfy this control? Is this requirement in scope? Each query bounds the LLM's interpretive task. Errors localize to specific sub-queries rather than disappearing into holistic generation.

This localization property is essential for audit-grade systems. When a compliance determination is incorrect, auditors and operators need to identify which specific judgment was wrong. In LLM-centric architectures, errors diffuse across the generation process. In decomposed architectures, errors trace to specific sub-queries with specific context, enabling targeted review and correction.


3. Adapting Without Retraining

3.1 Freeze the Model, Update the Tools

Recent research favors frozen-agent architectures, in which the LLM remains unchanged and adaptation occurs via updates to external tools and data structures.

AgentFly (2025) demonstrates that non-parametric approaches avoid high retraining costs, catastrophic forgetting, and latency between regulatory change and system correctness.

In regulated environments, the cost of retraining is not merely computational. Model updates introduce latency, reduce transparency, and complicate auditability. Tool- and graph-driven updates, by contrast, can be reviewed, versioned, and approved.

In this architecture, the LLM's role is bounded to semantic interpretation of well-structured queries, not compliance reasoning itself. The decomposition of complex questions into sub-queries, the traversal paths that enrich each query with context, and the composition of sub-answers into compliance determinations are all explicit, versioned, and reviewable. The LLM never sees an unbounded question like "is this organization compliant?" It sees targeted queries like "does paragraph 3.2.1 create an exception to requirement 4.1?" with the relevant text, hierarchy, and precedent already assembled.

This bounded role has important implications for system behavior: interpretation errors localize to individual sub-queries, adaptation happens through graph and tool updates rather than model changes, and determinations can be replayed because the decomposition and composition logic is explicit and versioned.

3.2 Context Beats Fine-Tuning

Work from Stanford, Berkeley, and industry partners shows that structured context and tool orchestration outperform fine-tuning for domain adaptation, and that context is more transparent, auditable, and reversible than parameter updates.

This approach supports deterministic review and controlled evolution, which fine-tuning does not.

The graph-structured approach represents an extreme form of context engineering: rather than fine-tuning the model to "know" compliance domains, the system assembles precisely the context needed for each interpretive sub-task. The model's general capabilities (language understanding, semantic comparison, text interpretation) are leveraged without modification; domain specificity lives entirely in the graph structure, traversal logic, and query decomposition.


4. Building the Knowledge Graph

4.1 LLMs Can't Build Ontologies

Multiple studies demonstrate that LLMs fail to independently generate complete domain ontologies, infer inconsistent or contradictory relationships, and accumulate noise over time without validation. These limitations are consistent across legal, ESG, and regulatory domains.

4.2 Start with the Schema

Ontology-guided pipelines constrain entity and relation types, enable schema and rule validation, and reduce extraction cost after initial ontology definition. Empirical evaluations show that this approach materially improves reliability, repeatability, and long-term graph quality.

For compliance domains, ontology-guided extraction has a specific benefit: the ontology encodes the structure of the regulatory domain itself. Requirements have applicability conditions. Controls have effectiveness criteria. Evidence has validity periods. These relationships are not discovered through extraction; they are defined by the regulatory framework and encoded in the ontology. Extraction then becomes a matter of instantiation—identifying which specific requirements, controls, and evidence exist—rather than relationship discovery.
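Schema-constrained extraction can be sketched in a few lines. The ontology here is hypothetical (entity and relation names are illustrative, not from any specific framework); the point is that an extractor's output is accepted only if it instantiates the schema:

```python
# Allowed entity types and relation signatures define the ontology.
ENTITY_TYPES = {"Requirement", "Control", "Evidence", "System"}

# relation -> (allowed subject type, allowed object type)
RELATION_SCHEMA = {
    "applies_to":   ("Requirement", "System"),
    "satisfied_by": ("Requirement", "Control"),
    "evidenced_by": ("Control", "Evidence"),
}

def validate_triple(subj_type: str, relation: str, obj_type: str) -> list:
    """Return schema violations for one extracted triple (empty list = valid)."""
    errors = []
    if subj_type not in ENTITY_TYPES:
        errors.append(f"unknown entity type: {subj_type}")
    if obj_type not in ENTITY_TYPES:
        errors.append(f"unknown entity type: {obj_type}")
    if relation not in RELATION_SCHEMA:
        errors.append(f"unknown relation: {relation}")
    else:
        want_s, want_o = RELATION_SCHEMA[relation]
        if subj_type != want_s:
            errors.append(f"{relation} expects subject {want_s}, got {subj_type}")
        if obj_type != want_o:
            errors.append(f"{relation} expects object {want_o}, got {obj_type}")
    return errors

print(validate_triple("Requirement", "satisfied_by", "Control"))  # []
print(validate_triple("Evidence", "applies_to", "Control"))       # two violations
```

Extraction noise then fails validation at ingestion time instead of accumulating silently in the graph.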


5. Reacting to Change

5.1 Change Is the Input

In practice, compliance systems do not fail because users ask the wrong questions. They fail because the environment changes. Regulations are amended, interpretations evolve, systems are updated, controls drift, evidence expires, and operational practices diverge from documented intent.

ForgeIQX is architected around the assumption that change, not inquiry, is the primary input. Rather than waiting for a user to ask whether the system remains compliant, ForgeIQX continuously monitors for changes across policy sources, operational artifacts, and contextual signals that may affect compliance state.

This change-orientation aligns with the decomposition architecture: when a regulatory requirement changes, the system identifies which sub-queries are affected, re-executes them with updated context, and recomposes the compliance determination. The scope of re-evaluation is bounded by the graph structure, not open-ended.

5.2 Automatic Re-evaluation

When a change is detected, ForgeIQX triggers deterministic re-evaluation across the policy and context graphs. This re-evaluation does not seek to produce a narrative answer. Instead, it computes the impact surface of the change: which requirements are newly applicable or no longer applicable, which mappings are invalidated or weakened, which evidence artifacts are stale, missing, or contradictory, and which systems or controls are affected downstream.

This impact analysis is structural and repeatable. It does not depend on prompt phrasing or model variability. The graph traversal that identifies affected nodes is deterministic; only the interpretive sub-queries at the leaves involve LLM inference, and these are narrow, context-rich, and independently re-executable.
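The impact-surface computation is a plain graph traversal. A minimal sketch, with hypothetical node identifiers (a real system would derive the edges from the policy and context graphs):

```python
from collections import deque

# Hypothetical dependency edges: changed node -> downstream nodes it affects.
AFFECTS = {
    "BR-4.9.1":              ["mapping:4.9.1->ctl-17"],
    "mapping:4.9.1->ctl-17": ["ctl-17"],
    "ctl-17":                ["evidence:ev-203", "system:crl-service"],
}

def impact_surface(changed: str) -> set:
    """Deterministic BFS: every node reachable from the changed node."""
    affected, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in AFFECTS.get(node, []):
            if nxt not in affected:
                affected.add(nxt)
                queue.append(nxt)
    return affected

# A change to requirement BR-4.9.1 bounds re-evaluation to exactly these nodes:
print(sorted(impact_surface("BR-4.9.1")))
```

Only the interpretive sub-queries attached to nodes inside this set need re-execution; everything outside it is provably unaffected by the change.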

5.3 Output Gaps, Not Verdicts

The primary output of ForgeIQX is not a declaration of compliance, but a set of identified gaps and deltas. These gaps may include unmapped or partially mapped requirements, controls whose effectiveness assumptions no longer hold, evidence that no longer satisfies policy constraints, and divergences between documented procedures and observed practice.

By surfacing gaps rather than conclusions, the system supports proactive remediation and prevents compliance failures before they materialize in audits or incidents. This design choice intentionally prioritizes surfacing uncertainty and misalignment over producing authoritative answers.

The gap-oriented output reflects an epistemically honest position: the system identifies what it can determine structurally (this requirement exists, this mapping is missing, this evidence has expired) and bounds its interpretive claims to specific, reviewable sub-queries. It does not claim to "know" whether an organization is compliant; it identifies the specific questions that must be answered and surfaces where answers are missing, inconsistent, or outdated.

5.4 Humans Make the Call

Although ForgeIQX relies on deterministic decomposition and composition, it is not designed as an autonomous compliance authority.

Human operators retain decision authority over acceptance or rejection of mappings, updates to ontologies and schemas, interpretation of ambiguous regulatory language, and final compliance attestations.

ForgeIQX's role is to surface structured, reviewable discrepancies and to make the consequences of change explicit. This preserves accountability, supports audit defensibility, and aligns with how regulated organizations actually operate.

The decomposition architecture supports human review at multiple levels: operators can review individual sub-query interpretations, mapping decisions, evidence assessments, or the overall composition logic. Each level has explicit inputs and outputs. This granularity is not available in systems where compliance determinations emerge from holistic LLM generation.

5.5 Versioned State Over Time

Compliance systems must reason across time, not just across documents. The question is not only "are we compliant now?" but "when did alignment exist, when did it degrade, how long did gaps persist, and what was the interpretation when decisions were made?"

Research on systems such as MemGPT reframes memory as a managed resource and a prerequisite for longitudinal reasoning. In compliance contexts, this translates to versioned graphs and historical state tracking as essential architectural components.

ForgeIQX maintains temporal state across five elements: Requirements (external obligations that evolve over time), Representations (organizational commitments that change with policy updates), Practices (operational reality as reflected in evidence), Change (the events that trigger re-evaluation), and Time (the dimension that audits compress and drift exploits). By tracking alignment across these elements over time, the system produces audit-grade records showing not just current posture, but how posture evolved and when decisions were made.

The temporal dimension interacts with decomposition: when auditors ask "what was our compliance posture on date X?", the system can reconstruct the graph state at that point, re-execute the decomposed queries against historical context, and produce a point-in-time assessment with full traceability. This reconstruction is possible because the decomposition and composition logic is explicit and versioned, not embedded in model weights.
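Point-in-time reconstruction reduces to replaying a version log. A minimal sketch, assuming an append-only, chronological log of state changes with hypothetical node names:

```python
from datetime import date

# Append-only, chronological version log: (effective_date, node, value).
VERSION_LOG = [
    (date(2024, 1, 1), "evidence:ev-203", "valid"),
    (date(2024, 6, 1), "mapping:4.9.1->ctl-17", "accepted"),
    (date(2025, 2, 1), "evidence:ev-203", "expired"),
]

def state_as_of(as_of: date) -> dict:
    """Reconstruct graph node state at a point in time (later entries win)."""
    state = {}
    for effective, node, value in VERSION_LOG:
        if effective <= as_of:
            state[node] = value
    return state

print(state_as_of(date(2024, 12, 31)))  # evidence still valid
print(state_as_of(date(2025, 3, 1)))    # evidence expired
```

Re-running the decomposed queries against `state_as_of(audit_date)` instead of current state yields the point-in-time assessment described above.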


6. Evidence Summary

Finding                                          | Representative Evidence
High hallucination rates in legal QA             | Dahl et al. (2024, Stanford RegLab): 58–88%
RAG does not eliminate hallucination             | Magesh et al. (2024, Stanford RegLab): 17–33%
Contra-factual bias persists under RAG           | Magesh et al. (2024, Stanford RegLab)
Embedding capacity has mathematical limits       | Weller et al. (2025, Google DeepMind)
Positional degradation in long contexts          | Liu et al. (2024, Stanford): 20–30% accuracy loss
Semantic collapse is formalizable                | Wyss (2025): Semantic Characterization Theorem
Dual-graph separation improves compliance accuracy | Xu et al. (2025, GraphCompliance): 4.1–7.2 pp F1
Tool updates outperform fine-tuning              | AgentFly (2025)
Ontologies required for decision-grade KGs       | Multiple studies (2024–2025)

Detailed quantitative results are available in the cited studies and are summarized here to emphasize architectural implications rather than benchmark competition.


7. Conclusion

LLMs will keep getting better at semantic understanding, retrieval, and tool use. That will help with interpretation, extraction, and explanation.

But the architectural requirements don't go away as models improve. Deterministic decomposition, explicit hierarchy handling, temporal versioning, replayable decisions, human accountability—these matter more when outputs are confident, not less. A wrong answer you trust is worse than a wrong answer you check.

Even if future models internalize regulatory structure, you still need to externalize it for auditability, replay, and accountability. The separation of semantic interpretation from deterministic structure will remain necessary for audit-grade systems.

The research doesn't support LLM-centric architectures for audit-grade compliance.

It supports architectures that:

  • decompose complex questions into narrow, independently verifiable sub-queries
  • encode hierarchy, scope, and exception precedence explicitly in graph structures
  • bound LLM involvement to semantic interpretation at the leaves
  • keep decomposition and composition logic deterministic, versioned, and reviewable
  • track state over time and keep humans in the decision loop

Taken together, these findings indicate that graph-structured hybrid architectures—with deterministic decomposition, bounded neural interpretation, and explicit composition logic—represent the emerging baseline for compliance AI in regulated environments.


References

Dahl, M., Magesh, V., Suzgun, M., & Ho, D. E. (2024). Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models. Journal of Legal Analysis, 16(1), 64-93. https://doi.org/10.1093/jla/laae003

EyeLevel.ai. (2024). Do Vector Databases Lose Accuracy at Scale? https://www.eyelevel.ai/post/do-vector-databases-lose-accuracy-at-scale

Guha, N., et al. (2023). LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. NeurIPS 2023.

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157-173. arXiv:2307.03172

Magesh, V., Surani, F., Dahl, M., Suzgun, M., Manning, C. D., & Ho, D. E. (2024). Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools. arXiv:2405.20362

Weller, O., Boratko, M., Naim, I., & Lee, J. (2025). On the Theoretical Limitations of Embedding-Based Retrieval. Google DeepMind. arXiv:2508.21038

Wyss, C. M. (2025). How to Tame Your LLM: Semantic Collapse in Continuous Systems. arXiv:2512.05162

Xu, et al. (2025). GraphCompliance: GDPR Compliance Checking with Dual Knowledge Graphs.


Appendix A: How This Differs from RAG

The following comparison clarifies architectural distinctions between common classes of AI systems used for compliance and governance. It is not intended as a product comparison, but as a structural taxonomy grounded in the research discussed in this paper.

[Figure 1 diagram: Architectural Comparison of Compliance AI Approaches. Panel (a), LLM-Centric Retrieval (query-driven, unbounded reasoning): User Query → Semantic Retrieval → Large Language Model → Natural Language Response; failure mode: confident but incorrect, errors diffuse across generation. Panel (b), Graph-Structured Hybrid (change-driven, bounded interpretation): Change Detected (policy, environment, evidence) → Policy Graph and Context Graph → Decomposition into targeted sub-queries with bounded LLM interpretation → Deterministic Composition → Gaps, Deltas, Violations (versioned, traceable) → Human Review and Decision (accept, reject, update, attest); failure mode: errors localize to specific sub-queries, composition logic is reviewable.]

Figure 1: (a) LLM-centric retrieval architectures follow a linear, query-driven flow. (b) Graph-structured hybrid architectures implement a continuous, change-driven cycle with human review.

Dimension                  | LLM-Centric Retrieval                          | Graph-Structured Hybrid
Where reasoning happens    | Inside the language model (unbounded)          | Deterministic decomposition with bounded LLM interpretation
What the LLM does          | Generates answers and summaries                | Interprets narrow, context-enriched sub-queries
Query structure            | Open-ended questions with retrieved context    | Decomposed sub-queries with assembled context
What triggers the system   | User asks a question                           | Change detected in policy, environment, or evidence
Hierarchy and exceptions   | Implicit, inferred by the LLM                  | Explicitly encoded in graph structure
When something's wrong     | Hard to find; errors diffuse across generation | Traces to a specific sub-query with specific context
Adapting to change         | Re-index or retrain                            | Update graph, ontology, and mappings
What you get out           | Natural-language responses                     | Identified gaps, deltas, and impacts
Audit story                | Narrative justification                        | Traceable decomposition and composition
Point-in-time reconstruction | Not supported                                | Replay queries against historical graph state

Appendix B: A Concrete Example

To illustrate the architectural difference, consider the compliance question:

"Does Organization X satisfy PCI-DSS Requirement 8.3.1 regarding multi-factor authentication?"

LLM-Centric Approach

  1. Retrieve documents mentioning "PCI-DSS," "8.3.1," "multi-factor authentication," and "Organization X"
  2. Assemble retrieved chunks into context
  3. Prompt LLM: "Based on the following documents, determine whether Organization X satisfies PCI-DSS Requirement 8.3.1…"
  4. LLM generates holistic response

Failure modes: LLM may miss exceptions, misinterpret scope, conflate requirements, or produce confident incorrect answers. Errors are difficult to isolate.

Graph-Structured Hybrid Approach

Decomposition (deterministic):

  1. Retrieve PCI-DSS 8.3.1 from policy graph with full hierarchy and applicable exceptions
  2. Identify Organization X's systems in scope for 8.3.1 from context graph
  3. For each in-scope system, identify mapped controls and evidence

Sub-queries (bounded LLM interpretation):

  • Q1: "Does the following system description indicate administrative access capability?" [system description + schema context]
  • Q2: "Does the following control documentation describe multi-factor authentication?" [control doc + requirement text]
  • Q3: "Does the following evidence demonstrate MFA implementation?" [evidence artifact + control specification]

Composition (deterministic):

  • If all in-scope systems have mapped controls with valid evidence → requirement satisfied
  • If any system lacks mapping → gap identified with specific system
  • If any evidence is expired or insufficient → gap identified with specific evidence

Failure modes: If Q2 is answered incorrectly, the error traces to that specific sub-query with that specific control documentation. Review and correction are targeted.
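The flow above can be sketched end to end. The `interpret` function stands in for a bounded LLM sub-query and is stubbed with fixed answers so that the deterministic composition is visible; all identifiers (systems, controls, evidence) are illustrative.

```python
SUB_ANSWERS = {  # (sub_query_id, subject) -> interpreted answer
    ("admin_access", "sys-A"):   True,
    ("admin_access", "sys-B"):   True,
    ("describes_mfa", "ctl-9"):  True,
    ("evidence_valid", "ev-41"): True,
}

def interpret(sub_query_id: str, subject: str) -> bool:
    """Stand-in for a narrow, context-enriched LLM sub-query (Q1-Q3 above)."""
    return SUB_ANSWERS[(sub_query_id, subject)]

def assess_8_3_1(systems, control_of, evidence_of):
    """Deterministic composition: every in-scope system needs a mapped
    MFA control backed by valid evidence."""
    gaps = []
    for sys_id in systems:
        if not interpret("admin_access", sys_id):
            continue  # not in scope for 8.3.1
        ctl = control_of.get(sys_id)
        if ctl is None or not interpret("describes_mfa", ctl):
            gaps.append(f"{sys_id}: no MFA control mapped")
            continue
        ev = evidence_of.get(ctl)
        if ev is None or not interpret("evidence_valid", ev):
            gaps.append(f"{sys_id}: evidence missing or invalid for {ctl}")
    return ("satisfied", []) if not gaps else ("gaps", gaps)

status, gaps = assess_8_3_1(
    systems=["sys-A", "sys-B"],
    control_of={"sys-A": "ctl-9"},   # sys-B has no mapped control
    evidence_of={"ctl-9": "ev-41"},
)
print(status, gaps)  # gaps ['sys-B: no MFA control mapped']
```

Each gap names the specific system and control involved, which is exactly the error-localization property the decomposed architecture is meant to provide.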