Numeric guardrails on LLM narratives: how Fin4Sight prevents hallucinated KPIs

An AI executive narrative is only useful if the numbers in it match the numbers in the underlying data. Most LLM-driven finance dashboards skip this check. Fin4Sight doesn't.

The promise of AI narratives in finance dashboards is straightforward: instead of reading 12 charts, you read a paragraph that explains what changed. The trouble is what happens when the model is confident and wrong.

How LLMs hallucinate finance figures

LLMs predict the next token based on the context they're given. Ask one to summarize AP aging data and it may write a fluent paragraph claiming “AP aging exceeded last month's by 23%” even when the actual number is 11%. The fluency is the problem — the narrative reads as though the number is grounded, but the number was just the most plausible-looking next token.

In a CFO dashboard, that kind of fluency is dangerous. A 23% claim where the truth is 11% gets quoted in the next board meeting. Auditors flag it later. Trust in the dashboard drops.

The numeric guardrail pattern

Fin4Sight handles this with a guardrail that runs after the LLM generates a narrative and before the narrative ships to the dashboard. Three steps:

  1. Extract every figure. Numbers, percentages, currency amounts, deltas — anything quantitative the narrative claims. Each figure is parsed out with a reference to which source aggregate it claims to summarize.
  2. Compare to the source. The platform recomputes each figure against the underlying aggregate that fed the LLM. The aggregate is real data — actual SAP figures, actual bank totals, actual SoD conflict counts.
  3. Reject on drift. If any figure drifts more than 5% from its source aggregate, the narrative is rejected. The dashboard shows a fallback (“AI summary unavailable for this period”) rather than a fabricated paragraph.

The user never sees a hallucinated figure. They either see a narrative whose numbers all reconcile to the underlying data, or they see no narrative.
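The three steps above can be sketched in a few dozen lines. Everything here — the function names, the regex, and especially the figure-to-aggregate matching (this sketch accepts a figure if it reconciles to *any* source aggregate, whereas a production system would map each figure to a specific one) — is an illustrative assumption, not Fin4Sight's actual implementation:

```python
import re

DRIFT_THRESHOLD = 0.05  # 5% default; the post notes this is configurable per tenant

def extract_figures(narrative: str) -> list[float]:
    """Step 1: pull every quantitative claim out of the narrative.
    Handles plain numbers, thousands separators, and percentages;
    currencies and scales (k/M) need more handling (see below)."""
    pattern = r'-?\d[\d,]*(?:\.\d+)?'
    return [float(m.replace(',', '')) for m in re.findall(pattern, narrative)]

def within_drift(claimed: float, actual: float,
                 threshold: float = DRIFT_THRESHOLD) -> bool:
    """Step 2: does the claimed figure reconcile to the source aggregate?"""
    if actual == 0:
        return claimed == 0
    return abs(claimed - actual) / abs(actual) <= threshold

def validate_narrative(narrative: str, source_aggregates: dict[str, float],
                       threshold: float = DRIFT_THRESHOLD) -> bool:
    """Step 3: reject the whole narrative if any figure fails to
    reconcile to at least one source aggregate."""
    for claimed in extract_figures(narrative):
        if not any(within_drift(claimed, actual, threshold)
                   for actual in source_aggregates.values()):
            return False  # caller shows the fallback instead
    return True

narrative = "AP aging exceeded last month's total by 23%, with fees of 41,200."
aggregates = {"ap_aging_delta_pct": 11.0, "total_fees": 41150.0}
validate_narrative(narrative, aggregates)  # False: 23 reconciles to nothing
```

On rejection, the caller renders the fallback message rather than the paragraph — the fail-closed behaviour described above.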

Where this applies inside Fin4Sight

The same guardrail pattern runs across the modules that generate LLM narratives:

  • Bank Intelligence Cockpit — total fees, idle-cash days, anomaly counts in the executive narrative
  • Access Intelligence — conflict counts, user counts, change-diff totals in the executive narrative
  • Executive Intelligence Cockpit — module-level summaries (FI, CO, MM, SD, PP, HR) with anomaly severities and aggregate figures

Every quantitative claim in any of these narratives goes through the validation pass. One pattern, applied consistently.

The 5% threshold

5% is the rounding threshold the platform uses today. It's tight enough to catch fabrication and loose enough to allow normal LLM rounding behaviour (“approximately 25%” instead of “24.7%”). Tighter thresholds reject too many narratives that would have been useful; looser thresholds let through too many that wouldn't have been.

The threshold is configurable per tenant if you want stricter validation for an audit-sensitive workflow. The default is 5%.
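Concretely, the check is relative drift: |claimed − actual| / |actual|. A quick sketch of how the two examples above land against the default threshold (values illustrative):

```python
def relative_drift(claimed: float, actual: float) -> float:
    """Relative drift of a claimed figure against its source aggregate."""
    return abs(claimed - actual) / abs(actual)

relative_drift(25.0, 24.7)  # ~0.012 -> normal rounding, passes at 5%
relative_drift(23.0, 11.0)  # ~1.09  -> fabrication, rejected
```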

What this means for your CFO and auditor

For the CFO: the narrative numbers always tie to the dashboard charts. No more “wait, that figure isn't right” in a board read-through.

For the auditor: every figure in an LLM-generated report is provably tied to a source aggregate. Open the report, open the source data, the numbers reconcile.

For the AP team: variance commentary, anomaly explanations, and reconciliation summaries all come with numbers you can trust without re-checking them.

Why most AI finance tools skip this check

It's hard. Extracting numbers reliably from LLM output isn't a one-liner — you need to handle currencies, scales (thousands vs. millions), percentages, and deltas. Recomputing the source aggregate means having a clean view of the real data the LLM was given. Rejecting narratives means fewer narratives ship, which makes the dashboard look quieter.
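To make the currency/scale point concrete, here is a minimal parser sketch for claims like "€1.2M", "41,200", or "23%". The regex and scale table are my assumptions for illustration; a real extractor also needs deltas, written-out scales ("1.2 million"), and locale-specific separators:

```python
import re

SCALE = {"k": 1e3, "m": 1e6, "bn": 1e9, "b": 1e9}

FIGURE_RE = re.compile(
    r'(?P<currency>[$€£])?\s*'            # optional currency symbol
    r'(?P<number>-?\d[\d,]*(?:\.\d+)?)\s*' # the number itself
    r'(?P<scale>bn|b|k|m)?\s*'             # optional scale suffix
    r'(?P<pct>%)?',                        # optional percent sign
    re.IGNORECASE,
)

def parse_figure(text: str) -> float:
    """Parse one quantitative claim into a comparable float."""
    m = FIGURE_RE.match(text.strip())
    if not m or not m.group("number"):
        raise ValueError(f"not a figure: {text!r}")
    value = float(m.group("number").replace(",", ""))
    if m.group("scale"):
        value *= SCALE[m.group("scale").lower()]
    return value

parse_figure("€1.2M")   # 1200000.0
parse_figure("41,200")  # 41200.0
parse_figure("23%")     # 23.0
```

Even this toy version shows why the extraction step is real engineering rather than a one-liner.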

It's also less impressive in a demo. A dashboard that ships every narrative looks more capable than one that rejects 10% of them. Until the rejected narrative is the one that would have lied.

If you're evaluating AI for finance

Ask the vendor what happens when the LLM gets a number wrong. The good answers describe a validation pass and a reject behaviour. The bad answers describe how unlikely it is for the model to get a number wrong, which means the vendor hasn't built a guardrail and is hoping you won't notice.

Hallucinated KPIs are a discipline problem, not a model problem. Models will hallucinate; the question is what your tool does about it.