Consistency Benchmark

Same question, same answer — every time.

Quality measures whether the answer is good. Consistency measures whether the same question, asked tomorrow, returns the same answer. If a CFO asks for the pipeline-health number twice and gets two different figures, the tool is not usable — no matter how good either individual answer was.

Amdahl reproducibility: 73%
Standard connectors: 28%
Improvement: +45 pts
Runs per path: 5

What was tested

We took one of the benchmark questions and ran it 5 times in a row against each data layer — same model (Claude Opus), same budget, same prompt — and measured how often the system gave back the same answer.
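The repeat-run setup can be sketched in a few lines. This is only an illustration of the idea, not the harness used for the benchmark: `ask` is a hypothetical callable standing in for "send the prompt through one data layer and return the answer", and returning `None` is an assumed way to signal a run that never finished.

```python
from __future__ import annotations
from typing import Callable, Optional


def repeat_runs(ask: Callable[[str], Optional[str]], prompt: str, runs: int = 5) -> list[Optional[str]]:
    """Send the identical prompt `runs` times and keep every output, finished or not."""
    return [ask(prompt) for _ in range(runs)]


def completed(outputs: list[Optional[str]]) -> int:
    """Count the runs that produced an answer (None marks an unfinished run)."""
    return sum(1 for out in outputs if out is not None)


# Hypothetical stand-in for a data layer: answers four times, then fails once.
_canned = iter(["answer A", "answer B", "answer A", "answer C", None])


def flaky_layer(prompt: str) -> Optional[str]:
    return next(_canned)


outputs = repeat_runs(flaky_layer, "What is our pipeline health?")
finished = completed(outputs)
```

Keeping unfinished runs in the list, rather than dropping them, is what lets a completion count such as "4 / 5 runs" be reported alongside the consistency scores.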

Standard connectors

The AI rebuilds its analysis from scratch on each run, with different filters, different transcript samples, and different bucketings. The answer drifts. In one run out of five, it never finished at all.

Amdahl

The data layer is already organized, so the AI takes the same path through it every time. The numbers stay anchored. The accounts stay anchored. The quotes stay anchored.

Reproducibility across 5 identical runs

Reproducibility | Amdahl | Standard connectors | Advantage
Produced a usable answer (did the system finish the task?) | 5 / 5 runs | 4 / 5 runs | +1 run
Numbers stayed the same across runs (same percentages, counts, and dollar figures every time) | 73% | 28% | +45 pts
Findings repeated across runs (did the same patterns and conclusions appear every time?) | 71% | 47% | +24 pts
Same accounts and customer quotes surfaced (did the brief cite the same companies and verbatim quotes?) | 62% | 38% | +24 pts
Tools and queries used to investigate (did the AI take the same path through the data each time?) | Identical every run | Different every run |

Same prompt → same numbers → an answer you can put in front of a stakeholder twice.

That is what makes the output trustworthy enough to ship to a board, a customer, or a sales rep who needs the same answer the next time they ask.

Methodology

We ran the same question five times on each path. An independent AI judge then clustered the findings, quotes, and named accounts across runs and reported what percentage appeared in every run.

  • Same model (Opus) for both paths
  • Same budget, same turn limit, same prompt
  • Independent AI judge compared outputs across runs
  • Measured: numeric consistency, finding overlap, entity overlap, tool-path identity
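One plausible reading of "what percentage appeared in every run" is sketched below. It assumes the judge has already reduced each run to a set of cluster labels. The function names, the choice of denominator (all distinct items seen in any run), and the handling of unfinished runs (left out of the input) are assumptions for illustration, not the benchmark's confirmed scoring rules.

```python
from __future__ import annotations


def share_in_every_run(runs: list[set[str]]) -> float:
    """Fraction of all distinct items that show up in every single run."""
    if not runs:
        return 0.0
    seen_anywhere = set().union(*runs)
    if not seen_anywhere:
        return 0.0
    seen_everywhere = set.intersection(*runs)
    return len(seen_everywhere) / len(seen_anywhere)


def same_tool_path(paths: list[list[str]]) -> bool:
    """True when every run used the same tools and queries in the same order."""
    return all(path == paths[0] for path in paths[1:])


# Hypothetical clustered findings from three runs.
findings = [
    {"pricing objection", "slow onboarding", "champion left"},
    {"pricing objection", "slow onboarding", "security review"},
    {"pricing objection", "slow onboarding"},
]
overlap = share_in_every_run(findings)  # 2 of 4 distinct findings appear in all runs

identical = same_tool_path([["search", "aggregate"], ["search", "aggregate"]])
drifted = same_tool_path([["search", "aggregate"], ["aggregate", "search"]])
```

Dividing by everything seen in any run penalizes a system for one-off findings as well as for missing ones. A different denominator, such as the items in the first run only, would give different percentages.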

The full dataset, including all 10 run outputs and the per-run scoring traces, is available on request.

Run this benchmark on your data

The benchmark can be run against a slice of your real conversations. Full methodology and raw outputs are available on request.