Same question, same answer — every time.
Quality measures whether the answer is good. Consistency measures whether the same question, asked tomorrow, returns the same answer. If a CFO asks for the pipeline-health number twice and gets two different figures, the tool is not usable — no matter how good either individual answer was.
What was tested
We took one of the benchmark questions and ran it 5 times in a row against each data layer — same model (Claude Opus), same budget, same prompt — and measured how often the system gave back the same answer.
Standard connectors
The AI rebuilds its analysis from scratch on each run: different filters, different transcript samples, different bucketings. The answer drifts, and in one run out of five it never finished at all.
Amdahl
The data layer is already organized, so the AI takes the same path through it every time. The numbers stay anchored. The accounts stay anchored. The quotes stay anchored.
Reproducibility across 5 identical runs
| Measure | Amdahl | Standard connectors | Advantage |
|---|---|---|---|
| Produced a usable answer (did the system finish the task?) | 5 / 5 runs | 4 / 5 runs | +1 run |
| Numbers stayed the same across runs (same percentages, counts, and dollar figures every time) | 73% | 28% | +45 pts |
| Findings repeated across runs (did the same patterns and conclusions appear every time?) | 71% | 47% | +24 pts |
| Same accounts and customer quotes surfaced (did the brief cite the same companies and verbatim quotes?) | 62% | 38% | +24 pts |
| Tools and queries used to investigate (did the AI take the same path through the data each time?) | Identical every run | Different every run | n/a |
Same prompt → same numbers → an answer you can put in front of a stakeholder twice.
That is what makes the output trustworthy enough to ship to a board, a customer, or a sales rep who needs the same answer the next time they ask.
Methodology
We ran the same question five times on each path. An independent AI judge then clustered the findings, quotes, and named accounts across runs and reported the percentage that appeared in every run.
- Same model (Claude Opus) for both paths
- Same budget, same turn limit, same prompt
- Independent AI judge compared outputs across runs
- Measured: numeric consistency, finding overlap, entity overlap, tool-path identity
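The overlap measures above can be sketched in a few lines of Python. This is an illustration only, not the judge's actual implementation: it assumes the judge has already clustered each run's findings, quotes, or accounts into canonical labels, and it assumes the percentage is taken over every distinct item seen in any run. The function names `share_in_every_run` and `same_tool_path` are hypothetical, and the sample labels are made up rather than benchmark data.

```python
from typing import Hashable, Iterable, Sequence


def share_in_every_run(runs: Sequence[Iterable[Hashable]]) -> float:
    """Fraction of distinct items (across all runs) that appear in every run.

    Assumption: items are already normalized to canonical labels, so two
    runs that cite the same finding produce the same label.
    """
    item_sets = [set(run) for run in runs]
    if not item_sets:
        return 0.0
    seen_anywhere = set().union(*item_sets)
    if not seen_anywhere:
        return 0.0
    seen_everywhere = set.intersection(*item_sets)
    return len(seen_everywhere) / len(seen_anywhere)


def same_tool_path(paths: Sequence[Sequence[str]]) -> bool:
    """True when every run used the same ordered sequence of tool calls."""
    return all(list(path) == list(paths[0]) for path in paths[1:])


# Illustrative labels only, not taken from the benchmark.
example_runs = [
    {"finding-a", "finding-b", "finding-c"},
    {"finding-a", "finding-b", "finding-d"},
    {"finding-a", "finding-b", "finding-c"},
]
example_share = share_in_every_run(example_runs)  # 2 of 4 distinct items
```

Under these assumptions, an item that shows up in four of five runs counts the same as one that shows up once: only items present in every run contribute to the score, which is why the measure is strict.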
The full dataset, including all 10 run outputs and the per-run scoring traces, is available on request.
Run this benchmark on your data
The benchmark can be run against a slice of your real conversations. Full methodology and raw outputs are available on request.