Why we published a negative result on ZNE before publishing the positive one.
The v1.2 paper's Appendix D.1 reports an experiment that failed. We tried to extend Qlro's one-shot calibration into a zero-noise extrapolation (ZNE) method, and unconstrained Richardson beat every calibration-informed variant we wrote. We published this before the follow-up paper on what did work.
What we tried
The thesis: if the Circuit Survival Estimator's calibrated parameters already describe the device's decay law, we should be able to plug those parameters into a ZNE fit and read F(0) off a physics-aware curve instead of the generic polynomial Richardson uses. Four variants, sketched in code after the list:
- Mitiq-style Richardson: the calibration-free baseline
- Rigid adaptive: fix the decay shape to K(λ) from our calibration, fit only an amplitude coefficient
- Prior-regularised: add a deviation exponent γ for per-circuit flexibility
- Direct inversion: a single-shot F_ideal = F(1) / K(1)
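To make the four rules concrete, here is a minimal sketch, not our actual code. Every number is invented, and the exponential form of K(λ) is a stand-in for the calibrated decay law; the paper's real parameterisation and scale factors are not reproduced here.

```python
# Minimal sketch of the four extrapolation rules. All numbers are invented
# and the exponential K(lam) is a stand-in for the calibrated decay law;
# the paper's actual parameterisation and scale factors are not shown here.
import numpy as np
from scipy.optimize import curve_fit

lams = np.array([1.0, 2.0, 3.0])       # noise scale factors (assumed)
F_obs = np.array([0.62, 0.41, 0.29])   # made-up measured fidelities
k_cal = 0.45                           # made-up calibrated decay rate

def K(lam, k=k_cal):
    """Placeholder calibrated decay law with K(0) = 1."""
    return np.exp(-k * lam)

# 1. Mitiq-style Richardson: exact polynomial through the points, read F(0).
richardson = np.polyval(np.polyfit(lams, F_obs, len(lams) - 1), 0.0)

# 2. Rigid adaptive: fix the shape to K(lam), fit only an amplitude A.
#    F(lam) = A * K(lam), so the zero-noise value is F(0) = A.
(A,), _ = curve_fit(lambda lam, A: A * K(lam), lams, F_obs, p0=[1.0])
rigid = A

# 3. Prior-regularised: allow a per-circuit deviation exponent gamma,
#    F(lam) = A * K(lam)**gamma, with gamma expected to sit near 1.
(A2, gamma), _ = curve_fit(lambda lam, A, g: A * K(lam) ** g,
                           lams, F_obs, p0=[1.0, 1.0])
prior_reg = A2

# 4. Direct inversion: one shot at lam = 1, clipped to the physical range.
direct = min(F_obs[0] / K(1.0), 1.0)

print(richardson, rigid, prior_reg, direct)
```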
We ran three out-of-sample ladder circuits on IQM Garnet (5, 10, and 20 CNOTs) and scored each method against the Aer-verified ideal fidelity.
What actually happened
Unconstrained Richardson achieved a mean residual of 0.103. Our rigid-adaptive fit won at the middle depth (residual 0.014 on the 10-CNOT MedLadder), then catastrophically overshot the deep ladder: on 20 CNOTs it extrapolated to F(0) = 1.585, a fidelity above 1.0 and therefore physically impossible. Direct inversion clipped to 1.0 on the deeper circuits, carrying no information. No calibration-informed method was consistently better than Richardson across depths.
The diagnosis is that the two-parameter calibration (ε2q, dc) is degenerate: many parameter pairs fit the two calibration-circuit observations equally well, and the extrapolation to intermediate depths lands at different places depending on which pair the fitter picks. Richardson wins because it uses no calibration at all: there is no degenerate fit to mis-resolve.
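A stylised illustration of the failure mode, not the paper's actual model. We assume, purely for illustration, that K depends on two-qubit gate count n and circuit duration T; because the two hypothetical calibration circuits below have proportional (n, T), the two observations pin down only one combination of the parameters, and distinct (ε2q, dc) pairs agree on calibration yet disagree out of sample.

```python
# Stylised illustration of the degeneracy, not the paper's model. We assume
# K = (1 - eps)**n * exp(-dc * T) for two-qubit gate count n and duration T.
# Every name and number here is hypothetical.
import numpy as np

def K(eps, dc, n, T):
    return (1.0 - eps) ** n * np.exp(-dc * T)

cal_circuits = [(4, 8.0), (8, 16.0)]   # (n, T), deliberately proportional

# Both observations constrain only a + 2*b, with a = -ln(1 - eps), b = dc.
pair_1 = (0.020, 0.0100)
rate = -np.log(1 - pair_1[0]) + 2 * pair_1[1]
pair_2 = (1 - np.exp(-(rate - 2 * 0.0025)), 0.0025)  # same rate, other dc

for n, T in cal_circuits:
    # Identical fits: the calibration data cannot tell the pairs apart.
    print(n, T, K(*pair_1, n, T), K(*pair_2, n, T))

# An out-of-sample circuit with a different n:T ratio separates them, so
# the extrapolated K depends on which pair the fitter happened to pick.
print(K(*pair_1, 10, 12.0), K(*pair_2, 10, 12.0))
```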
Why it's Appendix D, not a footnote
A methods paper that only reports experiments that worked invites the correct reviewer question: "which experiments did you try and not report?". The honest record is this: we tried ZNE, we tried four variants, we logged every Braket ARN, we concluded the extension does not carry through at the current calibration size, and we said so by name.
Two second-order reasons it was worth the page count:
- The failure taught us the forward equation. Writing down why the methods failed forced us to derive the global-depolarizing-channel forward model explicitly:

  F = K · F_ideal + (1 − K) · F_uniform

  That equation is the one the mixture-consistent predictor in the main body uses. The useful derivation fell out of the experiment that failed.
- The asymmetry between mitigation and mapping. Appendix D.2 reports a different extension, measurement-informed qubit mapping, that succeeded on the same hardware. Why? Mitigation demands a numerically accurate K so that F_ideal = F_observed / K lands on the right number. Mapping demands only that the ranking of K(chain_A) vs K(chain_B) be preserved. Calibration drift kills mitigation because the absolute number moves; it does not kill mapping because the ranking survives. The sketch after this list makes the asymmetry concrete.
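A minimal sketch of the forward model and the asymmetry, with invented numbers. The F_uniform value, the K values, and the drift factor are all illustrative, and the assumption that drift rescales every chain's K proportionally is ours, made to show the ranking argument in the simplest case.

```python
# Sketch of the forward model and of why drift hurts mitigation but not
# mapping. F_uniform, the K values, and the drift factor are invented.
import numpy as np

def forward(F_ideal, K, F_uniform):
    """F = K * F_ideal + (1 - K) * F_uniform."""
    return K * F_ideal + (1 - K) * F_uniform

def invert(F_obs, K, F_uniform):
    """Mitigation: solve the forward model for F_ideal.
    (F_obs / K, as quoted in the text, is the F_uniform -> 0 limit.)"""
    return (F_obs - (1 - K) * F_uniform) / K

F_uniform = 1 / 16                    # e.g. a 4-qubit target state (assumed)
K_true, K_stale = 0.50, 0.40          # calibration drifted between runs
F_obs = forward(0.97, K_true, F_uniform)

print(invert(F_obs, K_true, F_uniform))   # ~0.97: the right number
print(invert(F_obs, K_stale, F_uniform))  # ~1.20: unphysical, like F(0)=1.585

# Mapping: drift that rescales every chain's K proportionally leaves the
# ranking of candidate chains untouched.
K_chains = {"chain_A": 0.55, "chain_B": 0.48, "chain_C": 0.61}
fresh = sorted(K_chains, key=K_chains.get, reverse=True)
stale = sorted(K_chains, key=lambda c: 0.8 * K_chains[c], reverse=True)
print(fresh, stale)                       # same order either way
```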
The commercial reason
Qlro sells device-selection decisions to buyers who are audited on those decisions. The trust model depends on the buyer being able to verify that our successes are not cherry-picked from a larger pool of unreported failures. Publishing the negative result is how we earn that trust and turn a one-time sale into a long-term relationship. Customers reading the procurement docs for our Enterprise tier can see, in our own paper, what we tried and what did not work.