First Out-of-Sample Check of the Composite Quintile Contrast
90-day snapshot dataset · 35-day matured OOS window · BTC + ETH · score_v3
RiskState Research · 18 May 2026
Abstract
We test whether a previously observed in-sample relationship between the RiskState composite score and 7-day forward BTC/ETH outcomes survives out-of-sample validation. The pre-specified statistic compares the pooled win rate of middle composite quintiles, Q2∪Q3, against higher composite quintiles, Q4∪Q5. The statistic and its accompanying inference procedure were pre-registered internally on 8 April 2026, before the out-of-sample window opened (the pre-registration record is held in a private repository; audit available on request). Using a 90-day snapshot dataset from 17 February 2026 to 18 May 2026 with the pre-registration date as the IS/OOS cutoff, the primary score_v3 rescored analysis produces an in-sample pooled gap of +13.6 pp, 95% CI [−0.2, 25.1] and an out-of-sample pooled gap of −3.2 pp, 95% CI [−30.7, 23.9]. Per-asset decomposition shows that BTC and ETH directions agree in-sample (BTC +13.5 pp, ETH +13.1 pp) but disagree out-of-sample (BTC −6.7 pp, ETH +1.8 pp). Under the Backtest framework's stated policy that pooled results are headline-citable only when per-asset directions agree on the window, the pooled OOS gap is suppressed. Neither half clears the pre-specified confirmation criterion (95% CI excludes zero across all sensitivity block lengths, with per-asset agreement). The result is classified as OOS weakened, not disproven: the effective OOS sample size is too small to support a stronger claim in either direction. The score_v3 freeze remains in place and the same test will be repeated after additional forward data accumulates.
Plain-language summary
We previously observed, in-sample, that win rates on 7-day BTC and ETH outcomes were higher in the middle of the composite-score distribution than at the top. After collecting 35 additional days of matured forward returns, that observation does not survive the same test under the current scoring rule. Pooled across BTC and ETH, the out-of-sample point estimate flipped sign and the confidence interval includes zero. Breaking the pooled result into per-asset components reveals that the two assets disagree on direction out-of-sample, which under our own stated policy means the pooled number should not be treated as a headline. Both halves' confidence intervals also include zero on every block length we tested. We are not using this contrast as a live claim. The scoring engine remains on the same version it was on at the cutoff date, by policy, and we will re-run the identical test once more forward data has accumulated.
1. Background
In a series of internal analyses through April 2026 we observed that pooled win rates for the BTC + ETH 7-day return horizon were higher in the middle quintiles of the composite score than in the upper quintiles. The pattern was originally motivated by an apparent inverted-U shape across the full quintile distribution; the pre-specified test reported here evaluates only the Q2∪Q3 versus Q4∪Q5 leg of that shape and is therefore best read as a middle-versus-high composite quintile contrast rather than as a full inverted-U test. The strongest prior in-sample estimate (Aristotle round 2, 8 April 2026) reported a gap of +21.7 pp with 95% CI [10.6, 34.6] — but under a different contrast definition (rank-based tercile rather than quintile, and using stored mixed-version composites, not uniform rescoring). That number is not the in-sample baseline for the test reported here; the corresponding in-sample number under the present test is +13.6 pp, [−0.2, 25.1].
Because the in-sample window was short and the test contained discretionary elements (multiple block lengths, multiple horizons, multiple quintile groupings), the editorial policy of this research program requires that the finding be revalidated on data not available at the time the hypothesis was stated. The OOS window opened on 9 April 2026 and was scheduled to close at 90 days. This paper reports the result of that scheduled test.
2. Data
The validation dataset is the full collection of RiskState snapshots covering 17 February 2026 to 18 May 2026, approximately 90 calendar days. Snapshots come from two independent automated processes plus occasional manual saves:
| Source | Cadence | Storage | Count |
|---|---|---|---|
| Server cron | Every 4 hours at :00 UTC | Netlify Blobs (snapshots store) | 534 |
| Browser auto-capture | Every 4 hours when the dashboard is open | localStorage.likido_auto_snapshots | 217 |
| Manual saves | User-triggered | merged at read | 1 |
| Merged total | — | — | 752 |
The two automated sources share intent but not exact timing — server cron fires at :00 of every fourth hour; browser auto-capture fires ≥ 4 h after the previous browser snapshot, conditioned on the user holding the page open. The streams have effectively zero ts_epoch overlap.
Of the 752 snapshots, 709 are paired with non-missing 7-day forward returns for both BTC and ETH. Forward returns are filled by a separate scheduled function that reads Binance klines after the relevant horizon has elapsed. The 43-snapshot gap between total and paired is dominated by the most recent week, where the 7-day horizon has not yet matured. Under the rescoring path, the IS half is further restricted to snapshots with full ETH indicator coverage — this drops the IS subset from 395 to 311 and the matched OOS subset is 314, so the rescored analysis operates on 311 + 314 = 625 paired snapshots (the remaining 709 − 625 = 84 are pre-9-March-2026 snapshots without ETH-side inputs). The §4.3 full-pool sensitivity table uses all 709 paired observations under the stored variant and only the 625 with full rescoring inputs under the rescored variant.
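The sample accounting in this section can be reconciled with a few lines of arithmetic. The sketch below only restates counts quoted in the text; the variable names are illustrative and are not taken from the validator script.

```python
# Counts quoted in section 2; variable names are illustrative only.
merged_total = 534 + 217 + 1      # server cron + browser auto-capture + manual
paired = 709                      # snapshots with non-missing 7-day returns
is_stored, oos = 395, 314         # stored-variant IS and OOS halves
is_rescored = 311                 # IS half with full ETH indicator coverage

unpaired = merged_total - paired              # mostly the unmatured last week
rescored_total = is_rescored + oos            # pool used by the rescored analysis
dropped_pre_eth = paired - rescored_total     # pre-9-March snapshots

print(merged_total, unpaired, rescored_total, dropped_pre_eth)
```

The 84 snapshots dropped by the rescoring path equal the reduction of the IS half from 395 to 311, so the OOS half is untouched by the ETH-coverage restriction.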
The effective out-of-sample window is shorter than the nominal post-cutoff calendar span. The cutoff is 8 April 2026 and the snapshot dataset closes on 18 May 2026 (40 calendar days). The 7-day return horizon must mature before each observation enters the analysis, so the matured OOS span is approximately 35 days. This paper therefore reports a 90-day snapshot dataset with a 35-day matured OOS window. The two should not be confused.
Data-quality threshold: the analysis applies no minimum qualityScore exclusion. All paired snapshots with valid 7-day returns enter the test, including those with one or more MOCK or fallback data sources. This choice is intentional for this paper: the pre-specified statistic does not depend directly on qualityScore, and excluding low-quality snapshots after the fact would risk biasing the in-sample / out-of-sample comparison if quality differs across the two halves. No snapshots are therefore excluded on quality grounds. A separate qualityScore ≥ 80 subset is available in the validator script for sensitivity work but is not the pre-registered analysis path.
The dataset has additional structural caveats:
- ETH coverage begins 9 March 2026. Snapshots before that date do not record ETH-side indicators in the form required by the rescoring path. Under the rescored analysis below, this reduces the IS half from 395 to 311 paired observations and shortens its matured span from 49 d (pre-ETH-coverage stored variant) to 35 d (post-ETH-coverage rescored variant).
- Browser auto-snapshot loss event, May 2026. Approximately 140 historical browser-side auto-snapshots in the pre-cutoff window were lost in early May 2026 due to a now-fixed rotation cap. The dataset used here cannot fully reconstruct the snapshot set that was available at the time the original IS finding was stated. The in-sample contrast computed in this paper is therefore not a bit-exact replication of the original; it is a recomputation on the largest comparable surviving subset.
- Effective sample size is much lower than nominal for autocorrelated subsets. Adjacent 4-hour snapshots are not independent: both indicators and forward returns overlap. Bartlett-corrected N_eff is reported alongside nominal N throughout. The stored-composite IS subset is an exception: its lag-1 autocorrelation is close to zero, which appears to reflect the mixture of browser-source and server-source snapshots within that subset producing two interleaved processes with different stored composite values at near-identical timestamps. The rescored analysis recovers the expected N_eff << N shrinkage by applying a single formula uniformly. We do not adjust this artefact away because it is a faithful read-out of the stored data; we label the stored variant as secondary (§3.7) and base the headline verdict exclusively on the rescored analysis.
3. Methodology
3.1 Pre-specified statistic
For each composite-score quintile Q_k, the pooled BTC + ETH win rate is
WR(Q_k) = (#{r in Q_k : r_BTC_7d > 0} + #{r in Q_k : r_ETH_7d > 0})
/ (|Q_k|_BTC + |Q_k|_ETH)
The pre-specified contrast statistic is
Delta = WR(Q2 union Q3) - WR(Q4 union Q5)
expressed in percentage points. A positive value indicates a higher win rate in the middle of the composite distribution than at the upper edge. We will refer to this throughout as the middle-versus-high composite quintile contrast. It is not a full inverted-U test; the lower-edge quintile Q1 is excluded from the pre-specified contrast.
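As a minimal sketch of the statistic defined above, the following computes the pooled win rate and the middle-versus-high contrast on a toy set of rows. The row layout (keys `q`, `r_btc_7d`, `r_eth_7d`) and the function names are assumptions made for illustration; the validator's actual data structures are not shown in this paper.

```python
def win_rate(rows, quintiles):
    """Pooled BTC + ETH win rate over rows whose quintile label is in `quintiles`."""
    wins = total = 0
    for row in rows:
        if row["q"] in quintiles:
            for key in ("r_btc_7d", "r_eth_7d"):
                wins += row[key] > 0      # a "win" is a strictly positive 7-day return
                total += 1
    return wins / total

def contrast_pp(rows):
    """Delta = WR(Q2 union Q3) - WR(Q4 union Q5), in percentage points."""
    return 100.0 * (win_rate(rows, {2, 3}) - win_rate(rows, {4, 5}))

# Toy data, one row per quintile; not taken from the RiskState dataset.
rows = [
    {"q": 1, "r_btc_7d": 0.01, "r_eth_7d": -0.02},   # Q1 is ignored by the contrast
    {"q": 2, "r_btc_7d": 0.03, "r_eth_7d": 0.01},
    {"q": 3, "r_btc_7d": 0.02, "r_eth_7d": -0.01},
    {"q": 4, "r_btc_7d": -0.01, "r_eth_7d": 0.02},
    {"q": 5, "r_btc_7d": -0.03, "r_eth_7d": -0.02},
]
delta = contrast_pp(rows)
print(delta)
```

In the toy data the middle group wins 3 of 4 asset-observations and the high group wins 1 of 4, so the contrast is 50 pp. The sketch does not guard against an empty quintile group, which a production implementation would need to handle.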
We additionally report two per-asset variants that compute the same statistic on a single asset's returns: Delta_BTC and Delta_ETH. The composite-score quintile boundaries are derived once, from the pooled composite-score distribution of the analyzed subset (so quintile membership is identical across the BTC-only and ETH-only analyses); only the outcome variable (which asset's 7-day return) differs. The variants therefore test whether the same composite-score ranking predicts each asset's outcome separately. They are used to verify direction agreement between assets on a given window, per the framework's pooled-only-when-agreed policy (§3.5).
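The shared-ranking design can be illustrated as follows: quintile labels are assigned once from the composite scores, and the same labels are then reused for each asset's returns. The equal-count, rank-based labelling rule used here is an assumption (the paper does not state how ties or uneven bin sizes are handled), and all names and numbers are hypothetical.

```python
def assign_quintiles(scores):
    """Label each score 1..5 by rank within the analyzed subset (assumed rule)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, i in enumerate(order):
        labels[i] = 1 + (5 * rank) // len(scores)
    return labels

def asset_contrast_pp(labels, returns):
    """Middle-versus-high win-rate contrast on a single asset's 7-day returns."""
    def wr(group):
        picked = [r for q, r in zip(labels, returns) if q in group]
        return sum(r > 0 for r in picked) / len(picked)
    return 100.0 * (wr({2, 3}) - wr({4, 5}))

# Toy inputs; not taken from the RiskState dataset.
composite = [12, 35, 48, 51, 60, 66, 71, 79, 84, 93]
r_btc = [0.01, -0.02, 0.03, 0.02, 0.01, 0.04, -0.01, 0.02, -0.03, -0.02]
r_eth = [0.02, 0.01, 0.01, -0.01, 0.02, 0.03, 0.01, -0.02, 0.01, -0.01]

labels = assign_quintiles(composite)        # one labelling, shared by both assets
delta_btc = asset_contrast_pp(labels, r_btc)
delta_eth = asset_contrast_pp(labels, r_eth)
print(labels, delta_btc, delta_eth)
```

Because `labels` is computed once, any difference between `delta_btc` and `delta_eth` comes from the outcome series alone, which is the property the direction-agreement check relies on.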
3.2 Inference
We use circular block bootstrap (Politis & Romano, 1992) with a primary block length of 21 days and sensitivity checks at 14 and 28 days. The block length in observations is computed from the median inter-snapshot gap in the analyzed subset, so it adapts to the realised sampling cadence. We confirm below that both IS and OOS halves resolve to a 21-day primary block at their respective cadences (effective block lengths in both halves of the rescored analysis are 21 d / 21 d).
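One plausible reading of the block-length rule is sketched below: a block length stated in days is converted to a count of consecutive snapshots by dividing by the median gap between snapshots. The conversion and rounding rule, the function name `block_length_obs`, and the toy timestamps are assumptions; the paper states only that the block length is computed from the median inter-snapshot gap.

```python
from statistics import median

def block_length_obs(ts_epoch, block_days):
    """Convert a block length in days into a number of consecutive snapshots."""
    ts = sorted(ts_epoch)
    gaps = [b - a for a, b in zip(ts, ts[1:])]     # seconds between neighbours
    median_gap_s = median(gaps)
    return max(1, round(block_days * 86_400 / median_gap_s))

# Hypothetical subset sampled exactly every 4 hours.
ts = [i * 4 * 3600 for i in range(60)]
blocks = {days: block_length_obs(ts, days) for days in (14, 21, 28)}
print(blocks)
```

At a regular 4-hour cadence a 21-day block spans 126 snapshots. A subset that mixes two capture sources has a shorter median gap, so the same 21 days would map to more observations per block.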
We draw 2 000 bootstrap resamples (seed 42) and report:
- The observed gap Delta_obs
- The bootstrap mean mean(Delta_boot)
- The 95% percentile confidence interval [Delta_0.025, Delta_0.975]
- P(Delta <= 0) as the bootstrap-empirical probability that the contrast is non-positive
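The four reported quantities can be produced by a circular block bootstrap along the following lines. This is a generic sketch of the technique, not the validator's code: the function name, the simple percentile indexing, and the toy series are assumptions, and the statistic passed in here is a plain mean rather than the quintile contrast.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def circular_block_bootstrap(data, stat, block_len, n_boot=2000, seed=42):
    """Observed value, bootstrap mean, 95% percentile CI and P(stat <= 0)."""
    rng = random.Random(seed)
    n = len(data)
    n_blocks = -(-n // block_len)                  # ceil(n / block_len)
    draws = []
    for _ in range(n_boot):
        sample = []
        for _ in range(n_blocks):
            start = rng.randrange(n)
            # Indices wrap past the end of the series: the "circular" part.
            sample.extend(data[(start + j) % n] for j in range(block_len))
        draws.append(stat(sample[:n]))             # trim to the original length
    draws.sort()
    lo = draws[int(0.025 * (n_boot - 1))]
    hi = draws[int(0.975 * (n_boot - 1))]
    return {
        "observed": stat(data),
        "boot_mean": sum(draws) / n_boot,
        "ci95": (lo, hi),
        "p_le_zero": sum(d <= 0 for d in draws) / n_boot,
    }

# Deterministic toy series with a positive mean; not RiskState data.
series = [((i * 37) % 11 - 4) / 10 for i in range(120)]
result = circular_block_bootstrap(series, mean, block_len=12)
print(result)
```

To apply this to the contrast, `data` would hold per-snapshot rows and `stat` would be the contrast function. Whether quintile labels are re-derived inside each resample is not stated in this paper and is left open here.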
Effective sample size is computed from the Bartlett-style autocorrelation correction
N_eff = N / (1 + 2 * sum_{k=1..K} rho(k))
with truncation at K = floor(N^(1/3)) and the standard convention that negative autocorrelation terms are clipped to zero before summation (this prevents negative sample autocorrelations from pushing N_eff above N). The implementation operates on the composite-score time series within each analyzed subset.
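A direct transcription of the N_eff formula, under the stated truncation and clipping rules, might look like the sketch below. The autocorrelation estimator (full-sample mean and variance, sum over n − k pairs) is an assumption, since the paper does not specify it, and the input series is a toy example.

```python
def autocorr(x, k):
    """Sample autocorrelation of x at lag k (assumed estimator)."""
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x)
    if var == 0:
        return 0.0
    cov = sum((x[i] - m) * (x[i + k] - m) for i in range(n - k))
    return cov / var

def integer_cube_root(n):
    """floor(n ** (1/3)) without floating-point edge cases at perfect cubes."""
    k = round(n ** (1 / 3))
    while k ** 3 > n:
        k -= 1
    while (k + 1) ** 3 <= n:
        k += 1
    return k

def n_eff(x):
    """N / (1 + 2 * sum_{k=1..K} max(rho(k), 0)) with K = floor(N^(1/3))."""
    n = len(x)
    k_max = integer_cube_root(n)
    rho_sum = sum(max(0.0, autocorr(x, k)) for k in range(1, k_max + 1))
    return n / (1 + 2 * rho_sum)

trending = [float(i) for i in range(100)]   # strongly autocorrelated toy series
print(round(n_eff(trending), 1))
```

Because every clipped term is non-negative, the denominator is at least 1 and N_eff can never exceed N. For the trending toy series the four retained lags sum to about 3.7, which shrinks 100 nominal observations to roughly 12 effective ones.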
3.3 IS / OOS cutoff
The cutoff is 2026-04-08. Snapshots with ts < 2026-04-08T00:00:00Z are the in-sample (IS) half. Snapshots with ts ≥ 2026-04-08T00:00:00Z are the out-of-sample (OOS) half. Both halves are processed identically.
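The boundary convention matters for snapshots stamped exactly at midnight UTC on the cutoff date. A minimal sketch of the split, with hypothetical snapshot records and an assumed `ts` field holding a timezone-aware datetime:

```python
from datetime import datetime, timezone

CUTOFF = datetime(2026, 4, 8, tzinfo=timezone.utc)

def split_is_oos(snapshots):
    """Strictly-before-cutoff rows are IS; at-or-after-cutoff rows are OOS."""
    is_half = [s for s in snapshots if s["ts"] < CUTOFF]
    oos_half = [s for s in snapshots if s["ts"] >= CUTOFF]
    return is_half, oos_half

# Hypothetical snapshots around the boundary.
snaps = [
    {"id": "a", "ts": datetime(2026, 4, 7, 20, 0, tzinfo=timezone.utc)},
    {"id": "b", "ts": datetime(2026, 4, 8, 0, 0, tzinfo=timezone.utc)},
    {"id": "c", "ts": datetime(2026, 4, 8, 4, 0, tzinfo=timezone.utc)},
]
is_half, oos_half = split_is_oos(snaps)
print([s["id"] for s in is_half], [s["id"] for s in oos_half])
```

The snapshot stamped exactly 2026-04-08T00:00:00Z falls in the OOS half, and every snapshot lands in exactly one half.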
3.4 Pre-registration
The pre-specified statistic, inference procedure, and confirmation criterion (§3.5 below) live in scripts/validate-backtest.mjs in the RiskState repository. The validator script was first committed on 2026-04-08, 22:56 UTC — before the OOS window opened at 2026-04-09 00:00 UTC. Subsequent commits to the validator add the --cutoff flag and per-asset decomposition used to produce the tables in §4, but do not change the underlying contrast, the bootstrap procedure, or the confirmation criterion. The repository is currently private; the commit history (including the original timestamp and the pre-OOS state of the validator) is auditable on request.
Within the framework of pre-specified directional hypotheses, the middle-versus-high quintile contrast was the sole statistic specified for this window. Other quantities the validator reports — quintile Sharpe ratios, permutation test on max-|r| across horizons, spanning regression for subscore redundancy — are diagnostics, not hypothesis tests, and are not used to draw the verdict in §4.
3.5 Pre-specified confirmation criterion
A finding is considered confirmed by this test on a given half (IS or OOS) only when all of the following hold for that half:
1. The 95% percentile CI excludes zero at the primary 21-day block length.
2. The 95% percentile CI excludes zero at the 14-day sensitivity block length.
3. The 95% percentile CI excludes zero at the 28-day sensitivity block length.
4. Per-asset BTC-only and ETH-only contrasts agree on the sign of the bootstrap mean on that half.
Conditions (1)–(3) constitute the Backtest framework's "across all block lengths" standard; this paper does not relax it to the primary block alone. Condition (4) operationalises the "asset-level first, pooled second" policy: if BTC and ETH directions disagree on a window, the pooled headline for that window is suppressed regardless of whether the pooled CI excludes zero. The criterion is evaluated independently on the IS half (yielding in-sample confirmed if all four hold) and on the OOS half (yielding OOS confirmed only if all four also hold on OOS).
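The four-condition rule can be written as a single predicate. The sketch below is illustrative only: the function names and input layout are hypothetical, the example inputs are invented rather than taken from this paper, and "excludes zero" is read literally (an interval entirely below zero would also pass, whereas a strictly directional reading would require the lower bound to be positive).

```python
def excludes_zero(ci):
    """True if the interval (lo, hi) lies entirely on one side of zero."""
    lo, hi = ci
    return lo > 0 or hi < 0

def same_sign(a, b):
    """True if both values are strictly positive or both strictly negative."""
    return (a > 0 and b > 0) or (a < 0 and b < 0)

def confirmed(ci_by_block, btc_boot_mean, eth_boot_mean):
    """All of conditions (1)-(4) must hold on the half being evaluated."""
    cis_ok = all(excludes_zero(ci_by_block[days]) for days in (21, 14, 28))
    return cis_ok and same_sign(btc_boot_mean, eth_boot_mean)

# Hypothetical inputs, in percentage points; not results from this paper.
all_clear = {21: (2.0, 20.0), 14: (1.0, 22.0), 28: (3.0, 19.0)}
one_cross = {21: (2.0, 20.0), 14: (-1.0, 22.0), 28: (3.0, 19.0)}

passes = confirmed(all_clear, 8.0, 5.0)
fails_on_block = confirmed(one_cross, 8.0, 5.0)
fails_on_direction = confirmed(all_clear, -8.0, 5.0)
print(passes, fails_on_block, fails_on_direction)
```

A single sensitivity block crossing zero, or a sign disagreement between the per-asset bootstrap means, is enough to withhold confirmation even when the primary-block CI is clear of zero.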
Failure to meet all four conditions on a half results in classification by the four-state taxonomy described in the Backtest page: suggestive (in-sample CI weakly excludes zero but not the full criterion), in-sample confirmed (in-sample meets all four), OOS weakened (in-sample met some criteria but OOS did not preserve them), OOS confirmed (both halves meet all four).
3.6 Primary analysis: rescored composite
The primary analysis quintile-ranks every snapshot using the current normalizeSignals formula (score_v3), applied uniformly across the entire dataset. The IS and OOS halves are thus compared under a single scoring rule. This is the honest test of the current signal under the current scoring rule, and the verdict in §4.5 is based on this variant.
3.7 Secondary analysis: stored composite
A secondary analysis quintile-ranks each snapshot by the composite.overall value as it was stored at snapshot time. Because the scoring version changed during the dataset's calendar span (score_v2 → score_v3 cutover at PR2 on 21 April 2026, inside the OOS half), this variant mixes two scoring versions within the OOS subset. The IS half is entirely under score_v2 stored composites. We report this variant for historical continuity with prior internal analyses; it is not used to draw the headline verdict, and a reader interested only in the current-scoring conclusion should skip to §4.
4. Results
4.1 Primary result — rescored score_v3, pooled
| Half | N paired | N_eff | Matured span | Gap | 95% CI | P(Delta ≤ 0) |
|---|---|---|---|---|---|---|
| IS | 311 | 49 | 35 d¹ | +13.6 pp | [−0.2, 25.1] | 2.75% |
| OOS | 314 | 15 | 35 d | −3.2 pp | [−30.7, 23.9] | 58.45% |
¹ Post-ETH-coverage span; the underlying IS calendar window is 17 Feb 2026 → 8 Apr 2026 (≈ 50 d) but only the post-9-March-2026 portion enters the rescoring path, giving a 35-day data span. The 49-d figure in the §4.4 stored table is the realised timestamp span of that variant's paired snapshots, which start from the first paired observation rather than the calendar window edge.
Gap delta (OOS − IS) = −16.8 pp. Retention ratio = OOS gap / IS gap = −3.2 / 13.6 ≈ −0.24. We retain "retention ratio" as a convenient summary but note that a negative value is interpretively awkward: it indicates a sign reversal of the point estimate from IS to OOS, not a magnitude attenuation. A reader should treat the sign flip as the substantive finding, not the ratio.
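The two summary numbers in this paragraph follow directly from the pooled gaps in the table above. A short check, with illustrative variable names:

```python
is_gap_pp, oos_gap_pp = 13.6, -3.2          # pooled rescored gaps, in pp

gap_delta = oos_gap_pp - is_gap_pp          # OOS minus IS
retention = oos_gap_pp / is_gap_pp          # dimensionless ratio
sign_flip = (is_gap_pp > 0) != (oos_gap_pp > 0)

print(f"{gap_delta:+.1f} pp, retention {retention:.2f}, sign flip: {sign_flip}")
```

Reporting `sign_flip` alongside the ratio makes the reversal explicit rather than leaving it implicit in a negative retention value.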
The IS 95% CI marginally fails to exclude zero at its lower bound of −0.2 pp. Under the pre-specified confirmation criterion (§3.5), the IS result alone is therefore not in-sample confirmed; it is at best suggestive under the framework taxonomy. The OOS 95% CI is wide and centred near zero with the point estimate reversed. The pooled result alone, if taken at face value, would classify the outcome as OOS weakened — but the headline reading depends on condition (4) of the criterion, examined next.
4.2 Per-asset decomposition
Same primary (rescored) analysis, computed separately on BTC and ETH 7-day returns using identical quintile boundaries derived from the pooled composite ranking:
| Half | Asset | N paired | N_eff | Gap | 95% CI | P(Delta ≤ 0) |
|---|---|---|---|---|---|---|
| IS | BTC | 311 | 49 | +13.5 pp | [−1.5, 26.5] | 3.85% |
| IS | ETH | 311 | 49 | +13.1 pp | [−3.4, 26.3] | 5.40% |
| OOS | BTC | 314 | 15 | −6.7 pp | [−38.8, 20.1] | 67.55% |
| OOS | ETH | 314 | 15 | +1.8 pp | [−25.1, 29.2] | 42.20% |
| Window | BTC direction | ETH direction | Direction agreement |
|---|---|---|---|
| In-sample | +13.5 pp (positive) | +13.1 pp (positive) | AGREE |
| Out-of-sample | −6.7 pp (negative) | +1.8 pp (positive) | DISAGREE |
The OOS per-asset N_eff = 15 is identical across BTC and ETH because N_eff is computed from the composite-score time series of the subset (the same series for both rows), not from each asset's outcome series. The two assets' return autocorrelation structures can differ, but the autocorrelation of the ranking variable — the composite — is what drives the block-bootstrap inference here.
This is the substantive finding of the paper. In-sample, BTC and ETH show very similar positive gaps (~+13 pp each, near-overlapping CIs), so the pooled IS number is a faithful summary of the per-asset picture. Out-of-sample, BTC and ETH disagree on direction: BTC reverses to −6.7 pp while ETH retains a small positive +1.8 pp. Under the framework policy stated in §3.5 condition (4), pooled OOS results should only be treated as headline-citable when per-asset directions agree. The pooled OOS pp number is therefore suppressed as the headline. The result of this window is not a single pooled gap; it is "in-sample, both assets behave consistently in the positive direction at a marginal level; out-of-sample, the two assets do not behave consistently."
4.3 Full-pool sensitivity
Without the cutoff split, the primary 21-day-block contrast on the full pool is:
| Variant | N paired | Gap | 95% CI | P(Delta ≤ 0) | CI > 0 |
|---|---|---|---|---|---|
| Rescored | 625 | +6.0 pp | [−7.6, 18.4] | 18.75% | no |
| Stored | 709 | +12.7 pp | [−1.2, 26.6] | 4.05% | no |
At the 21-day primary block neither variant excludes zero. At the 28-day sensitivity block, the stored variant nominally clears zero (+13.3 pp, [1.2, 25.2]) but this CI is unreliable: a 28-day block on a 90-day calendar dataset gives only ~3 independent blocks per resample, which makes the percentile bootstrap poorly behaved. The 14-day sensitivity block on the stored variant produces +12.1 pp, [−4.3, 27.7] — also crossing zero. The pre-specified criterion (§3.5) requires all three block lengths to exclude zero; this is not met by either variant.
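The "~3 independent blocks" figure is the calendar span divided by the block length. The same arithmetic for all three block lengths, using only the 90-day span quoted in this paper:

```python
span_days = 90                                # full-pool calendar span
blocks_per_span = {
    block_days: round(span_days / block_days, 1)
    for block_days in (14, 21, 28)
}
print(blocks_per_span)
```

The count of non-overlapping blocks falls from about six at 14 days to about three at 28 days, which is why the longest block length gives the least stable percentile interval.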
4.4 Secondary result — stored composite (historical continuity)
For continuity with prior internal analyses, the same contrast computed on stored-composite quintile ranks:
| Half | N paired | N_eff | Matured span | Gap | 95% CI | P(Delta ≤ 0) |
|---|---|---|---|---|---|---|
| IS | 395 | 395² | 49 d | +12.4 pp | [−3.8, 27.5] | 6.30% |
| OOS | 314 | 26 | 35 d | +7.7 pp | [−27.1, 34.2] | 30.70% |
² The IS stored N_eff = N is unusual and is discussed in §2: the lag-1 autocorrelation of the stored composite time series within this subset is approximately zero, an artefact of mixing browser-source and server-source snapshots that produce two interleaved processes with different stored composite values at near-identical timestamps. The rescored primary analysis is not affected by this artefact. We report N_eff = N as observed rather than overriding it; a reader should treat the corresponding stored-variant CI with caution.
The stored-composite OOS point estimate remains positive but the 95% CI still contains zero on the primary block and per-asset agreement is not separately reported for this variant. Under the pre-specified confirmation criterion this variant also fails to confirm.
4.5 Verdict
| Criterion (§3.5) | Status (OOS) |
|---|---|
| (1) 21-day-block CI excludes zero | not met: OOS CI [−30.7, 23.9] contains 0 (the IS CI lower bound, −0.2 pp, is also just below zero) |
| (2) 14-day-block CI excludes zero | not met |
| (3) 28-day-block CI excludes zero | not met (or unreliable, full-pool only) |
| (4) Per-asset direction agreement OOS | not met (BTC −6.7 pp, ETH +1.8 pp) |
Primary verdict: not confirmed out of sample. Classification under the framework's four-state taxonomy: OOS weakened, not disproven.
We do not classify the result as disproven because (a) the IS per-asset directions agree at a marginal positive magnitude, (b) the OOS effective sample size is small enough that the wide CIs cannot exclude a meaningful underlying effect, and (c) the per-asset OOS disagreement could be a function of regime asymmetry between BTC and ETH over the specific 35-day window rather than a structural failure of the contrast.
5. Limitations
- Effective sample size in the OOS half is low. Under rescoring, N_eff = 15 for both per-asset halves, well below the 30-paired-observation rule of thumb for block bootstrap inference. A back-of-envelope power calculation, treating the percentile CI half-width as approximately 1.96 × SE, gives SE ≈ 13.9 pp on the pooled OOS gap and a minimum detectable effect at 80% power of roughly 35 pp. The OOS test as configured here could only have confirmed the in-sample +13.6 pp gap if the true OOS effect were materially larger than the in-sample point estimate, which is not what one would expect under the null of "the in-sample pattern reflects a real but stationary effect." This window is genuinely underpowered for the test we want to run.
- The IS effect itself was already borderline under uniform scoring. The IS 95% CI of [−0.2, 25.1] just barely fails to exclude zero on its lower bound. Under the pre-specified confirmation criterion, the in-sample result is at best suggestive, not in-sample confirmed. The OOS failure to replicate should therefore be read in light of an in-sample effect that did not itself meet the standard required to license a strong OOS claim.
- The IS half is not a bit-exact replication of the original sample. As noted in §2, the May 2026 browser auto-snapshot loss removed approximately 140 snapshots from the pre-cutoff window that existed at the time the original IS finding was stated. The IS results in this paper are computed on the largest comparable surviving subset; they are not the original sample.
- Costs are not modeled. Because the reported statistic is a win-rate contrast rather than a return-distribution contrast, transaction costs can change absolute win rates, especially for observations with small positive 7-day returns where a 0.30% round-trip swap cost would flip the sign of the realised outcome. We do not expect uniform costs to be the primary driver of the quintile contrast, but we have not tested that assumption in this paper.
- No multiple-horizon correction. The 7-day horizon was pre-specified before data was available, so this specific test does not require Bonferroni-style correction across horizons. The validator additionally reports a max-|r| permutation test across six horizons as a separate family-wise diagnostic; that diagnostic is not part of the pre-registered confirmation pipeline reported here.
- Pre-March-9 snapshots lack ETH data. Under rescoring this restricts the IS half to 35 days starting 9 March 2026, against a 35-day matured OOS half starting 8 April 2026. The two halves are of equal duration but the structural market regimes in each window are not necessarily comparable.
- The pre-specified statistic is not a full inverted-U test. As stated in §1 and §3.1, the contrast compares Q2∪Q3 against Q4∪Q5 only and excludes Q1. A full inverted-U claim would require an additional pre-specified contrast against the lower edge; that test is not reported here.
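The back-of-envelope power figures in the first limitation can be reproduced under a normal approximation. The paper does not state the test size behind its minimum detectable effect, so the sketch computes both a one-sided and a two-sided 5% variant; the variable names are illustrative.

```python
from statistics import NormalDist

ci_lo, ci_hi = -30.7, 23.9                    # pooled OOS 95% CI, in pp
z975 = NormalDist().inv_cdf(0.975)            # about 1.96
se = (ci_hi - ci_lo) / 2 / z975               # CI half-width divided by 1.96

z_power = NormalDist().inv_cdf(0.80)          # quantile for 80% power
mde_one_sided = (NormalDist().inv_cdf(0.95) + z_power) * se
mde_two_sided = (z975 + z_power) * se

print(round(se, 1), round(mde_one_sided, 1), round(mde_two_sided, 1))
```

The standard error comes out near 13.9 pp. The one-sided variant lands close to the roughly 35 pp quoted above, while a two-sided 5% test would put the minimum detectable effect nearer 39 pp; either figure is well above the +13.6 pp in-sample gap.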
6. Conclusions
The pre-specified test does not confirm the middle-versus-high composite quintile contrast on the available 90-day snapshot window with 35-day matured OOS span. We do not extend the contrast claim beyond the in-sample observation. Specifically:
- We do not assert that the contrast is generalisable. No CI excluded zero across all sensitivity block lengths, and per-asset directions disagree out-of-sample.
- We do not assert that the contrast is absent. The OOS effective sample size is too small to support an exclusion claim of that form, and the IS per-asset directions agree in the predicted direction.
- We do assert that the previously reported in-sample finding has not survived this window in a form that the test we pre-specified is able to detect, and that the pooled out-of-sample point estimate is not a defensible headline by our own stated framework policy because the per-asset components disagree.
The score_v3 scoring freeze through 22 October 2026 is maintained, consistent with the binding scoring-version discipline described in the Methodology page. The next eligible structural revision (score_v4) is at the freeze lift. By that date a materially larger forward OOS sample will have accumulated; the target for the next report is at least doubling the OOS N_eff and re-evaluating per-asset agreement on the longer window. Interim reports may be issued if material new findings emerge before the freeze lift. The freeze and the pre-specified test are not contingent on this window's result; both remain in force regardless of outcome.
This is precisely the purpose of the research program: separate durable risk structure from attractive but unstable in-sample patterns. A pattern that does not replicate is retired, not promoted.
Citation block
RiskState (2026). First Out-of-Sample Check of the Composite Quintile Contrast.
RiskState Research, 18 May 2026.
https://riskstate.ai/docs/research/oos-validation-2026-05-18
Pre-registered internally 2026-04-08, 22:56 UTC (audit on request).
Cutoff: 2026-04-08 · Block: 21 d / 21 d · Resamples: 2 000 · Seed: 42
Primary (rescored score_v3), pooled:
IS (35 d, N=311, N_eff=49) gap = +13.6 pp, 95% CI [−0.2, 25.1]
OOS (35 d, N=314, N_eff=15) gap = −3.2 pp, 95% CI [−30.7, 23.9]
Per-asset (rescored score_v3):
IS BTC gap = +13.5 pp, 95% CI [−1.5, 26.5] direction: positive
IS ETH gap = +13.1 pp, 95% CI [−3.4, 26.3] direction: positive → AGREE
OOS BTC gap = −6.7 pp, 95% CI [−38.8, 20.1] direction: negative
OOS ETH gap = +1.8 pp, 95% CI [−25.1, 29.2] direction: positive → DISAGREE
Verdict: not confirmed out of sample · classification OOS weakened, not disproven.
Plain-language: the pattern did not replicate; we retire the claim and will re-run
the same test once more forward data has accumulated.
BibTeX
@techreport{riskstate2026oos1,
title = {First Out-of-Sample Check of the Composite Quintile Contrast},
author = {{RiskState}},
institution = {RiskState},
year = {2026},
month = {May},
day = {18},
url = {https://riskstate.ai/docs/research/oos-validation-2026-05-18},
note = {First out-of-sample validation check.
Pre-specified middle-versus-high quintile contrast
on 7-day pooled BTC+ETH win-rate.
Pre-registered internally 2026-04-08 (audit on request).
score\_v3 freeze active through 2026-10-22.}
}
Reproducibility: the dataset, the validator and the seed used here are sufficient to reproduce every number in this paper exactly. The validator lives at scripts/validate-backtest.mjs in the RiskState repository, which is currently private; the script, the pre-registration commit history, and the snapshot dataset are auditable on request. The dataset is the merged snapshot pool described in §2; the server-side portion is produced by the take-snapshot scheduled function running every 4 hours on the RiskState production infrastructure. The seed is 42 throughout.