Backtesting the Policy, Not the Signal

275 live policy decisions (Feb–Jun 2026) · 8-year multi-regime reconstruction (2018–2026) · BTC primary · score_v3.2 · RiskState Research · 11 June 2026


Abstract

Every prior validation in this research program tested the signal: whether the composite score predicts forward returns. This paper tests the product: whether sizing by the engine's policy output, max_size_fraction, contains losses better than benchmarks that are held to the same average exposure. The exposure-matching design is deliberate and severe — it removes the trivial result that smaller positions lose less, and isolates whether the policy's time variation (when it permits more, when it permits less) adds value over simply being small. We run the audit twice: on the engine's 275 stored live policy decisions (17 February – 8 June 2026, mono-regime RANGE tape, no recomputation, no look-ahead), and on a 3,002-day retrospective reconstruction (March 2018 – May 2026) built from public exchange and on-chain history, spanning labeled BULL (1,409 d), BEAR (894 d) and RANGE (699 d) regimes. Three findings. First, the policy's exposure level dominates the context benchmark: versus buy-and-hold, deep-tail frequency at 7 days falls from 23.8% to 3.4% and maximum drawdown from 40.4% to 16.6% on the live record — but most of that is the mean exposure itself, which any constant cap would deliver. Second, the policy's dynamic component does not currently beat a constant cap at the 7-day horizon: on the live record the exposure-matched static benchmark shows fewer −2% tails (Δ = −3.5 pp, 95% CI [−7.3, −0.4]), better expected shortfall (ΔES95 = −0.70%, CI [−0.90, −0.22]) and better mean return, a result that is robust to daily subsampling and present in both halves of an IS/OOS split — while at the 24-hour horizon the dynamics clearly help (tail frequency 1.1% vs 2.9%). The same audit on the 8-year reconstruction finds no regime in which the reconstructed cap chain beats its exposure-matched static benchmark, and finds exposure-matched volatility targeting better on net in most cells. 
Third, a coefficient-elasticity and signal-redundancy analysis locates the likely cause: the engine's highest-elasticity hand-tuned coefficients sit in its momentum normalisation rather than its cap logic, two of its on-chain inputs are mathematically one signal, and realized volatility — the most validated sizing driver in the quantitative literature — does not currently drive position size at all. On this evidence we pre-register the leading score_v4 candidate, realized-volatility-first sizing, ahead of the 1 October 2026 decision gate; shadow telemetry capturing it on every server snapshot has been live since 11 June 2026. The score_v3.2 freeze (through 19 November 2026) is unaffected: every analysis here is read-only and every candidate is shadow-only.

Plain-language summary

A risk engine can look good for a boring reason: it tells you to hold small positions, and small positions lose less. To find out whether our engine does anything smarter than that, we compared it against benchmarks forced to hold the same average position size — one that never changes its size, and one that sizes purely by recent volatility. On the four months of live decisions the engine has made so far, and on an eight-year historical reconstruction across bull, bear and sideways markets, the honest answer is: the engine's level of caution is doing the heavy lifting; its day-to-day adjustments helped over the next 24 hours but slightly hurt over the next 7 days, and a plain volatility rule was on net the better adjuster. We traced the likely causes, published them, and registered — before the evidence cut-off for our next scoring version — exactly what we intend to change and how we will judge whether the change worked. The current scoring version stays frozen until then, as policy requires.

1. Background and motivation

The engine's commercial claim is loss containment, not alpha: max_size_fraction tells a trading system how much exposure is permitted given current conditions. The first paper in this series tested a signal-level claim (composite quintiles vs forward win rates) and reported that it weakened out of sample. But no analysis — internal or published — had ever tested the policy output itself as a sizing rule with benchmarks. That gap is uncomfortable in both directions: if the policy works, the evidence was being left unclaimed; if it does not, the product was being sold on an untested premise.

This paper closes the gap with the hardest fair comparison we could construct, and reports the result either way.

2. Design

2.1 Exposure matching — the core idea

Let ē be the mean of the policy's stored max_size_fraction over the evaluation set. Every decision-relevant benchmark is constrained to the same mean exposure:

Strategy	Exposure at time t	Question it isolates
POLICY	the stored historical `max_size_fraction`	the engine, as it actually decided
STATIC	constant ē	does varying exposure add anything over being small?
VOLTGT	`clamp(s / σ_realized,30d)`, with `s` set by bisection so the mean equals ē	is realized volatility the better dynamic driver?
BUYHOLD	1.0 (not exposure-matched)	context only — the magnitude of the level effect

Exposed return per observation is exposure × forward return. We interpret max_size_fraction as fully utilised — the most conservative reading of the policy as a sizing rule.
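The four exposure rules in the table can be sketched in a few lines. This is a minimal illustration, not the engine's code: the function names are hypothetical, and the clamp bounds of [0, 1] are an assumption, since the text does not state them. The bisection relies only on the fact that mean clamped exposure is non-decreasing in the scale `s`.

```python
from statistics import fmean

def clamp(x, lo=0.0, hi=1.0):
    """Limit x to the interval [lo, hi] (bounds are an assumption)."""
    return max(lo, min(hi, x))

def voltgt_exposures(sigmas, e_bar, lo=0.0, hi=1.0, tol=1e-12):
    """clamp(s / sigma) per observation, with s bisected so the mean equals e_bar."""
    if not lo < e_bar < hi:
        raise ValueError("e_bar must lie strictly inside the clamp bounds")

    def mean_exposure(s):
        return fmean(clamp(s / sig, lo, hi) for sig in sigmas)

    # Mean exposure is non-decreasing in s: it equals lo at s = 0 and hi once
    # s >= hi * max(sigmas), so the target mean is bracketed.
    s_lo, s_hi = 0.0, hi * max(sigmas)
    for _ in range(200):
        s_mid = 0.5 * (s_lo + s_hi)
        if mean_exposure(s_mid) < e_bar:
            s_lo = s_mid
        else:
            s_hi = s_mid
        if s_hi - s_lo < tol:
            break
    s = 0.5 * (s_lo + s_hi)
    return [clamp(s / sig, lo, hi) for sig in sigmas]

def build_exposures(policy, sigmas):
    """Exposure series for the four strategies, matched to the policy's mean."""
    e_bar = fmean(policy)
    return {
        "POLICY": list(policy),                      # stored max_size_fraction
        "STATIC": [e_bar] * len(policy),             # constant cap at the mean
        "VOLTGT": voltgt_exposures(sigmas, e_bar),   # volatility targeting
        "BUYHOLD": [1.0] * len(policy),              # context only, not matched
    }

def exposed_returns(exposure, forward_returns):
    """Exposed return per observation: exposure times forward return."""
    return [e * r for e, r in zip(exposure, forward_returns)]
```

Because STATIC and VOLTGT share the policy's mean exposure by construction, any difference in their exposed returns comes from when exposure is taken, not from how much.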

2.2 Metrics — loss containment first

Metrics are listed in order of relevance to loss containment: left-tail frequency at two thresholds, P(exposed 7-day return < −2%) and P(exposed 7-day return < −5%), referred to below as tail-2 and tail-5; expected shortfall (ES95, the mean of the worst 5% of exposed returns); maximum drawdown of a daily-resampled compounded equity curve; then Sortino, Sharpe and win rate. Paired deltas (POLICY − STATIC, POLICY − VOLTGT) are computed within each bootstrap resample.
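The three loss-containment metrics are simple enough to state as code. The sketch below is illustrative: the function names are hypothetical, and rounding the ES95 tail count up to at least one observation is an assumption about a detail the text leaves open.

```python
import math

def tail_frequency(returns, threshold):
    """Share of exposed returns strictly below the threshold (e.g. -0.02)."""
    return sum(r < threshold for r in returns) / len(returns)

def expected_shortfall(returns, alpha=0.05):
    """Mean of the worst alpha share of returns (ES95 for alpha = 0.05)."""
    worst_first = sorted(returns)
    k = max(1, math.ceil(alpha * len(worst_first)))  # at least one observation
    return sum(worst_first[:k]) / k

def max_drawdown(daily_returns):
    """Largest peak-to-trough loss of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst
```

Starting the equity curve at 1.0 means a loss on the very first day counts as drawdown, which matters for short samples.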

2.3 Inference

Circular block bootstrap (Politis–Romano), 2,000 resamples, deterministic seed (42), with block lengths of 7, 14 and 21 days as a sensitivity check. Bartlett-corrected effective sample sizes are reported alongside nominal counts. Snapshot cadence on the live record is ~4.6 h with overlapping 7-day windows, so autocorrelation is severe; the corrections mitigate rather than eliminate it.
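A paired circular block bootstrap can be sketched as follows. This is an illustration under stated assumptions rather than the audit's code: the function names are hypothetical, the interval is a plain order-statistic percentile interval, and `block_len` is expressed in observations, so a 7-day block on a ~4.6 h cadence would need converting by the caller.

```python
import random

def circular_block_indices(n, block_len, rng):
    """Indices for one resample: random-start blocks that wrap around the end."""
    idx = []
    while len(idx) < n:
        start = rng.randrange(n)
        idx.extend((start + j) % n for j in range(block_len))
    return idx[:n]

def paired_delta_bootstrap(a, b, stat, block_len, n_boot=2000, seed=42):
    """Bootstrap stat(a) - stat(b), using the same indices for both series."""
    if len(a) != len(b):
        raise ValueError("series must be aligned observation by observation")
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        idx = circular_block_indices(len(a), block_len, rng)
        deltas.append(stat([a[i] for i in idx]) - stat([b[i] for i in idx]))
    deltas.sort()
    lo = deltas[int(0.025 * (n_boot - 1))]   # simple percentile interval
    hi = deltas[int(0.975 * (n_boot - 1))]
    p_le_zero = sum(d <= 0 for d in deltas) / n_boot
    return stat(a) - stat(b), (lo, hi), p_le_zero
```

Drawing one index set per resample and applying it to both strategies is what makes the delta paired: both series see the same market path in every resample, so shared noise cancels in the difference.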

2.4 Pre-registration status — stated honestly

The first paper's statistic was pre-registered weeks before its out-of-sample window opened. This paper's design does not have that property: the benchmarks, metrics and inference procedure were specified before any result was inspected, but within the same working session that produced them. We label the live-record and reconstruction results accordingly as a specified-before-inspection audit, one grade below full pre-registration. What is pre-registered by this paper, months ahead of its evidence cut-off, is the score_v4 candidate and its decision criteria (§6).

3. Part I — the live record

3.1 Data

275 stored policy decisions, 17 February – 8 June 2026 (261 with matured 7-day returns), mean exposure ē = 0.351. The stored values are the engine's actual historical output — nothing is recomputed, so look-ahead is structurally impossible. Regime composition is the period's known weakness: 267 RANGE, 7 SQUEEZE, 0 BULL, on a falling tape. Bartlett N_eff on raw 7-day returns is ~8; treat every interval below as low-powered.
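To see why overlapping windows shrink the usable sample so sharply, an effective sample size can be estimated from the series' autocorrelations. The sketch below assumes one common form, N divided by a variance inflation factor built from Bartlett-weighted autocorrelations; the exact estimator and lag choice behind the figure above are not stated in the text, and the names are hypothetical.

```python
import random
from statistics import fmean

def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    m = fmean(x)
    denom = sum((v - m) ** 2 for v in x)
    num = sum((x[i] - m) * (x[i + lag] - m) for i in range(len(x) - lag))
    return num / denom

def bartlett_n_eff(x, max_lag):
    """N / (1 + 2 * sum_k w_k * rho_k), with Bartlett weights w_k = 1 - k/(K+1)."""
    n = len(x)
    inflation = 1.0 + 2.0 * sum(
        (1.0 - k / (max_lag + 1)) * autocorrelation(x, k)
        for k in range(1, max_lag + 1)
    )
    if inflation <= 1.0:
        return float(n)  # assumption: never credit more than the nominal count
    return n / inflation
```

A strongly persistent series, such as 7-day returns sampled every few hours, produces a large inflation factor and therefore an effective count far below the nominal one.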

3.2 The level effect (vs buy-and-hold)

At 7 days: P(exposed return < −5%) is 3.4% for POLICY vs 23.8% for BUYHOLD; ES95 −5.5% vs −13.6%; equity-curve maximum drawdown 16.6% vs 40.4%. This is the number a loss-containment product wants — but the next subsection shows how much of it is simply ē.

3.3 The dynamics (vs exposure-matched benchmarks)

At 7 days, POLICY − STATIC, with every delta signed so that a positive value means the policy is better:

Δ at 7d	Point	95% CI	P(≤0)
left-tail P(< −2%)	−3.5 pp	[−7.3, −0.4]	99.9%
left-tail P(< −5%)	−2.0 pp	[−6.1, 0.0]	100%
ES95	−0.70%	[−0.90, −0.22]	99.7%
mean return	−0.10%	[−0.20, −0.01]	98.7%

The static cap at the policy's own mean exposure had fewer tails, better expected shortfall and better mean return than the policy's time-varying output. The result survives a one-observation-per-day subsample (ES95 Δ −0.77%, CI [−1.23, −0.06]) and appears in both halves of an IS/OOS split at 21 April 2026.

At 24 hours the picture reverses: POLICY shows 1.1% tail-2 frequency vs STATIC's 2.9%, with better ES95. The cap chain reacts well to immediate stress and mistimes the weekly horizon.

Against VOLTGT the live record is unstable: vol targeting wins decisively before the 21 April cutoff, the policy wins after it (ΔES95 +1.23%, CI [+0.29, +1.49]). With N_eff ≈ 8 we do not interpret this beyond "no stable ranking on four months of one regime."

4. Part II — eight years, three regimes

4.1 What the reconstruction is, and is not

RETRO_v1 is a 3,002-day daily dataset (5 March 2018 – 23 May 2026) rebuilt from public history: exchange spot klines (price, RSI, realized volatility), perpetual funding (from 2020-03), the Fear & Greed archive, full-history MVRV from a public on-chain data provider (NUPL follows by identity), and macro series (DXY, 10-year yield, SPX). It is scored with the reconstructable subset of the live engine — momentum, on-chain valuation, macro, and (from 2020-03) funding — using the live normalisation formulas verbatim and the live cap tables imported, not transcribed. Flow signals (CVD, ETF, netflow, whale, positioning) cannot be reconstructed and are absent; the reconstructed policy chain therefore carries only its macro and cycle caps, with the rules and DeFi caps fixed at 1.0. Regime labels are a transparent kline-only rule (price vs SMA200 with 30-day slope). It is not the live engine and it is not the track record — it is the only available instrument for asking how this family of logic behaves outside a RANGE regime.
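A kline-only regime rule of the kind described could look like the sketch below. The text gives only "price vs SMA200 with 30-day slope", so the specifics here are assumptions: the slope is taken as the change in the SMA over 30 days, BULL requires price above a rising SMA, BEAR requires price below a falling SMA, and everything else is RANGE. The function names are hypothetical.

```python
def sma(values, window, t):
    """Simple moving average of the `window` values ending at index t."""
    return sum(values[t - window + 1 : t + 1]) / window

def label_regimes(closes, sma_window=200, slope_window=30):
    """Label each day BULL, BEAR or RANGE; None while the windows warm up."""
    labels = []
    warmup = sma_window - 1 + slope_window
    for t in range(len(closes)):
        if t < warmup:
            labels.append(None)
            continue
        sma_now = sma(closes, sma_window, t)
        sma_then = sma(closes, sma_window, t - slope_window)
        if closes[t] > sma_now and sma_now > sma_then:
            labels.append("BULL")
        elif closes[t] < sma_now and sma_now < sma_then:
            labels.append("BEAR")
        else:
            labels.append("RANGE")
    return labels
```

A rule like this uses only data available at each date, which is what makes it usable as a label in a retrospective reconstruction without introducing look-ahead.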

4.2 The signal does not generalise through the subset

The pre-specified quintile contrast (Q2∪Q3 − Q4∪Q5 win rate, 7-day), run on the subset composite over eight years:

Subset	n	Gap	95% CI	N_eff
POOLED	3,002	−1.0 pp	[−6.2, +4.3]	368
BULL	1,409	+3.5 pp	[−3.5, +10.6]	174
BEAR	894	+2.3 pp	[−6.7, +11.2]	132
RANGE	699	+1.7 pp	[−9.1, +12.6]	116

No cell's 95% confidence interval excludes zero. Adding the funding component makes BULL worse (−4.7 pp, CI [−13.7, +4.7]), which is consistent with the hypothesis, flagged in the first paper's follow-up list, that the funding normalisation is regime-conditional. The February–April 2026 in-sample contrast either lives in the non-reconstructable flow signals or was period-specific; the reconstruction cannot distinguish these, but it rules out "the reconstructable core carries the edge."
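The contrast statistic in the table can be written out as follows. This is a sketch under assumptions: quintiles are equal-count buckets by score rank with Q1 the lowest scores, and a win is a strictly positive forward return; neither detail is spelled out in the text, and the names are hypothetical.

```python
def quintile_labels(scores):
    """Assign 1..5 by score rank in equal-count buckets (Q1 = lowest scores)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = 1 + (5 * rank) // n
    return labels

def win_rate(returns):
    """Share of strictly positive forward returns."""
    return sum(r > 0 for r in returns) / len(returns)

def quintile_contrast(scores, forward_returns):
    """Win rate of Q2 and Q3 pooled, minus win rate of Q4 and Q5 pooled."""
    labels = quintile_labels(scores)
    mid = [r for r, q in zip(forward_returns, labels) if q in (2, 3)]
    top = [r for r, q in zip(forward_returns, labels) if q in (4, 5)]
    return win_rate(mid) - win_rate(top)
```

The per-regime rows would apply the same function to the subset of days carrying each regime label, with the confidence interval coming from the block bootstrap rather than from this point estimate.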

4.3 The policy result repeats in every regime

The exposure-matched audit, per regime, 7-day horizon: the reconstructed cap chain beats its static benchmark nowhere. The sharpest cell is BEAR deep tails — POLICY − STATIC Δtail-5 = −1.45 pp, CI [−2.46, −0.56]: in the regime where a loss-containment product most needs its dynamics, the reconstructed chain produced more deep tails than a constant. Exposure-matched volatility targeting is on net the better dynamic rule (pooled Δtail-2 −1.3 pp, P(≤0) = 96%; RANGE ΔES95 −1.20%, P(≤0) = 95%).

Two separately built datasets (four months of live decisions and eight years of reconstruction) agree: the value of the current policy is its level; its dynamics are not yet earning their complexity at the weekly horizon, and volatility is the better dynamic driver.

5. Part III — diagnosis

Two read-only analyses locate the likely causes.

Coefficient elasticity. We perturbed each of 17 hand-tuned coefficients by ±10–30% on the reconstruction and re-ran both the contrast and the policy audit; the coefficients separate cleanly into a high-elasticity group and a near-zero group. The highest-elasticity coefficients in the engine are the momentum (RSI) normalisation centres, where modest perturbations move per-regime contrast gaps by up to ~7 pp. They are followed by the funding normalisation (where the choice of branch between percentile and parametric normalisation is itself a ~5 pp decision in BEAR), then the composite weights (2–3 pp). Every cap-side coefficient tested (cycle-cap table values, valuation-percentile clips, the capitulation-boost parameters) moves outcomes by approximately zero. The engine's calibration risk is concentrated in its normalisers, not its caps; the cap-side magic numbers can be documented and left alone.
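A one-at-a-time perturbation scan of this kind can be sketched generically. The harness below is illustrative: `evaluate` stands in for whatever re-runs the contrast or the policy audit and returns a single number, the coefficient names in the usage example are hypothetical, and summarising each coefficient by its largest absolute move is an assumption.

```python
def elasticity_scan(coeffs, evaluate, shocks=(-0.3, -0.2, -0.1, 0.1, 0.2, 0.3)):
    """Largest absolute change in evaluate() when each coefficient is shocked alone."""
    base = evaluate(coeffs)
    sensitivity = {}
    for name, value in coeffs.items():
        moves = []
        for shock in shocks:
            trial = dict(coeffs)               # leave the baseline untouched
            trial[name] = value * (1.0 + shock)
            moves.append(abs(evaluate(trial) - base))
        sensitivity[name] = max(moves)
    # Most sensitive coefficient first.
    return dict(sorted(sensitivity.items(), key=lambda kv: kv[1], reverse=True))

# Toy usage: the outcome depends on one coefficient and ignores the other.
toy = elasticity_scan(
    {"rsi_centre": 50.0, "cycle_cap": 0.8},
    evaluate=lambda c: 0.1 * c["rsi_centre"],
)
```

One-at-a-time shocks do not capture interactions between coefficients, so a scan like this ranks where calibration risk sits rather than measuring its total size.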

Signal redundancy. Two of the on-chain inputs are mathematically one signal: NUPL is an exact transform of MVRV (rank correlation 1.000), so the on-chain family scores one signal twice through two curves. MVRV and ATH-drawdown correlate at 0.93 within the cycle inputs; DXY and the 10-year yield at 0.69 within the macro family. An eigenvalue decomposition of the live subscore correlation matrix puts the composite's effective signal count at ~6.0 of 7 nominal — moderate, but weight rebalancing at the next version should operate on signal clusters, not on subscores as if independent.
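Both redundancy checks are easy to reproduce in miniature. With MVRV defined as market cap over realized cap and NUPL as (market cap − realized cap) / market cap, NUPL = 1 − 1/MVRV, a strictly increasing transform, so the rank correlation is exactly 1. For the effective signal count, the sketch assumes a participation ratio, (Σλ)² / Σλ², which for a correlation matrix equals n² divided by the sum of squared entries; the text does not say which eigenvalue-based measure was used, and the function names are hypothetical.

```python
import math

def ranks(values):
    """Rank positions 0..n-1 (assumes no ties, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, i in enumerate(order):
        out[i] = float(rank)
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def spearman(x, y):
    """Rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def nupl_from_mvrv(mvrv):
    """NUPL = (MC - RC) / MC = 1 - 1 / MVRV, where MVRV = MC / RC."""
    return 1.0 - 1.0 / mvrv

def effective_signal_count(corr):
    """Participation ratio: trace(C)^2 / trace(C^2) = n^2 / sum of squared entries."""
    n = len(corr)
    return n * n / sum(c * c for row in corr for c in row)
```

The participation ratio returns n for fully independent signals and 1 when every signal is a copy of the others, which brackets the interpretation of a value such as ~6.0 of 7.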

The missing driver. Realized volatility — the basis of the single most validated sizing rule in the literature, and the better benchmark in both Parts I and II — does not currently enter max_size_fraction at all. The engine's volatility regime is proxied from sentiment and positioning inputs rather than measured from returns.

6. What this evidence pre-registers for `score_v4`

The score_v3.2 freeze runs through 19 November 2026; the decision gate that selects score_v4 content runs on 1 October 2026. This paper pre-registers, as of 11 June 2026:

Candidate: realized-volatility-first sizing — clamp(σ_target / σ_realized,30d) as the primary dynamic multiplier on the cap chain's level, with the current sentiment-proxied volatility score demoted to a modifier. Candidate values at three target-volatility levels have been captured in a shadow block on every 4-hour server snapshot since 11 June 2026 (shadow output is excluded from the audit hash and from all live surfaces, per the freeze's shadow-scoring rules; a static guard in CI enforces the hash exclusion).
Decision criteria at the gate: the candidate ships only if, on the accumulated shadow + live record, exposure-matched comparison shows the vol-first rule (a) not worse than the live policy on 24-hour tails, and (b) better on 7-day tails and ES95, with per-asset directions agreeing — the same evidence standard the first paper applied to the signal contrast.
Also queued from this evidence: momentum-normalisation recalibration (highest elasticity), funding normalisation regime-conditionality (the BULL finding in §4.2), on-chain dedup (the MVRV/NUPL identity), and the previously disclosed server-side CVD acceleration fix — see the Known Scoring Divergence note published the same day as this paper.
What would falsify the candidate: shadow data showing vol-first sizing increasing 24-hour tails relative to the live policy, or failing to improve 7-day tails once the live record includes a non-RANGE stretch. Default on failure: extend shadow, do not ship — the same default the V4 remediation list applies everywhere else.
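The candidate and its gate can be expressed compactly. The sketch below is a reading of the registration above, not the shadow implementation: the clamp bounds of [0, 1], the annualisation by 365 periods, the use of point estimates rather than intervals, and all function and key names are assumptions. Deltas are taken as candidate minus live policy, signed so that positive means the candidate is better.

```python
import math
from statistics import stdev

def realized_vol(daily_log_returns, periods_per_year=365):
    """Annualised realized volatility from daily log returns (e.g. last 30)."""
    return stdev(daily_log_returns) * math.sqrt(periods_per_year)

def vol_first_multiplier(sigma_target, sigma_realized, lo=0.0, hi=1.0):
    """clamp(sigma_target / sigma_realized); bounds are an assumption."""
    if sigma_realized <= 0.0:
        return hi  # no measured volatility: fall back to the upper bound
    return max(lo, min(hi, sigma_target / sigma_realized))

def vol_first_size(cap_level, sigma_target, sigma_realized):
    """Cap-chain level scaled by the volatility multiplier."""
    return cap_level * vol_first_multiplier(sigma_target, sigma_realized)

def passes_gate(deltas_by_asset):
    """Ship only if every asset meets criteria (a) and (b)."""
    for deltas in deltas_by_asset.values():
        if deltas["tail_24h"] < 0:        # (a) not worse on 24-hour tails
            return False
        if deltas["tail_7d"] <= 0:        # (b) better on 7-day tails
            return False
        if deltas["es95_7d"] <= 0:        # (b) better on 7-day ES95
            return False
    return True
```

Requiring every asset to pass every criterion is one way to encode "per-asset directions agreeing"; a single asset pointing the other way returns False, which matches the stated default of extending shadow rather than shipping.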

7. Limitations

Live record: one regime (RANGE), four months, N_eff ≈ 8 on 7-day returns. The dynamics exist substantially for regime transitions, which a mono-regime sample cannot reward. This is precisely why Part II exists, and why the live result alone would not justify §6.
Reconstruction: subset composite (roughly half the live engine's signal weight); partial cap chain (macro + cycle only); kline-only regime labels; MVRV from a current-chain recomputation (minor as-of revisions possible); funding history begins 2020-03; daily granularity. Findings are calibration inputs, not track-record claims.
Exposure interpretation: max_size_fraction is a cap, not necessarily a deployed size; this audit evaluates it as if fully used, the most conservative reading.
No costs: none of the strategies pay fees, funding or slippage; the omission is symmetric but favours the more active strategies (POLICY and VOLTGT) — which makes the static benchmark's showing conservative.
Pre-registration grade: specified-before-inspection, not month-ahead pre-registration (§2.4). The §6 registration is the corrective for the next cycle.

8. Conclusions

The audit answers the question a sophisticated buyer should ask first — "does your sizing rule beat a constant at the same average exposure?" — and the current answer is: at 24 hours yes, at 7 days not yet, and a plain volatility rule is the stronger dynamic driver on both datasets. The engine's loss-containment value today comes from its exposure level and its short-horizon reactions. That is a real, defensible product property — and an honest ceiling that the next scoring version now has a pre-registered, shadow-instrumented, falsifiable plan to raise. The freeze holds until the evidence decides.

Citation block

RiskState (2026). Backtesting the Policy, Not the Signal:
an exposure-matched audit of max_size_fraction.
RiskState Research, 11 June 2026.
https://riskstate.ai/docs/research/policy-backtest-2026-06-11

Design: specified-before-inspection · Blocks: 7/14/21 d · Resamples: 2 000 · Seed: 42

Live record (275 decisions, Feb 17 – Jun 8 2026, ē = 0.351, N_eff ≈ 8):
  vs BUYHOLD     7d P(<−5%) 3.4% vs 23.8% · maxDD 16.6% vs 40.4%   (level effect)
  vs STATIC(ē)   7d Δtail-2 = −3.5 pp, 95% CI [−7.3, −0.4]
                 7d ΔES95  = −0.70%,   95% CI [−0.90, −0.22]
                 24h tail-2 1.1% vs 2.9%                            (dynamics help short)

Reconstruction (RETRO_v1, 3 002 d, 2018–2026, BULL/BEAR/RANGE):
  contrast: no regime cell clears zero (pooled −1.0 pp [−6.2, +4.3])
  policy:   BEAR Δtail-5 vs STATIC = −1.45 pp, 95% CI [−2.46, −0.56]
  vol targeting better on net in most cells

Pre-registered for score_v4 (gate 2026-10-01): realized-volatility-first sizing,
shadow-captured per 4h snapshot since 2026-06-11. Freeze through 2026-11-19 maintained.

BibTeX

@techreport{riskstate2026policy1,
  title       = {Backtesting the Policy, Not the Signal:
                 an Exposure-Matched Audit of max\_size\_fraction},
  author      = {{RiskState}},
  institution = {RiskState},
  year        = {2026},
  month       = {June},
  day         = {11},
  url         = {https://riskstate.ai/docs/research/policy-backtest-2026-06-11},
  note        = {Exposure-matched policy audit on 275 live decisions plus an
                 8-year multi-regime public-history reconstruction.
                 Pre-registers realized-volatility-first sizing for score\_v4
                 (decision gate 2026-10-01). score\_v3.2 freeze through
                 2026-11-19 maintained.}
}