Backtest

This page describes the framework RiskState uses to validate its outputs. It does not publish current results — dated findings, out-of-sample reports and version comparisons live in Research.

The goal of this page is falsifiability: a reader should come away knowing exactly what question each test answers, what could make a result invalid, and how a third party could reproduce the analysis against their own snapshot dataset.

Query the same framework via API. The infrastructure described below is also exposed as POST /v2/backtest/shadow (alpha) — submit your own daily portfolio path (up to 30 days) and get per-day policy assessment + aggregate distribution metrics across BTC and ETH. See API v2 → /v2/backtest/shadow. The endpoint is the evidence engine a pilot uses to validate "would the policy have helped me over the last 30 days?" without ever sharing PnL with RiskState.

Signal validation vs Policy validation

RiskState produces two classes of output, and validation operates on both:

  • Scoring outputs: composite, structural and tactical scores. Signal validation answers: does score X carry predictive information about forward returns?
  • Policy outputs: direction_bias, risk_permission_score, policy_level, blocked_actions, max_size_fraction. Policy validation answers: do policy outputs materially alter outcomes relative to a naive baseline?

Both strands share the same dataset and the same statistical machinery (pre-specification, permutation, block bootstrap, effective sample size). They are reported separately because they answer different questions: is the engine informative (signal) vs. is the engine useful (policy).

Signal validation is the more mature strand today. Policy validation is reported at framework level below and executed as the dataset matures — metrics like blocked-action counterfactuals or bucket MAE distributions require longer series with multi-regime coverage than the current dataset provides.

What we measure

Every API response is captured as a snapshot: the full input state (prices, indicators, caps, composite, structural/tactical scores, regime, volatility, policy level) plus a timestamp and a policy_hash. Snapshots are stored server-side.

A forward return is attached to each snapshot after the fact, computed from the actual market outcome at a fixed horizon:

Window   Horizon    Purpose
24h      ~1 day     Tactical signal edge
72h      ~3 days    Short-term structural tilt
7d       ~1 week    Primary validation horizon
30d      ~1 month   Long-horizon structural validation (slow to accumulate)

Forward returns are filled from public price data (klines), both browser-side and server-side, so the process runs independently of whether the dashboard is open.

Alongside returns, every snapshot gets a max adverse excursion (MAE) and max favorable excursion (MFE) — the worst drawdown and best gain observed intra-window — enabling quintile edge-ratio analysis beyond raw directional returns.
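A minimal sketch of how the three per-snapshot quantities could be derived from a price path. It assumes a long-side view measured against the snapshot's entry price; the function names and the dictionary keys are illustrative, not the engine's own.

```python
from typing import Dict, Sequence


def window_stats(entry_price: float, path: Sequence[float]) -> Dict[str, float]:
    """Forward return, MAE and MFE for one snapshot (long-side view, assumed).

    `path` holds the prices observed inside the forward window in time order;
    its last element is the price at the horizon.
    """
    if entry_price <= 0 or not path:
        raise ValueError("need a positive entry price and a non-empty path")
    returns = [p / entry_price - 1.0 for p in path]
    return {
        "forward_return": returns[-1],      # outcome at the fixed horizon
        "mae": min(0.0, min(returns)),      # worst intra-window excursion, <= 0
        "mfe": max(0.0, max(returns)),      # best intra-window excursion, >= 0
    }


def edge_ratio(stats: Dict[str, float]) -> float:
    """MFE / |MAE|; infinite when the window never went against the position."""
    if stats["mae"] == 0.0:
        return float("inf")
    return stats["mfe"] / abs(stats["mae"])
```

Averaging `edge_ratio` inside each score quintile gives the quintile edge-ratio view mentioned above.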

What we are answering

Signal validation questions

  1. Does the composite score carry predictive information at a given horizon? — correlation analysis (Pearson + Spearman + permutation p-values), plus non-linear quintile analysis.
  2. Is the relationship monotonic? — a full Q1 < Q2 < Q3 < Q4 < Q5 ranking check, not just "do extremes beat each other."
  3. Does the signal survive when controlled for asset beta and sample overlap? — Sharpe per quintile (not raw returns), block bootstrap for time-series dependency, and rank-based tercile contrast for scale invariance.
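The monotonicity check in question 2 can be sketched as follows. This is an illustration under simple assumptions (equal-count quintiles by score, mean forward return per bucket); the helper names are hypothetical.

```python
from typing import List, Sequence


def quintile_means(scores: Sequence[float], returns: Sequence[float]) -> List[float]:
    """Mean forward return per score quintile, Q1 (lowest scores) to Q5."""
    if len(scores) != len(returns) or len(scores) < 5:
        raise ValueError("need at least 5 paired observations")
    pairs = sorted(zip(scores, returns), key=lambda p: p[0])
    n = len(pairs)
    means = []
    for i in range(5):
        bucket = [r for _, r in pairs[i * n // 5:(i + 1) * n // 5]]
        means.append(sum(bucket) / len(bucket))
    return means


def is_strictly_increasing(values: Sequence[float]) -> bool:
    """True only when Q1 < Q2 < Q3 < Q4 < Q5 holds for every adjacent pair."""
    return all(a < b for a, b in zip(values, values[1:]))
```

A result where only Q5 beats Q1 but the middle buckets are out of order fails this check, which is the distinction the question draws.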

Post-PR2 signal questions

From scoring_version score_v3 onward, the composite-only framing is insufficient. Four additional questions target the new architecture:

  1. Does tactical add edge conditional on neutral-to-strong structural? — does the combined (structural + tactical) view outperform structural alone when structural is in the neutral / stabilizing band?
  2. Do structural vetoes reduce left-tail outcomes? — when the combiner overrides a tactical LONG/SHORT with direction_layer = structural_veto, do forward outcomes in those windows have lower MAE than the vetoed direction would have captured?
  3. Does the regime-aware combiner outperform a fixed-weight blend? — benchmark the regime-dependent weights (PANIC / EUPHORIA / SQUEEZE / TREND-RANGE) against a fixed 50/50 or 60/40 blend.
  4. Are policy permissions more informative than the legacy composite alone? — does the risk_permission_score provide incremental predictive lift over composite.overall?

Policy validation questions (framework-level)

These sit one layer downstream of signal validation. The user consumes permissions, not scores; the promise of the engine is economic, not merely predictive.

  1. Does max_size_fraction guidance improve return / risk vs a constant-notional baseline? — holding a fixed notional through the whole dataset versus sizing per the engine's guidance.
  2. Do blocked_actions periods avoid worse outcomes than non-blocked periods of comparable regime? — counterfactual framing: in windows where NEW_TRADES is blocked, what would a naive opener have captured?
  3. Does MAE distribution differ across policy_level buckets in the direction the engine claims? — level 1–2 (BLOCK) should have right-skewed MAE distributions (more bad tails avoided) compared to levels 4–5.
  4. Does direction_bias capture forward drift conditional on regime? — directional edge in LONG_PREFERRED / SHORT_PREFERRED windows, separated per market_regime.

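Numbered consecutively across the three lists above, questions 1–7 (signal validation, including the post-PR2 questions) are tested on the current dataset. Questions 8–11 (policy validation) are reported at framework level here and executed in Research as the dataset matures.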
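As an illustration of the first policy question, the comparison between engine-sized exposure and a constant-notional baseline could look like the sketch below. It assumes one forward return and one max_size_fraction per period, and uses mean over standard deviation as the return / risk measure; both choices are assumptions for the example, not the published metric.

```python
import statistics
from typing import Dict, Sequence


def return_over_risk(returns: Sequence[float]) -> float:
    """Mean return divided by population standard deviation (NaN if flat)."""
    sd = statistics.pstdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else float("nan")


def compare_sizing(
    returns: Sequence[float],
    size_fractions: Sequence[float],
    baseline_fraction: float = 1.0,
) -> Dict[str, float]:
    """Return / risk with per-period sizing versus a fixed notional."""
    if len(returns) != len(size_fractions):
        raise ValueError("returns and size_fractions must align")
    sized = [f * r for f, r in zip(size_fractions, returns)]
    baseline = [baseline_fraction * r for r in returns]
    return {
        "engine": return_over_risk(sized),
        "baseline": return_over_risk(baseline),
    }
```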

Statistical tooling

Permutation test

For every correlation between composite (or any subscore) and forward returns, a permutation p-value is computed alongside the parametric (Fisher z) p-value. Composite-return pairs are shuffled 2000 times; the permutation p is the fraction of shuffled correlations with |r_shuffled| ≥ |r_real|.

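A correlation is only flagged significant when both parametric and permutation tests pass. The reason: overlapping windows (adjacent snapshots share most of their forward-return period) make parametric p-values artificially small. The permutation test drops the distributional assumptions behind the parametric p-value; because plain shuffling still treats pairs as exchangeable, it is read together with the block bootstrap and N_eff adjustments described below rather than as a standalone correction.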
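The permutation p-value described above can be sketched in a few lines of standard-library Python. The function names are illustrative; the shuffle count of 2000 and the two-sided |r| comparison follow the description, while the default seed is an assumption.

```python
import math
import random
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; returns 0.0 when either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def permutation_p(
    x: Sequence[float], y: Sequence[float], n_perm: int = 2000, seed: int = 0
) -> float:
    """Fraction of shuffles with |r_shuffled| >= |r_real|."""
    rng = random.Random(seed)          # seeded so reruns give the same p-value
    r_real = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)          # break the score/return pairing
        if abs(pearson(x, shuffled)) >= r_real:
            hits += 1
    return hits / n_perm
```

With 2000 shuffles the smallest non-zero p-value this estimator can report is 1/2000 = 0.0005.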

Pre-specified quintile contrasts with circular block bootstrap

Every validation window declares a small set of pre-specified contrasts on quintile win rates before looking at that window's data. One such contrast is (Q2∪Q3 win rate) − (Q4∪Q5 win rate) at the 7-day horizon, reflecting an inverted-U shape observed in earlier data. It is one of several contrasts run in parallel — not "the" headline — and is reported alongside LOWESS shape, bucket MAE/MFE, regime-conditional decomposition and the PR2-native questions above.

95% confidence intervals come from a circular block bootstrap (Politis & Romano, 1994). Block length is chosen relative to the data's autocorrelation structure and tested at multiple lengths (14 / 21 / 28 days) for sensitivity. A result is considered confirmed only when the 95% CI excludes zero across all block lengths.

Small-sample guards: the bootstrap is skipped when block length ≥ n, and a warning is raised when block length is more than half of n. Contrast form was informed by earlier data; the contrast is pre-specified per out-of-sample window, not re-selected within the window.
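A sketch of the circular block bootstrap applied to the (Q2∪Q3) − (Q4∪Q5) win-rate contrast, including the two small-sample guards. It assumes one observation per day, so block lengths of 14 / 21 / 28 are counted in rows, and it uses simple percentile intervals; function names, the resample count and the seed are hypothetical.

```python
import random
import warnings
from typing import List, Optional, Sequence, Tuple

Row = Tuple[int, int]  # (quintile 1..5, 1 if forward return > 0 else 0)


def contrast(rows: Sequence[Row]) -> Optional[float]:
    """(Q2+Q3 win rate) minus (Q4+Q5 win rate); None if a group is empty."""
    mid = [w for q, w in rows if q in (2, 3)]
    top = [w for q, w in rows if q in (4, 5)]
    if not mid or not top:
        return None
    return sum(mid) / len(mid) - sum(top) / len(top)


def circular_block_indices(n: int, block_len: int, rng: random.Random) -> List[int]:
    """Concatenate blocks that wrap around the end of the series."""
    idx: List[int] = []
    while len(idx) < n:
        start = rng.randrange(n)
        idx.extend((start + k) % n for k in range(block_len))
    return idx[:n]


def block_bootstrap_ci(
    rows: Sequence[Row], block_len: int, n_boot: int = 1000, seed: int = 0
) -> Optional[Tuple[float, float]]:
    n = len(rows)
    if block_len >= n:
        return None                     # guard: bootstrap skipped
    if block_len > n / 2:
        warnings.warn("block length exceeds half the sample")
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rows[i] for i in circular_block_indices(n, block_len, rng)]
        value = contrast(sample)
        if value is not None:
            stats.append(value)
    if not stats:
        return None
    stats.sort()
    lo = stats[int(0.025 * len(stats))]
    hi = stats[min(len(stats) - 1, int(0.975 * len(stats)))]
    return lo, hi


def confirmed(rows: Sequence[Row], block_lens: Sequence[int] = (14, 21, 28)) -> bool:
    """True only when the 95% CI excludes zero at every block length."""
    cis = [block_bootstrap_ci(rows, b) for b in block_lens]
    return all(ci is not None and (ci[0] > 0 or ci[1] < 0) for ci in cis)
```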

Effective sample size (Bartlett)

Time-series autocorrelation means nominal n overstates the number of independent observations. N_eff is computed via Bartlett's autocorrelation-adjusted formula (positive-only ρ summation, capped at n) and used for degrees of freedom and Fisher z standard errors. Confidence intervals widen accordingly: they do not inherit the false precision of nominal n.
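One common form of the adjustment is N_eff = n / (1 + 2 Σ ρ_k). The sketch below uses that single-series form and reads "positive-only" as summing lags until the first non-positive autocorrelation; both are assumptions, since the exact variant is not spelled out here.

```python
from typing import Optional, Sequence


def autocorr(x: Sequence[float], lag: int) -> float:
    """Sample autocorrelation at a given lag (0.0 for a constant series)."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    if denom == 0:
        return 0.0
    num = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return num / denom


def effective_n(x: Sequence[float], max_lag: Optional[int] = None) -> float:
    """n / (1 + 2 * sum of leading positive autocorrelations), capped at n."""
    n = len(x)
    if n < 2:
        return float(n)
    limit = n - 1 if max_lag is None else min(max_lag, n - 1)
    total = 0.0
    for lag in range(1, limit + 1):
        rho = autocorr(x, lag)
        if rho <= 0:                    # assumed cutoff: stop at first non-positive lag
            break
        total += rho
    return min(float(n), n / (1.0 + 2.0 * total))
```

A strongly trending series yields an N_eff far below n, while a series with negative lag-1 autocorrelation is left at n because of the cap.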

Sharpe per quintile

Raw quintile returns are confounded by per-asset volatility differences (BTC vs ETH do not have the same risk). The primary comparison across assets is Sharpe per quintile, removing the beta confound. Published Research posts default to Sharpe unless raw returns are specifically informative for the point being made.
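A minimal sketch of Sharpe per quintile, assuming a zero risk-free rate, no annualisation, and equal-count quintiles; the function name is illustrative. Because the ratio is scale-free, doubling every return (a more volatile asset with the same shape) leaves each quintile's Sharpe unchanged, which is the property that makes cross-asset comparison possible.

```python
import statistics
from typing import List, Sequence


def sharpe_per_quintile(scores: Sequence[float], returns: Sequence[float]) -> List[float]:
    """Mean / sample stdev of forward returns inside each score quintile."""
    if len(scores) != len(returns) or len(scores) < 10:
        raise ValueError("need at least 10 paired observations")
    pairs = sorted(zip(scores, returns), key=lambda p: p[0])
    n = len(pairs)
    out = []
    for i in range(5):
        bucket = [r for _, r in pairs[i * n // 5:(i + 1) * n // 5]]
        sd = statistics.stdev(bucket)
        out.append(statistics.mean(bucket) / sd if sd > 0 else float("nan"))
    return out
```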

Asset-level first, pooled second

Results are computed per asset (BTC-only, ETH-only) before any pooled BTC+ETH aggregation. Pooled results are reported only when per-asset directions agree on the window — same sign of effect, overlapping confidence intervals. When per-asset results disagree, the pooled aggregate is suppressed and the disagreement is surfaced explicitly; the engine-level conclusion is then per-asset, not averaged.

The reason: pooling can synthesise a "mean signal" that is not representative of either asset, especially when ETH has different structural dynamics (staking, L2 settlement, more volatile regime transitions) than BTC.
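The pooling rule above reduces to a two-part gate. A sketch, with each asset's result represented as a hypothetical (effect, ci_low, ci_high) tuple:

```python
from typing import Tuple

Result = Tuple[float, float, float]  # (point estimate, CI lower, CI upper)


def pooling_allowed(btc: Result, eth: Result) -> bool:
    """Pool only when effects share a sign and their intervals overlap."""
    same_sign = btc[0] * eth[0] > 0
    overlap = btc[1] <= eth[2] and eth[1] <= btc[2]
    return same_sign and overlap
```

When the gate returns False, the per-asset results are what gets reported and no pooled figure is computed.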

Fisher z-transform 95% CI

All correlation tables carry 95% CIs via Fisher z-transform, using N_eff rather than nominal n. CIs wider than 0.6 are flagged — they carry little information even when the point estimate looks strong.
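A sketch of the interval computation, using the standard Fisher z standard error 1 / sqrt(N_eff − 3) and the 0.6 width flag described above. The function name and return shape are illustrative.

```python
import math
from typing import Tuple


def fisher_ci(r: float, n_eff: float, z_crit: float = 1.959964) -> Tuple[float, float, bool]:
    """95% CI for a correlation via Fisher z; third element flags width > 0.6."""
    if n_eff <= 3:
        raise ValueError("n_eff must exceed 3 for the Fisher z standard error")
    if abs(r) >= 1:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                     # Fisher z-transform
    se = 1.0 / math.sqrt(n_eff - 3.0)     # uses N_eff, not nominal n
    lo = math.tanh(z - z_crit * se)
    hi = math.tanh(z + z_crit * se)
    return lo, hi, (hi - lo) > 0.6
```

For the same point estimate, a smaller N_eff gives a wider interval: r = 0.3 is not flagged at N_eff = 100 but is flagged at N_eff = 20.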

Rank-based tercile contrast

A second validation layer, robust to scale changes: independently compute top vs bottom tercile win rates by rank (not by score level). Used specifically when comparing scoring versions (v1 vs v2, etc.), because a monotone transformation of the composite may shift quintile boundaries while preserving the signal. Rank-based tests tell us whether the information content changed, independently of the distribution.
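A sketch of the rank-based contrast. It only uses the ordering of the scores, so any strictly increasing transformation of the composite leaves the result unchanged; the function name is hypothetical.

```python
from typing import Sequence


def tercile_contrast(scores: Sequence[float], returns: Sequence[float]) -> float:
    """Top-tercile win rate minus bottom-tercile win rate, by score rank."""
    if len(scores) != len(returns) or len(scores) < 3:
        raise ValueError("need at least 3 paired observations")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    k = len(order) // 3
    bottom, top = order[:k], order[-k:]

    def win_rate(indices):
        return sum(1 for i in indices if returns[i] > 0) / len(indices)

    return win_rate(top) - win_rate(bottom)
```

Comparing two scoring versions then means running the same function on each version's scores against the same forward returns.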

LOWESS smoother

For visual inspection of the score → P(return > 0) relationship, a LOWESS smoother (tricube kernel) with bootstrap 95% band renders the true shape without imposing an arbitrary bin count. This is how we discover non-monotonic patterns (inverted-U, regime-dependent) that Pearson r would miss.
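For illustration, a simplified LOWESS can be written as a tricube-weighted local linear fit. This sketch omits the robustness iterations and the bootstrap band, and the `frac` bandwidth parameter is an assumption; passing 0/1 win indicators as `y` gives a smoothed estimate of P(return > 0) at each score.

```python
import math
from typing import List, Sequence


def lowess(x: Sequence[float], y: Sequence[float], frac: float = 0.5) -> List[float]:
    """Local linear regression with tricube weights, evaluated at each x."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    k = min(n, max(2, math.ceil(frac * n)))   # neighbours inside each window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1] or 1e-12             # bandwidth: k-th nearest distance
        w = [(1.0 - min(1.0, abs(xi - x0) / h) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted
```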

Reproducibility pipeline

The same statistical suite that runs inside the dashboard also runs offline as a standalone script, deterministic under a fixed seed, against an exported CSV of snapshots.

  • Deterministic seed — every randomised step (permutation, bootstrap resample, subsample) is seeded. Re-running the validator against the same CSV produces bit-identical output.
  • Block length sensitivity — every bootstrap result is reported at multiple block lengths; a finding that only works at one block length is flagged as non-robust.
  • Replications welcome — reviewers are invited to replicate the analysis against their own snapshot datasets.
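The seeding discipline in the list above can be checked mechanically. The sketch below is not the validator itself: it runs one seeded resampling step over CSV text and hashes the output so two runs can be compared. The column name `fwd_return_7d` is hypothetical, and the plain resample stands in for whichever randomised step is being verified.

```python
import csv
import hashlib
import io
import random
from typing import List


def seeded_bootstrap_means(
    csv_text: str, column: str, n_boot: int = 200, seed: int = 42
) -> List[float]:
    """Resample one CSV column with a private, seeded random generator."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    values = [float(row[column]) for row in rows]
    rng = random.Random(seed)           # no reliance on global random state
    means = []
    for _ in range(n_boot):
        sample = [rng.choice(values) for _ in values]
        means.append(sum(sample) / len(sample))
    return means


def digest(values: List[float]) -> str:
    """Stable fingerprint of a result vector for run-to-run comparison."""
    return hashlib.sha256(repr(values).encode("utf-8")).hexdigest()
```

Two runs with the same seed and the same CSV should produce the same digest; a mismatch points to an unseeded step or a changed input file.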

What is and is not reproducible externally is enumerated below in Reproducibility bounds.

Validation maturity

Validation is an ongoing process, not a one-shot claim. Our current posture:

  • In-sample window: the accumulated snapshot dataset at the date of each Research post.
  • Out-of-sample gate: a pre-committed follow-up at a longer horizon (typically +60 days) before a finding graduates from "suggestive" to "confirmed out-of-sample".
  • Scoring version gates: whenever scoring_version changes (semi-annual minimum — see Methodology → Versioning), the prior version's validation does not automatically transfer. A new in-sample check and out-of-sample gate apply.
  • Regime coverage warning: a finding from a dataset covering a single macro regime is labelled as such and not extrapolated.

Known sources of bias we actively manage

  • Overlapping windows. Adjacent snapshots at a few hours apart share most of a 7d forward return. Treating them as independent inflates significance roughly by sqrt(window_length / sampling_interval). Permutation tests, block bootstrap and N_eff are the three mechanisms used to prevent this.
  • Linearity trap. The composite-return relationship is often non-monotonic (inverted-U, regime-dependent). Pearson r on the full range can be near zero even when a clear signal exists at specific quintiles. Quintile and LOWESS analyses are the correct tools; the correlation table is context, not the headline.
  • Scoring-version drift. Rescoring old snapshots with current scoring_version is available but must decouple snapshot asset / macro state from whatever the live dashboard currently shows — the rescoring pipeline explicitly passes the snapshot's own context rather than reading current globals. A/B comparisons between scoring versions report both v1-stored gaps and v2-rescored gaps side-by-side, with a prominent caveat when the comparison is partial (only a subset of subscores can be re-computed from the exported CSV).
  • Data quality. Snapshots whose data-integrity score falls below a minimum quality threshold are excluded from correlation analysis. The threshold and exclusion count are reported in every Research post.
  • Sample size. Statistical significance can appear between day 29 and day 50 of data accumulation and disappear again — neither direction is a "finding" without out-of-sample confirmation. Published findings document the history.
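To put a number on the overlapping-windows item above: with a 7-day forward window and snapshots taken every 4 hours (the interval is an assumption for the example), the rough inflation factor sqrt(window_length / sampling_interval) is sqrt(168 / 4) = sqrt(42), about 6.5.

```python
import math


def overlap_inflation(window_hours: float, sampling_interval_hours: float) -> float:
    """Rough factor by which overlap inflates apparent significance."""
    if window_hours <= 0 or sampling_interval_hours <= 0:
        raise ValueError("both durations must be positive")
    return math.sqrt(window_hours / sampling_interval_hours)


# 7-day window, hypothetical 4-hour sampling interval
factor = overlap_inflation(7 * 24, 4)
```

Non-overlapping sampling (interval equal to the window) gives a factor of 1, meaning no inflation from this source.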

Reproducibility bounds

Methodology is open; calibration is the product. Concretely:

Reproducible by a third party against their own snapshot CSV:

  • Composite / score vs forward-return correlations (Pearson + Spearman + permutation p-values)
  • Quintile win rates, Sharpe per quintile, quintile edge ratios (MFE / |MAE|)
  • Pre-specified contrasts with circular block bootstrap at multiple block lengths
  • Regime-conditional Spearman matrices per subscore × regime
  • Effective sample size via Bartlett's autocorrelation cutoff
  • Fisher z-transform 95% confidence intervals using N_eff
  • Rank-based tercile contrasts (scale-invariant comparisons)
  • LOWESS smoother with bootstrap 95% band
  • Monotonicity test (Q1 < Q2 < Q3 < Q4 < Q5 ranking)

Not reproducible externally (calibration-dependent):

  • Absolute level of any score (composite, structural, tactical) — depends on unpublished normalisation curves and subscore weights
  • Structural / tactical subfamily decomposition — depends on unpublished subfamily weights
  • policy_hash byte-identity — depends on the serialisation format, which is not published
  • Specific cap values (rules_cap, macro_cap, cycle_cap) — depend on unpublished formula coefficients
  • Live-trading PnL attribution to individual signals

The purpose of this boundary: any quantitative claim the project makes about predictive or economic edge is independently verifiable by a third party from snapshot data alone, without access to the engine's internal coefficients. The coefficients remain the moat.

Editorial standard for findings

Research posts classify every finding into one of four states, using a consistent vocabulary across all publications:

State                      Meaning
suggestive                 Observed in-sample, passes at least one statistical gate, but either sample size is insufficient or out-of-sample window has not closed. Not a confirmed claim.
in-sample confirmed        Passes all declared statistical gates at the current dataset size (permutation + bootstrap CI + block-length sensitivity), out-of-sample window still pending.
out-of-sample confirmed    Holds on data not seen during in-sample analysis. This is the only label used without qualification in product-facing copy.
invalidated / drifted      Previously confirmed at some state, subsequently fails on new data. The original finding is retracted in the same page where it was published, with the failure window documented.

The asymmetry matters: a finding that drops from "confirmed" back to "drifted" is a first-class event, surfaced in Research and linked from any product copy that cited it.

Cross-references