Backtest

This page describes the framework RiskState uses to validate whether its composite, structural and tactical scores carry predictive information. It does not publish current results. Dated findings, out-of-sample reports and version comparisons live in Research.

The goal of this page is falsifiability: a reader should come away knowing exactly what question each test answers, what could make a result invalid, and how a third party could reproduce the analysis against their own snapshot dataset.

What we measure

Every API response is captured as a snapshot: the full input state (prices, indicators, caps, composite, structural/tactical scores, regime, volatility, policy level) plus a timestamp and a policy_hash. Snapshots are stored server-side.

A forward return is attached to each snapshot after the fact, computed from the actual market outcome at a fixed horizon:

Window   Horizon    Purpose
24h      ~1 day     Tactical signal edge
72h      ~3 days    Short-term structural tilt
7d       ~1 week    Primary validation horizon
30d      ~1 month   Long-horizon structural validation (slow to accumulate)

Forward returns are filled from public price data (klines), both browser-side and server-side, so the process runs independently of whether the dashboard is open.

Alongside returns, every snapshot gets a max adverse excursion (MAE) and max favorable excursion (MFE) — the worst drawdown and best gain observed intra-window — enabling quintile edge-ratio analysis beyond raw directional returns.
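As a sketch of this bookkeeping, filling one snapshot from the klines covering its window might look like the following. The function name and kline layout are our assumptions, not the production pipeline:

```python
import numpy as np

def fill_forward(snapshot_price, window_highs, window_lows, window_closes):
    """Forward return, MAE and MFE for one snapshot, from the klines
    between the snapshot timestamp and the horizon (all as fractions)."""
    fwd_return = window_closes[-1] / snapshot_price - 1.0
    mae = window_lows.min() / snapshot_price - 1.0   # worst intra-window drawdown
    mfe = window_highs.max() / snapshot_price - 1.0  # best intra-window gain
    return fwd_return, mae, mfe
```

MAE and MFE come from intra-window lows and highs, not closes, so a snapshot that ends flat can still show a deep adverse excursion.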

What we are answering

Three distinct questions, tested independently:

  1. Does the composite score carry predictive information at a given horizon? — correlation analysis (Pearson + Spearman + permutation p-values), plus non-linear quintile analysis.
  2. Is the relationship monotonic? — a full Q1 < Q2 < Q3 < Q4 < Q5 ranking check, not just "do extremes beat each other."
  3. Does the signal survive when controlled for asset beta and sample overlap? — Sharpe per quintile (not raw returns), block bootstrap for time-series dependency, and rank-based tercile contrast for scale invariance.
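Question 2's full ranking check can be sketched as follows, with quintiles cut at equal-population score quantiles (the helper name is illustrative):

```python
import numpy as np

def quintile_means_monotonic(scores, returns):
    """True iff mean forward returns rise strictly across score quintiles
    (Q1 < Q2 < Q3 < Q4 < Q5), not merely Q5 > Q1."""
    scores, returns = np.asarray(scores), np.asarray(returns)
    edges = np.quantile(scores, [0.2, 0.4, 0.6, 0.8])
    q = np.searchsorted(edges, scores)          # quintile index 0..4
    means = [returns[q == i].mean() for i in range(5)]
    return all(a < b for a, b in zip(means, means[1:]))
```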

Statistical tooling

Permutation test

For every correlation between composite (or any subscore) and forward returns, a permutation p-value is computed alongside the parametric (Fisher z) p-value. Composite-return pairs are shuffled 2000 times; the permutation p is the fraction of shuffled correlations with |r_shuffled| ≥ |r_real|.

A correlation is only flagged significant when both parametric and permutation tests pass. The reason: overlapping windows (adjacent snapshots share most of their forward-return period) make parametric p-values artificially small. Permutation tests are robust to this under the null.
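The shuffle loop is simple enough to sketch directly; the function name is ours and the production suite may differ in detail:

```python
import numpy as np

def permutation_pvalue(scores, fwd_returns, n_perm=2000, seed=0):
    """Fraction of shuffles whose |r| >= the observed |r| (two-sided)."""
    rng = np.random.default_rng(seed)
    r_real = abs(np.corrcoef(scores, fwd_returns)[0, 1])
    hits = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(scores)      # break the score-return pairing
        hits += abs(np.corrcoef(shuffled, fwd_returns)[0, 1]) >= r_real
    return hits / n_perm
```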

Pre-specified contrast with circular block bootstrap

The headline validation test is a pre-specified contrast: (Q2∪Q3 win rate) − (Q4∪Q5 win rate) at the 7-day horizon, pooled BTC+ETH. The contrast is specified before looking at the data to prevent fishing across multiple quintile splits.

95% confidence intervals come from a circular block bootstrap (Politis & Romano, 1994). Block length is chosen relative to the data's autocorrelation structure and tested at multiple lengths (14 / 21 / 28 days) for sensitivity. A result is considered confirmed only when the 95% CI excludes zero across all block lengths.

Small-sample guards: the bootstrap is skipped when block length ≥ n, and a warning is raised when block length is more than half of n.
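A minimal sketch of the resampling scheme, including the hard block-length guard; parameter names are ours:

```python
import numpy as np

def circular_block_bootstrap_ci(x, block_len, n_boot=2000, seed=0, stat=np.mean):
    """95% percentile CI for `stat` under a circular block bootstrap:
    draw wrap-around blocks of length `block_len` until n points are
    collected, recompute the statistic, repeat."""
    x = np.asarray(x)
    n = len(x)
    if block_len >= n:                           # small-sample guard: skip
        raise ValueError("block length must be smaller than n")
    rng = np.random.default_rng(seed)
    n_blocks = -(-n // block_len)                # ceil(n / block_len)
    offsets = np.arange(block_len)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n, size=n_blocks)
        idx = ((starts[:, None] + offsets) % n).ravel()[:n]  # wrap around
        stats[b] = stat(x[idx])
    return np.percentile(stats, [2.5, 97.5])
```

Running this at block lengths 14, 21 and 28 and requiring the CI to exclude zero at every length is the sensitivity check described above.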

Effective sample size (Bartlett)

Time-series autocorrelation means nominal n overstates the number of independent observations. N_eff is computed via Bartlett's autocorrelation-adjusted formula (positive-only ρ summation, capped at n) and used for degrees of freedom and Fisher z standard errors. Confidence intervals widen honestly — they do not inherit the false precision of nominal n.
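Bartlett's adjustment as described can be sketched as:

```python
import numpy as np

def bartlett_neff(x):
    """Effective sample size: n / (1 + 2 * sum of autocorrelations),
    with positive-only rho summation truncated at the first non-positive
    lag, and the result capped at n."""
    x = np.asarray(x, float)
    n = len(x)
    xc = x - x.mean()
    denom = xc @ xc
    rho_sum = 0.0
    for k in range(1, n - 1):
        rho = (xc[:-k] @ xc[k:]) / denom
        if rho <= 0.0:          # positive-only summation
            break
        rho_sum += rho
    return min(float(n), n / (1.0 + 2.0 * rho_sum))
```

White noise keeps N_eff near n; a strongly autocorrelated series (overlapping forward-return windows behave this way) collapses it.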

Sharpe per quintile

Raw quintile returns are confounded by per-asset volatility differences (BTC vs ETH do not have the same risk). The primary comparison across assets is Sharpe per quintile, removing the beta confound. Published Research posts default to Sharpe unless raw returns are specifically informative for the point being made.
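A sketch of the per-quintile Sharpe computation (helper name illustrative); dividing by the in-quintile standard deviation is what removes the volatility confound:

```python
import numpy as np

def sharpe_per_quintile(scores, returns):
    """Mean / std of forward returns inside each score quintile,
    comparable across assets with different volatility."""
    scores, returns = np.asarray(scores), np.asarray(returns)
    edges = np.quantile(scores, [0.2, 0.4, 0.6, 0.8])
    q = np.searchsorted(edges, scores)
    return [returns[q == i].mean() / returns[q == i].std(ddof=1)
            for i in range(5)]
```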

Fisher z-transform 95% CI

All correlation tables carry 95% CIs via Fisher z-transform, using N_eff rather than nominal n. CIs wider than 0.6 are flagged — they carry little information even when the point estimate looks strong.
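The transform itself is short; a sketch using N_eff in place of nominal n:

```python
import math

def fisher_z_ci(r, n_eff, z_crit=1.96):
    """95% CI for a correlation via the Fisher z-transform, using the
    effective sample size rather than nominal n."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n_eff - 3.0)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
```

Because N_eff < n under autocorrelation, the same point estimate yields a wider (more honest) interval.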

Rank-based tercile contrast

A second validation layer, robust to scale changes: independently compute top vs bottom tercile win rates by rank (not by score level). Used specifically when comparing scoring versions (v1 vs v2, etc.), because a monotone transformation of the composite may shift quintile boundaries while preserving the signal. Rank-based tests tell us whether the information content changed, independently of the distribution.
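A sketch of the rank-based contrast, showing why a monotone rescoring cannot change the result (the ranks, and therefore the terciles, are identical):

```python
import numpy as np

def tercile_contrast_by_rank(scores, returns):
    """Top-tercile minus bottom-tercile win rate, with terciles cut by
    score *rank* rather than score level."""
    scores, returns = np.asarray(scores), np.asarray(returns)
    n = len(scores)
    rank = np.argsort(np.argsort(scores))        # 0 = lowest score
    bottom = returns[rank < n // 3]
    top = returns[rank >= n - n // 3]
    return (top > 0).mean() - (bottom > 0).mean()
```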

LOWESS smoother

For visual inspection of the score → P(return > 0) relationship, a LOWESS smoother (tricube kernel) with bootstrap 95% band renders the true shape without imposing an arbitrary bin count. This is how we discover non-monotonic patterns (inverted-U, regime-dependent) that Pearson r would miss.
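A compact tricube LOWESS (the local linear variant, without the bootstrap band) can be sketched as follows; the neighbourhood fraction and names are our assumptions:

```python
import numpy as np

def lowess_tricube(x, y, frac=0.3):
    """Locally weighted linear fit with a tricube kernel, evaluated at
    the sorted sample points, so no arbitrary bin count is imposed."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x_eval = np.sort(x)
    k = max(2, int(frac * len(x)))               # neighbourhood size
    fitted = np.empty(len(x_eval))
    for i, x0 in enumerate(x_eval):
        d = np.abs(x - x0)
        h = max(np.sort(d)[k - 1], 1e-12)        # local bandwidth
        w = np.clip(1.0 - (d / h) ** 3, 0.0, None) ** 3   # tricube weights
        sw = w.sum()
        xm, ym = (w * x).sum() / sw, (w * y).sum() / sw
        b = ((w * (x - xm) * (y - ym)).sum()
             / max((w * (x - xm) ** 2).sum(), 1e-12))
        fitted[i] = ym + b * (x0 - xm)           # local line evaluated at x0
    return x_eval, fitted
```

Feeding it score vs. the indicator 1{return > 0} traces the empirical P(return > 0) curve, which is where inverted-U shapes become visible.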

Reproducibility

The same statistical suite that runs inside the dashboard also runs offline as a standalone script, deterministic under a fixed seed, against an exported CSV of snapshots.

  • Deterministic seed — every randomised step (permutation, bootstrap resample, subsample) is seeded. Re-running the validator against the same CSV produces bit-identical output.
  • Open framework, closed calibration — the CSV export includes the raw indicators needed to re-run correlation and quintile analysis. The scoring coefficients themselves (weights, normalisation ranges, thresholds) are not published.
  • Block length sensitivity — every bootstrap result is reported at multiple block lengths; a finding that only works at one block length is flagged as non-robust.
  • Replications welcome — reviewers are invited to replicate the analysis against their own snapshot datasets. The methodology is open; calibration is not.
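The determinism guarantee reduces to one discipline: every random draw goes through an explicitly seeded generator. A minimal illustration (the helper name is ours, not the validator's):

```python
import numpy as np

def seeded_shuffle(values, seed=0):
    """All randomised steps draw from one seeded Generator, so a re-run
    against the same input is bit-identical."""
    rng = np.random.default_rng(seed)
    return rng.permutation(values)
```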

Validation maturity

Validation is an ongoing process, not a one-shot claim. Our current posture:

  • In-sample window: the accumulated snapshot dataset at the date of each Research post.
  • Out-of-sample gate: a pre-committed follow-up at a longer horizon (typically +60 days) before a finding graduates from "suggestive" to "confirmed out-of-sample".
  • Scoring version gates: whenever scoring_version changes (quarterly maximum — see Methodology → Versioning), the prior version's validation does not automatically transfer. A new in-sample check and out-of-sample gate apply.
  • Regime coverage warning: a finding from a dataset covering a single macro regime is labelled as such and not extrapolated.

Known sources of bias we actively manage

  • Overlapping windows. Adjacent snapshots taken a few hours apart share most of a 7d forward return. Treating them as independent inflates significance roughly by sqrt(window_length / sampling_interval). Permutation tests, block bootstrap and N_eff are the three mechanisms used to prevent this.
  • Linearity trap. The composite-return relationship is often non-monotonic (inverted-U, regime-dependent). Pearson r on the full range can be near zero even when a clear signal exists at specific quintiles. Quintile and LOWESS analyses are the correct tools; the correlation table is context, not the headline.
  • Scoring-version drift. Rescoring old snapshots with current scoring_version is available but must decouple snapshot asset / macro state from whatever the live dashboard currently shows — the rescoring pipeline explicitly passes the snapshot's own context rather than reading current globals. A/B comparisons between scoring versions report both v1-stored gaps and v2-rescored gaps side-by-side, with a prominent caveat when the comparison is partial (only a subset of subscores can be re-computed from the exported CSV).
  • Data quality. Snapshots whose data-integrity score falls below a minimum quality threshold are excluded from correlation analysis. The threshold and exclusion count are reported in every Research post.
  • Sample size. Statistical significance can appear between day 29 and day 50 of data accumulation and disappear again — neither direction is a "finding" without out-of-sample confirmation. Published findings document this history.

What we do not publish

  • Specific subscore weights, normalisation coefficients or threshold values.
  • The exact content of the CSV export beyond the documented column list.
  • The seed used for any given Research post (only that a seed is used and results are reproducible with it).
  • Any live-trading PnL attribution to individual signals.

The purpose of this discipline is simple: the methodology is auditable, the calibration is the product.

Cross-references