
Most backtesting validators operate as black boxes. You upload a report, you get a score, and somewhere between those two events a calculation happens that you are not allowed to see. This is a problem. A score you cannot interpret is not a validation — it is a label.

Edge Matrix does not work this way. Every test has published methodology, defined thresholds, explicit weights, and documented scoring logic. This article opens the engine on two of the most diagnostic tests in the suite: the Temporal Stability test and the Edge Decay test. They measure different things. One asks whether your EA performed consistently across calendar time; the other asks whether its statistical edge is deteriorating as the backtest progresses. Together they catch failure modes that the aggregate metrics in a standard backtest report do not reveal.

The thresholds, weights, and scoring rules described below come from the actual Edge Matrix calculation engine. The math is the math. If you want to understand what your score means, this is where it comes from.

Two Tests, One Question

Before getting into the mechanics, the conceptual distinction matters. Temporal stability and edge decay sound similar but they are measuring fundamentally different things.

Temporal stability asks: did the strategy produce consistent results across calendar-time sub-periods? It divides the backtest into equal time buckets, evaluates profitability within each bucket, and measures how consistent returns are across buckets. A strategy that made money in four of six sub-periods and lost in two is temporally unstable: it performs differently in different market conditions. A strategy that was consistently profitable, with similar return magnitudes across all periods, scores well regardless of whether that performance is improving or declining over time.

Edge decay asks: is the strategy’s statistical edge trending in a specific direction through the backtest? It measures whether expectancy, profit factor, and win rate are improving, stable, or deteriorating as you move from the first trade to the last. A strategy can be temporally stable — profitable in every sub-period — while simultaneously showing clear edge decay: the profit factor was 2.1 in the first half and 1.3 in the second. The opposite is also possible: a strategy with high temporal instability but no systematic trend, just random variation across sub-periods.

Both matter. Both are tested. Their scores are weighted separately in the Edge Matrix composite score.

Temporal Stability: The Exact Algorithm

The temporal stability test (version 6) begins by dividing the backtest into equal calendar-time buckets. The number of buckets adapts to the trade count: strategies with 500 or more trades get 10 sub-periods, 200 to 499 trades get 8, 100 to 199 trades get 6, and below 100 trades get 5. This adaptive sizing ensures each sub-period contains enough trades to produce meaningful statistics.
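The bucket-count rule and the equal calendar-time split can be sketched in Python as follows. This is a minimal illustration, not the engine's code: the function names are hypothetical, and the sketch assumes the backtest window is given as a start and end `datetime`. Only the trade-count thresholds come from the text above.

```python
from datetime import datetime


def bucket_count(n_trades: int) -> int:
    """Number of calendar sub-periods for a given trade count (thresholds from the article)."""
    if n_trades >= 500:
        return 10
    if n_trades >= 200:
        return 8
    if n_trades >= 100:
        return 6
    return 5


def calendar_buckets(start: datetime, end: datetime, k: int) -> list[tuple[datetime, datetime]]:
    """Split [start, end] into k intervals of equal calendar length."""
    step = (end - start) / k
    # Use the real end for the last edge so rounding never drops calendar time.
    edges = [start + step * i for i in range(k)] + [end]
    return list(zip(edges[:-1], edges[1:]))


k = bucket_count(240)  # 8 sub-periods
periods = calendar_buckets(datetime(2020, 1, 1), datetime(2024, 1, 1), k)
```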

The most important fix in version 6 is sparse period handling. Earlier versions dropped sub-periods with too few trades — which sounds reasonable but creates a significant bias. If a strategy trades infrequently during low-volatility regimes and heavily during trending regimes, the version that drops sparse periods is effectively only evaluating the high-activity windows. The score looks better than it should. Version 6 instead merges sparse buckets forward into adjacent periods, preserving all calendar time in the analysis. When periods are merged, the verdict notes it explicitly so the result is interpretable.
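A minimal sketch of forward merging, assuming each sub-period is represented as a list of trade P&Ls in calendar order. The function name and the `min_trades` threshold are hypothetical, since the article does not publish the cut-off, and folding a sparse final period backward is an assumption (it has no later period to merge into).

```python
def merge_sparse_forward(
    buckets: list[list[float]], min_trades: int = 5
) -> tuple[list[list[float]], int]:
    """Fold sparse sub-periods into the following one instead of dropping them.

    `buckets` holds each sub-period's trade P&Ls in calendar order.
    `min_trades` is a HYPOTHETICAL threshold; the engine's value is not published.
    Returns the merged buckets and the number of merges performed.
    """
    merged: list[list[float]] = []
    carry: list[float] = []
    for bucket in buckets:
        current = carry + bucket  # new list, so the input buckets are never mutated
        if len(current) < min_trades:
            carry = current  # too sparse: push it forward into the next period
        else:
            merged.append(current)
            carry = []
    if carry:
        # ASSUMPTION: a sparse final period has nothing after it, so it folds backward.
        if merged:
            merged[-1] = merged[-1] + carry
        else:
            merged.append(carry)
    return merged, len(buckets) - len(merged)
```

Returning the merge count alongside the buckets lets the verdict report that merging occurred, and no trades are discarded: the merged buckets contain every trade from the input.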

For each sub-period, the test computes the return percentage and maximum drawdown. The return is expressed as profit divided by the starting equity of that period — not a simple return on initial capital, which would distort results for periods that begin after significant drawdown or run-up.
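The per-period return definition is easy to get wrong, so here is a minimal sketch. The function name is hypothetical, and measuring drawdown on the closed-trade equity curve as a percentage of the running peak within the period is an assumption; the article does not say which equity series the engine uses.

```python
def period_stats(start_equity: float, trade_pnls: list[float]) -> tuple[float, float]:
    """Return (return %, max drawdown %) for one sub-period.

    Return is profit divided by the equity at the START of this period.
    ASSUMPTION: drawdown is peak-to-trough on closed-trade equity, as a % of the running peak.
    """
    equity = peak = start_equity
    max_dd = 0.0
    for pnl in trade_pnls:
        equity += pnl
        peak = max(peak, equity)
        max_dd = max(max_dd, (peak - equity) / peak * 100.0)
    return_pct = (equity - start_equity) / start_equity * 100.0
    return return_pct, max_dd
```

For example, a $500 gain in a period that starts with $5,000 of equity is a 10% return, while the same gain measured against an initial $10,000 would show only 5%.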

The final score blends two components at 60% and 40% weights.

The first component, weighted at 60%, is the profitability base score. It scores the number of losing periods on a sliding scale. Zero losing periods, with 90% or more of sub-periods strongly positive (defined as a return above 5%), scores 100. One losing period scores between 48 and 70, depending on what fraction of the other periods are profitable. Two losing periods score between 42 and 58. The score keeps falling from there: five losing periods score 28, and beyond five the score drops by 3 points per additional losing period, with a floor of 15.
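The profitability component can be sketched as below, but treat it as an illustration rather than the engine's formula. The article publishes only the anchor points (100 for zero losses, 48 to 70 for one, 42 to 58 for two, 28 for five, then 3 points per extra loss down to 15). How a score moves within each range, the scores for three and four losing periods, and the zero-loss case below the 90% strong-period threshold are ASSUMPTIONS, marked in the comments. The single `q` parameter is a hypothetical stand-in for the fractions the text mentions.

```python
def profitability_base_score(n_losing: int, q: float) -> float:
    """Sketch of the 60%-weight profitability component.

    q is in [0, 1]. For zero losing periods it stands for the share of periods with
    a return above 5%; otherwise it stands for the share of the other periods that
    are profitable. Lines marked ASSUMPTION interpolate values the article does not publish.
    """
    q = min(max(q, 0.0), 1.0)
    if n_losing == 0:
        if q >= 0.9:
            return 100.0  # article: no losing periods and >= 90% strongly positive
        return 70.0 + 30.0 * q / 0.9  # ASSUMPTION: ramp from 70 up to 100
    if n_losing == 1:
        return 48.0 + 22.0 * q  # article: 48 to 70
    if n_losing == 2:
        return 42.0 + 16.0 * q  # article: 42 to 58
    if n_losing in (3, 4):
        two = 42.0 + 16.0 * q
        return two + (28.0 - two) * (n_losing - 2) / 3  # ASSUMPTION: straight line to 28
    # article: 28 at five losing periods, then -3 per extra period, floor 15
    return max(15.0, 28.0 - 3.0 * (n_losing - 5))
```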

The second component, weighted at 40%, is the return consistency score, the other major improvement in version 6 over its predecessors. It penalizes two patterns that earlier versions ignored: high variance in sub-period returns and profit concentration in a single period.

Variance is measured by the coefficient of variation of sub-period returns — the standard deviation divided by the mean. A coefficient of variation below 0.5 scores 100: every sub-period earned roughly the same amount. A CV between 0.5 and 0.8 scores 88 to 100. Above 2.0 the score falls below 45, and above 3.0 the score floors at around 10. The intuition is correct: a strategy where one period earns 40% and all others earn 2% has high CV and deserves a low consistency score even if it was never technically losing.
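A minimal sketch of the CV scoring, assuming population standard deviation and straight-line interpolation between the published anchor points (100 at a CV of 0.5, 88 at 0.8, 45 at 2.0, about 10 at 3.0). The helper names are hypothetical, and treating a non-positive mean return as the worst case is an assumption, since the coefficient of variation is not meaningful there.

```python
from statistics import mean, pstdev

# (CV, score) anchors quoted in the article. Straight lines between them,
# and holding at 10 above a CV of 3.0, are ASSUMPTIONS.
CV_ANCHORS = [(0.5, 100.0), (0.8, 88.0), (2.0, 45.0), (3.0, 10.0)]


def piecewise(x: float, anchors: list[tuple[float, float]]) -> float:
    """Linear interpolation through sorted (x, y) anchors, clamped at both ends."""
    if x <= anchors[0][0]:
        return anchors[0][1]
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return anchors[-1][1]


def cv_score(period_returns: list[float]) -> float:
    """Score the coefficient of variation (population std / mean) of sub-period returns."""
    m = mean(period_returns)
    if m <= 0:
        return CV_ANCHORS[-1][1]  # ASSUMPTION: CV is not meaningful for a non-positive mean
    return piecewise(pstdev(period_returns) / m, CV_ANCHORS)
```

With one period earning 40% and five earning 2%, the CV is about 1.7, which lands between the published scores of 88 and 45.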

Profit concentration is computed separately as the fraction of total positive returns produced by the single best sub-period. If one period accounts for 30% or less of all positive returns, no penalty applies. Between 30% and 50% concentration, a penalty of up to 10 points reduces the consistency score. Between 50% and 70%, the penalty grows to 25 points. Above 70% — one period carrying more than 70% of all gains — the penalty reaches 45 points. This catches the pattern where a strategy’s backtest result is essentially a single lucky period surrounded by mediocre ones.
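The concentration penalty can be sketched the same way. The 30%, 50%, and 70% anchors come from the text; the straight-line ramp between them and the choice to reach the full 45 points only at 100% concentration are ASSUMPTIONS, as is returning no penalty when no sub-period was profitable. The function name is hypothetical.

```python
def profit_concentration(period_returns: list[float]) -> tuple[float, float]:
    """Return (best period's share of all positive returns, penalty points)."""
    gains = [r for r in period_returns if r > 0]
    if not gains:
        return 0.0, 0.0  # ASSUMPTION: nothing to concentrate; losses are scored elsewhere
    share = max(gains) / sum(gains)
    # 30% -> 0, 50% -> 10, 70% -> 25 follow the article; the linear ramps and
    # 45 points at 100% concentration are ASSUMPTIONS.
    anchors = [(0.3, 0.0), (0.5, 10.0), (0.7, 25.0), (1.0, 45.0)]
    if share <= anchors[0][0]:
        return share, 0.0
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if share <= x1:
            return share, y0 + (y1 - y0) * (share - x0) / (x1 - x0)
    return share, anchors[-1][1]
```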

On top of the blended score, a drawdown concentration penalty applies. If the worst sub-period’s drawdown accounts for more than 40% of all drawdown across the backtest, points are deducted. A single sub-period responsible for more than 80% of total drawdown produces a maximum penalty of 18 points — because it means the strategy’s risk was essentially realized in a single episode, which is a concentration signal regardless of what the aggregate maximum drawdown figure reports.
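Putting the pieces together, a minimal sketch of the final temporal stability score. The 60/40 blend and the 18-point maximum are from the article; the linear drawdown-penalty ramp between 40% and 80% and the clamp to 0 to 100 are ASSUMPTIONS, and the function names are hypothetical. `consistency` is assumed to be the CV score minus the concentration penalty.

```python
def drawdown_concentration_penalty(period_drawdowns: list[float]) -> float:
    """Penalty when one sub-period holds a large share of all sub-period drawdown."""
    total = sum(period_drawdowns)
    if total <= 0:
        return 0.0
    share = max(period_drawdowns) / total
    if share <= 0.4:
        return 0.0  # article: no penalty up to 40%
    # ASSUMPTION: linear ramp from 0 at 40% to the published 18-point maximum at 80%
    return min(18.0, 18.0 * (share - 0.4) / 0.4)


def temporal_stability_score(base: float, consistency: float,
                             period_drawdowns: list[float]) -> float:
    """60/40 blend of the two components, minus the drawdown concentration penalty."""
    blended = 0.6 * base + 0.4 * consistency
    score = blended - drawdown_concentration_penalty(period_drawdowns)
    return max(0.0, min(100.0, score))  # ASSUMPTION: clamp to the 0..100 scale
```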

Edge Decay: Four Components, One Verdict

The edge decay test measures whether the statistical edge of the strategy is trending in a specific direction across the full trade sequence. It uses four components with explicit weights: rolling expectancy trend at 40%, first-half versus second-half expectancy ratio at 30%, profit factor trend across four quartiles at 20%, and win rate trend across four quartiles at 10%.

The rolling expectancy trend is the primary component. A window of 10% of total trades (minimum 30) rolls across the trade sequence with a step of one third of the window size, so consecutive windows overlap. At each position, the expectancy within the window is computed as the weighted average of win and loss outcomes (win rate times average win, minus loss rate times average loss), which equals the mean P&L per trade in that window. A linear regression is then fitted to the resulting series of expectancy values, and its slope is normalized by the mean expectancy across the series. This normalization is critical: a slope of $5/trade on a strategy with a mean expectancy of $100/trade is very different from the same slope on a strategy with a mean expectancy of $8/trade.
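The rolling expectancy series and its normalized slope can be sketched in Python as follows. Expectancy per window is computed as the mean P&L, which equals win rate times average win minus loss rate times average loss. Scaling the window position to run from 0 to 1 (matching the "per unit of the sequence" wording used for the thresholds) and dividing by the absolute mean are ASSUMPTIONS; the function names are hypothetical.

```python
from statistics import mean


def rolling_expectancy(pnls: list[float]) -> list[float]:
    """Mean P&L per trade in overlapping windows across the trade sequence.

    Window = 10% of trades (minimum 30); step = one third of the window.
    """
    n = len(pnls)
    window = max(30, n // 10)
    step = max(1, window // 3)
    return [mean(pnls[i:i + window]) for i in range(0, n - window + 1, step)]


def normalized_slope(series: list[float]) -> float:
    """OLS slope of `series` against position scaled to [0, 1], divided by |mean|.

    ASSUMPTIONS: x runs from 0 (first window) to 1 (last window), and the absolute
    mean is used so that a negative mean does not flip the sign of the trend.
    """
    k = len(series)
    if k < 2:
        raise ValueError("need at least two points to fit a trend")
    xs = [i / (k - 1) for i in range(k)]
    x_bar, y_bar = mean(xs), mean(series)
    if y_bar == 0:
        raise ValueError("mean is zero; normalized slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, series))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return (sxy / sxx) / abs(y_bar)
```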

A normalized slope of 0.5 or above (expectancy growing at half its own mean per unit of the sequence) scores 100. A slope near zero scores 88. A slope of -0.2 scores 78, so a mild downward trend is not immediately disqualifying. The score keeps falling as decay strengthens: a slope of -1.0 scores 48, -1.5 scores 32, -2.0 scores 20, and below -2.0 the score is 10. A strategy whose rolling expectancy deteriorates steadily by twice its own mean from start to finish is telling you something important that the aggregate profit factor never would.
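A sketch of the slope-to-score mapping, using the published points with straight lines between them. The interpolation is an ASSUMPTION (the engine may use a different curve), and the function and constant names are hypothetical.

```python
# (normalized slope, score) points quoted in the article.
# Straight lines between neighbouring points are an ASSUMPTION.
EXPECTANCY_ANCHORS = [(-2.0, 20.0), (-1.5, 32.0), (-1.0, 48.0),
                      (-0.2, 78.0), (0.0, 88.0), (0.5, 100.0)]


def expectancy_trend_score(norm_slope: float) -> float:
    """Map a normalized expectancy slope to a 0-100 component score."""
    if norm_slope < -2.0:
        return 10.0  # article: below -2.0 the score is 10
    if norm_slope >= 0.5:
        return 100.0  # article: 0.5 or above scores 100
    for (x0, y0), (x1, y1) in zip(EXPECTANCY_ANCHORS, EXPECTANCY_ANCHORS[1:]):
        if norm_slope <= x1:
            return y0 + (y1 - y0) * (norm_slope - x0) / (x1 - x0)
    return 100.0
```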

The first-half versus second-half comparison is straightforward: it computes expectancy independently in the first 50% of trades and the second 50%, then takes the ratio. A ratio of 1.2 or above — the second half performing 20% better than the first — scores 100. A ratio of 1.0 scores 90. A ratio of 0.85 scores 78: the second half is only 85% as effective as the first, a mild but real decay signal. Below 0.5 the score drops to 38 and below, and below 0.3 the score floors near 10. One specific case is handled explicitly: if the first half has negative expectancy and the second half is positive, the ratio is set to 2.0 and rewarded — this is the pattern of a strategy that was being tuned or adapted and actually improved, which is worth distinguishing from ordinary decline.
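A minimal sketch of the half-split ratio and its score. The negative-to-positive special case follows the text; treating every other non-positive first half as a ratio of 0 and interpolating linearly between the published points are ASSUMPTIONS, and the names are hypothetical.

```python
from statistics import mean

# (ratio, score) points from the article; linear interpolation between them is an ASSUMPTION.
RATIO_ANCHORS = [(0.3, 10.0), (0.5, 38.0), (0.85, 78.0), (1.0, 90.0), (1.2, 100.0)]


def half_ratio(pnls: list[float]) -> float:
    """Second-half expectancy divided by first-half expectancy."""
    mid = len(pnls) // 2
    first, second = mean(pnls[:mid]), mean(pnls[mid:])
    if first < 0 < second:
        return 2.0  # article: losing first half, winning second half is rewarded
    if first <= 0:
        return 0.0  # ASSUMPTION: other non-positive first halves treated as worst case
    return second / first


def half_ratio_score(ratio: float) -> float:
    """Clamp below 0.3 and above 1.2; interpolate in between."""
    if ratio <= RATIO_ANCHORS[0][0]:
        return RATIO_ANCHORS[0][1]
    for (x0, y0), (x1, y1) in zip(RATIO_ANCHORS, RATIO_ANCHORS[1:]):
        if ratio <= x1:
            return y0 + (y1 - y0) * (ratio - x0) / (x1 - x0)
    return RATIO_ANCHORS[-1][1]
```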

The profit factor trend measures whether profit factor is improving or declining across the four quartiles of the trade sequence. The quartiles are equal-sized trade windows: the first 25% of trades, the next 25%, and so on. A profit factor is computed for each quartile, then a linear regression is fitted across the four values. The slope is normalized the same way as in the expectancy component: raw slope divided by mean profit factor. A zero or positive slope scores 88 to 100. A slope of -0.5 scores 58, a slope of -0.8 scores 42, and below -1.2 the score is 10. This component catches the strategy whose profit factor was 2.3 in Q1, 1.9 in Q2, 1.4 in Q3, and 1.0 in Q4: a clear deterioration that a headline profit factor of around 1.65 (the average of those four values) conceals.
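The quartile profit factor trend can be sketched as below. Capping profit factor for a quartile with no losing trades (`PF_CAP`), scaling quartile positions to run from 0 to 1, and the shape of the score curve outside the published points (-0.5 scores 58, -0.8 scores 42) are ASSUMPTIONS; the names are hypothetical.

```python
from statistics import mean

PF_CAP = 10.0  # HYPOTHETICAL cap for a quartile with no losing trades


def profit_factor(pnls: list[float]) -> float:
    gross_win = sum(p for p in pnls if p > 0)
    gross_loss = -sum(p for p in pnls if p < 0)
    if gross_loss == 0:
        return PF_CAP if gross_win > 0 else 0.0
    return min(gross_win / gross_loss, PF_CAP)


def quartile_profit_factors(pnls: list[float]) -> list[float]:
    """Profit factor of each equal-sized quarter of the trade sequence."""
    n = len(pnls)
    return [profit_factor(pnls[q * n // 4:(q + 1) * n // 4]) for q in range(4)]


def normalized_slope(values: list[float]) -> float:
    """OLS slope against position scaled to [0, 1], divided by the mean (ASSUMED scaling)."""
    k = len(values)
    xs = [i / (k - 1) for i in range(k)]
    x_bar, y_bar = mean(xs), mean(values)
    if y_bar == 0:
        raise ValueError("mean profit factor is zero; normalized slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return (sxy / sxx) / y_bar


# -0.5 -> 58 and -0.8 -> 42 are from the article; 0 -> 88, +0.5 -> 100,
# and reaching 10 exactly at -1.2 are ASSUMPTIONS within the published ranges.
PF_ANCHORS = [(-1.2, 10.0), (-0.8, 42.0), (-0.5, 58.0), (0.0, 88.0), (0.5, 100.0)]


def pf_trend_score(slope: float) -> float:
    if slope <= PF_ANCHORS[0][0]:
        return PF_ANCHORS[0][1]
    for (x0, y0), (x1, y1) in zip(PF_ANCHORS, PF_ANCHORS[1:]):
        if slope <= x1:
            return y0 + (y1 - y0) * (slope - x0) / (x1 - x0)
    return PF_ANCHORS[-1][1]
```

Applied to the quartile profit factors 2.3, 1.9, 1.4, and 1.0 from the example above, `normalized_slope` returns about -0.8 under these assumptions, which maps to a score of 42.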

The win rate trend applies the same quartile approach to win rate. Its weight is only 10% because win rate is a less stable metric than expectancy: it is noisier on small samples and ignores the size of wins and losses. A flat or rising win rate across quartiles scores 88 to 100. A normalized decline of 5% scores 78, and once the normalized decline reaches 25%, the score falls to 38.
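For completeness, a sketch of the win rate component under the same assumptions as a quartile trend: positions scaled to run from 0 to 1 and the slope divided by the mean win rate. How the engine normalizes the win rate slope is not published, so the anchor points below reflect one reading of the text and are ASSUMPTIONS, as are the function names.

```python
from statistics import mean


def quartile_win_rates(pnls: list[float]) -> list[float]:
    """Share of winning trades in each equal-sized quarter (needs at least 4 trades)."""
    n = len(pnls)
    parts = [pnls[q * n // 4:(q + 1) * n // 4] for q in range(4)]
    return [sum(p > 0 for p in part) / len(part) for part in parts]


def normalized_slope(values: list[float]) -> float:
    """OLS slope against position scaled to [0, 1], divided by the mean (ASSUMED scaling)."""
    k = len(values)
    xs = [i / (k - 1) for i in range(k)]
    x_bar, y_bar = mean(xs), mean(values)
    if y_bar == 0:
        raise ValueError("mean win rate is zero; normalized slope is undefined")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return (sxy / sxx) / y_bar


# -0.05 -> 78 and -0.25 -> 38 follow one reading of the article; 0 -> 88,
# +0.25 -> 100, and holding at 38 for steeper declines are ASSUMPTIONS.
WR_ANCHORS = [(-0.25, 38.0), (-0.05, 78.0), (0.0, 88.0), (0.25, 100.0)]


def win_rate_trend_score(slope: float) -> float:
    if slope <= WR_ANCHORS[0][0]:
        return WR_ANCHORS[0][1]
    for (x0, y0), (x1, y1) in zip(WR_ANCHORS, WR_ANCHORS[1:]):
        if slope <= x1:
            return y0 + (y1 - y0) * (slope - x0) / (x1 - x0)
    return WR_ANCHORS[-1][1]
```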

The four component scores combine at their respective weights to produce the final edge decay score. The verdict scale is: 85 and above is STABLE — edge is consistent with no meaningful decay. 70 to 84 is MODERATE — minor decay detected, monitoring advised. 55 to 69 is DECLINING — a clear downward trend is present. 40 to 54 is WEAK — significant deterioration across the backtest. Below 40 is CRITICAL — severe edge decay detected, and the aggregate metrics of this backtest are unreliable guides to future performance.
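The weighted combination and the verdict bands are fully specified above, so this sketch follows them directly. The dictionary keys and function names are hypothetical, and assigning fractional scores such as 84.5 by their lower band boundary is an assumption.

```python
# Component weights published in the article.
WEIGHTS = {
    "expectancy_trend": 0.40,
    "half_ratio": 0.30,
    "pf_trend": 0.20,
    "win_rate_trend": 0.10,
}


def edge_decay_score(components: dict[str, float]) -> float:
    """Weighted sum of the four component scores (each on a 0-100 scale)."""
    return sum(weight * components[name] for name, weight in WEIGHTS.items())


def edge_decay_verdict(score: float) -> str:
    """Map a score to the published verdict bands."""
    if score >= 85:
        return "STABLE"
    if score >= 70:
        return "MODERATE"
    if score >= 55:
        return "DECLINING"
    if score >= 40:
        return "WEAK"
    return "CRITICAL"
```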

Why Two Tests Instead of One

The reason Edge Matrix runs both temporal stability and edge decay as separate tests with separate weights is that they catch different populations of failing strategies.

A strategy optimized on a trending year and a ranging year will often look temporally unstable — its sub-period results vary significantly because different market conditions suit it differently. But it may show no systematic edge decay: the variation is random rather than directional. The temporal stability test flags this. The edge decay test does not.

A strategy that was curve-fit to the first three years of a five-year backtest often shows temporal stability — it was profitable in most sub-periods — while showing clear edge decay: it was genuinely robust in the fitted period and is quietly deteriorating in the unfitted period that comes after it. The edge decay test catches this through the first-half/second-half ratio and the rolling expectancy trend. The temporal stability test may not catch it at all, depending on how the sub-period boundaries fall.

The combination is what matters. A strategy that scores well on both has consistent profitability across calendar time and no evidence of directional performance deterioration. These two properties together represent the kind of temporal robustness that a genuine statistical edge should display — not because every period needs to be good, but because the pattern of good and bad periods should not be systematically biased toward the past.

Other Tests in the Suite

Temporal stability and edge decay are two of 18 tests in Edge Matrix. The others include profit concentration (whether the top 10% of winners account for an excessive fraction of total profit, and whether a single trade dominates the P&L), drawdown analysis (which separately scores maximum drawdown depth, average episode severity, and recovery speed as a days-per-percent-of-drawdown ratio), and streak resilience. The streak resilience test compares the observed maximum losing streak with the statistically expected streak given the strategy’s win rate, using the formula log(N) / log(1/loss_rate), where N is the number of trades, and flags cases where the observed streak exceeds that expectation by more than 50%.
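The streak expectation is a one-line formula; the sketch below wraps it with a longest-losing-run counter and the 50% flag. Function names are hypothetical, and the sketch assumes loss_rate is simply 1 minus the win rate.

```python
import math


def expected_max_losing_streak(n_trades: int, win_rate: float) -> float:
    """log(N) / log(1 / loss_rate), the expectation quoted in the article."""
    loss_rate = 1.0 - win_rate
    if not 0.0 < loss_rate < 1.0:
        raise ValueError("loss rate must be strictly between 0 and 1")
    return math.log(n_trades) / math.log(1.0 / loss_rate)


def max_losing_streak(pnls: list[float]) -> int:
    """Longest run of consecutive losing trades."""
    longest = current = 0
    for p in pnls:
        current = current + 1 if p < 0 else 0
        longest = max(longest, current)
    return longest


def streak_flagged(observed: int, n_trades: int, win_rate: float) -> bool:
    """True when the observed streak exceeds the expectation by more than 50%."""
    return observed > 1.5 * expected_max_losing_streak(n_trades, win_rate)
```

With 1,000 trades and a 50% win rate, the expected longest losing run is log(1000)/log(2), about 10 trades, so an observed run of 15 would be flagged while 14 would not.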

The streak test also measures loss clustering specifically — whether losses occur in consecutive runs more often than the strategy’s win rate would predict through random distribution. This is a regime-dependency signal: if losses cluster together beyond what random chance explains, the strategy is not generating losses uniformly across market conditions. It is generating them in clusters, which is the behavioral signature of a strategy that works in some regimes and fails in others.
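The article does not publish how loss clustering is measured. As a stand-in, and not the engine's method, the sketch below uses a simple technique: compare the probability of a loss immediately after a loss with the overall loss rate. If outcomes were independent, the ratio would be close to 1.0 in large samples; values well above 1.0 mean losses tend to follow losses. The function name and the no-loss fallback are hypothetical.

```python
def loss_clustering_lift(pnls: list[float]) -> float:
    """P(loss | previous trade lost) divided by the overall loss rate."""
    losses = [p < 0 for p in pnls]
    if len(losses) < 2:
        raise ValueError("need at least two trades")
    loss_rate = sum(losses) / len(losses)
    # Outcomes of every trade that immediately follows a losing trade.
    after_loss = [cur for prev, cur in zip(losses, losses[1:]) if prev]
    if loss_rate == 0 or not after_loss:
        return 1.0  # ASSUMPTION: no evidence of clustering without losses to compare
    return (sum(after_loss) / len(after_loss)) / loss_rate
```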

Recovery speed in the streak test contains one counterintuitive scoring rule worth noting. Recovery that is too fast, meaning it takes far fewer trades than the mathematical expectation based on expectancy, is scored lower than recovery at the expected pace. This is a martingale detection signal: strategies that recover from their worst streak in suspiciously few trades often do so because position size increased during the streak. The engine explicitly checks whether recovery efficiency is below 0.3 of the expected value and, if so, assigns a score of 50 rather than the perfect 100 that fast recovery would otherwise receive. Fast is not always good.
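A minimal sketch of the fast-recovery check, assuming the expected recovery length is the streak's loss divided by the per-trade expectancy and that efficiency is actual recovery trades divided by that expectation. Only the 0.3 cut-off and the score of 50 come from the text; the function names are hypothetical, and returning `None` elsewhere reflects that the other scores are not published.

```python
from typing import Optional


def recovery_efficiency(streak_loss: float, expectancy: float, recovery_trades: int) -> float:
    """Actual trades taken to recover, divided by the trades expectancy alone would need."""
    if expectancy <= 0:
        raise ValueError("expected recovery is undefined without positive expectancy")
    expected_trades = abs(streak_loss) / expectancy
    return recovery_trades / expected_trades


def fast_recovery_score(efficiency: float) -> Optional[float]:
    """Score 50 when recovery is faster than 0.3x the expected pace (possible martingale).

    Scores for other efficiencies are not published, so None is returned for them.
    """
    return 50.0 if efficiency < 0.3 else None


# A $1,000 losing streak with $50 expectancy should take about 20 trades to recover.
eff = recovery_efficiency(1_000.0, 50.0, recovery_trades=4)  # 0.2
score = fast_recovery_score(eff)  # 50.0: recovered suspiciously fast
```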

What Transparency Actually Means

Publishing the methodology of a validation tool is not a common practice. It creates obvious risks — someone could try to optimize their strategy specifically against the scoring function rather than for genuine robustness. But this concern is less serious in practice than it appears. Gaming a score that measures temporal consistency, edge stability, and profit distribution simultaneously requires a strategy that actually has those properties. The attempt to game the score produces the property being measured.

The more important reason for transparency is trust calibration. When Edge Matrix tells you a strategy scores 43 on temporal stability with a verdict of WEAK, you should be able to understand exactly what that means: the strategy had losing sub-periods beyond the threshold where significant penalty applies, its return CV indicates high variance across sub-periods, and possibly one period is carrying a disproportionate share of the positive returns. That is specific, actionable information. It tells you what to look at and what to worry about.

A score from a black box tells you a number. A score with published methodology tells you a diagnosis.

Edge Matrix launches in late April 2026 at ergodiclabs.co. The free Monte Carlo analyzer at ergodiclabs.co/monte-carlo runs the first of the 18 tests now, with no account required.


Risk Disclosure

Edge Matrix is a statistical analysis tool. It evaluates historical backtest data using quantitative methods but does not predict future performance or provide investment advice. Edge Matrix does not recommend whether to deploy, modify, or discontinue any trading strategy. All trading involves substantial risk, including the risk of loss. Past performance, whether analyzed or validated, is not indicative of future results. Users are solely responsible for their trading and investment decisions.

Trading foreign exchange carries a high level of risk that may not be suitable for all investors. Past performance is not indicative of future results. The high degree of leverage can work against you as well as for you.