Why Your Maximum Drawdown Is Almost Always a Lie, And the One Number You Should Use Instead

The maximum drawdown reported in a backtest is the largest peak-to-trough decline that occurred during the specific historical period tested. It is a fact about the past. It is not a risk metric for the future. The distinction matters more than most traders realize, and the academic literature on drawdown risk has been making this argument with increasing precision since at least 2003.

This article explains exactly why historical maximum drawdown systematically underestimates the drawdown a strategy will experience in live trading, what the correct framework for measuring drawdown risk actually is, and how to use Monte Carlo simulation to derive the one number that should replace maximum drawdown in every position sizing and risk management decision you make.

Maximum Drawdown Is a Single Path Through a Stochastic Process

A trading strategy generates a sequence of trade outcomes. Those outcomes have some underlying distribution — a mean, a standard deviation, some degree of skewness and kurtosis — but the specific sequence in which they arrive is random. The maximum drawdown you observe in a backtest is the worst peak-to-trough decline along the one specific path those trades happened to take through time.

If the same 300 trades had arrived in a different order — which they could have, because the market could have presented the same opportunities in any sequence — the maximum drawdown would have been different. Not slightly different. Potentially dramatically different. Burghardt et al. (2003), in one of the foundational papers on drawdown distributions, demonstrated through Monte Carlo simulation that maximum drawdown distributions are highly sensitive to track record length: increases in the length of the track record shift the entire distribution to the right. More trades, more opportunities for a worst-case sequence to emerge. The maximum drawdown from a 100-trade backtest is not the same kind of number as the maximum drawdown from a 1,000-trade backtest, even for strategies with identical statistical properties.
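
The track-record-length effect is easy to reproduce. The sketch below is an illustration under assumed numbers, not a reconstruction of the Burghardt et al. setup: it draws hypothetical independent trade results (normal, mean $20, standard deviation $100, on a $10,000 starting balance) and compares the median maximum drawdown over the first 100, 300, and 1,000 trades of the same simulated paths.

```python
import numpy as np

def max_drawdowns(pnl_paths, start_equity=10_000.0):
    """Max peak-to-trough decline per path, as a fraction of the running peak."""
    n_paths = pnl_paths.shape[0]
    equity = start_equity + np.cumsum(pnl_paths, axis=1)
    # Prepend the starting balance so a loss on the very first trade counts.
    equity = np.hstack([np.full((n_paths, 1), start_equity), equity])
    peaks = np.maximum.accumulate(equity, axis=1)
    return ((peaks - equity) / peaks).max(axis=1)

# Hypothetical strategy: i.i.d. normal trade P&L (all parameters are assumptions).
rng = np.random.default_rng(0)
paths = rng.normal(loc=20.0, scale=100.0, size=(2_000, 1_000))

# Same paths, truncated to different track-record lengths.
median_by_length = {
    n: float(np.median(max_drawdowns(paths[:, :n]))) for n in (100, 300, 1_000)
}
for n, dd in median_by_length.items():
    print(f"{n:>5} trades: median max drawdown = {dd:.1%}")
```

Because each longer window contains the shorter one, the maximum drawdown of a path can only stay the same or grow as trades are added, which is why the whole distribution moves to the right.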

This is the core problem. When a trader looks at a backtest showing 8% maximum drawdown and sizes their position to risk 2% of capital per trade, they are implicitly assuming that 8% is a reliable estimate of the worst case they will face. It is not. It is the worst case that happened to occur in the specific sequence of trades in the specific historical window tested. The actual worst case — the one that could emerge from an unlucky but entirely plausible ordering of those same trades — is almost always larger.

What the Academic Literature Actually Shows

The research on drawdown as a risk measure is more developed than most retail and semi-professional traders know. Three bodies of work are particularly important.

Chekhlov, Uryasev, and Zabarankin (2003, 2005) formalized what they called Conditional Drawdown-at-Risk (CDaR), a family of risk measures defined on the portfolio drawdown curve rather than on individual return observations. Their key insight was that maximum drawdown, as a risk measure, has two fundamental problems. First, it is not a stable statistical estimator: it depends on a single extreme observation from a single path, making it highly sensitive to the specific data used. Second, it ignores the shape of the drawdown distribution, treating a strategy that occasionally hits 10% drawdown the same as a strategy that hits 10% in nearly every simulation. CDaR addresses both problems by computing the average of the worst drawdowns above a given threshold across all observed periods, making it a coherent risk measure in the technical sense: it is convex and positively homogeneous, meaning it supports meaningful portfolio optimization.
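
To make the CDaR idea concrete, here is a minimal sketch of a sample estimator: build the drawdown curve from an equity series, then average the worst (1 − α) share of its observations. This is a simplification for illustration (the formal definition in Chekhlov et al. handles the observation sitting exactly at the threshold more carefully), and the function names and the toy equity series are assumptions, not anything taken from the papers.

```python
import numpy as np

def drawdown_curve(equity):
    """Fractional decline from the running peak at every observation."""
    equity = np.asarray(equity, dtype=float)
    peaks = np.maximum.accumulate(equity)
    return (peaks - equity) / peaks

def cdar(equity, alpha=0.95):
    """Simplified CDaR: mean of the worst (1 - alpha) share of drawdown observations."""
    dd = np.sort(drawdown_curve(equity))
    # Always keep at least one observation in the tail.
    k = max(1, int(np.ceil((1.0 - alpha) * dd.size)))
    return float(dd[-k:].mean())

# Toy equity curve (hypothetical numbers).
equity = [100.0, 110.0, 99.0, 104.5, 121.0, 108.9, 115.0]

print(f"average drawdown (alpha=0): {cdar(equity, 0.0):.2%}")
print(f"CDaR at alpha=0.80:         {cdar(equity, 0.8):.2%}")
print(f"maximum drawdown:           {drawdown_curve(equity).max():.2%}")
```

At α = 0 the measure collapses to the average drawdown, and as α approaches 1 it approaches the maximum drawdown, so the family interpolates between the two.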

Goldberg and Mahmoud (2016, 2017), building on the CDaR framework at UC Berkeley’s Center for Risk and Asset Management, developed Conditional Expected Drawdown (CED), defined as the tail mean of the maximum drawdown distribution. CED at confidence level α is the expected maximum drawdown given that the maximum drawdown exceeds its α-quantile across all simulated paths. Their work demonstrated something particularly important for practitioners: CED is sensitive to serial correlation in returns in a way that Sharpe ratio, Value at Risk, and Expected Shortfall are not. A strategy with positively autocorrelated returns, where winners tend to cluster and losers tend to cluster, will have a substantially higher CED than a strategy with identical mean and variance but independent returns. The Burghardt/Harding research (2012) supports this empirically: the expected maximum drawdown with positive autocorrelation is dramatically higher than with independent returns, even when all other statistical properties are held constant.
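
The autocorrelation effect can be checked with a small simulation. The sketch below is an assumption-laden illustration, not the authors' procedure: it models returns as an AR(1) process, scales the shocks so that the per-period mean and variance are the same with and without autocorrelation, and estimates CED as the mean of the simulated maximum drawdowns at or above their 95% quantile. The parameters (φ = 0.3, 250 periods, 1% volatility) are hypothetical.

```python
import numpy as np

def simulate_ar1_returns(n_paths, n_steps, phi, mu=0.0005, sigma=0.01, seed=1):
    """AR(1) returns with unconditional mean mu and standard deviation sigma."""
    rng = np.random.default_rng(seed)
    shock_sd = sigma * np.sqrt(1.0 - phi**2)  # keeps the per-period variance fixed
    r = np.empty((n_paths, n_steps))
    r[:, 0] = mu + rng.normal(0.0, sigma, n_paths)
    for t in range(1, n_steps):
        r[:, t] = mu + phi * (r[:, t - 1] - mu) + rng.normal(0.0, shock_sd, n_paths)
    return r

def max_drawdowns(returns):
    """Max fractional peak-to-trough decline of the compounded equity, per path."""
    equity = np.cumprod(1.0 + returns, axis=1)
    equity = np.hstack([np.ones((returns.shape[0], 1)), equity])
    peaks = np.maximum.accumulate(equity, axis=1)
    return ((peaks - equity) / peaks).max(axis=1)

def ced(max_dds, alpha=0.95):
    """Tail mean of the maximum drawdown distribution beyond its alpha-quantile."""
    threshold = np.quantile(max_dds, alpha)
    return float(max_dds[max_dds >= threshold].mean())

ced_iid = ced(max_drawdowns(simulate_ar1_returns(2_000, 250, phi=0.0)))
ced_clustered = ced(max_drawdowns(simulate_ar1_returns(2_000, 250, phi=0.3)))
print(f"CED(95%) independent returns:    {ced_iid:.1%}")
print(f"CED(95%) autocorrelated returns: {ced_clustered:.1%}")
```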

Casati and Tabachnik (2012/2013) studied the statistical properties of maximum drawdown directly, using Monte Carlo simulation across stochastic processes with the characteristics of real financial return distributions: non-independent increments, excess kurtosis, and skewness. Their key finding: the maximum drawdown distribution for realistic financial processes differs substantially from what the Brownian motion (normal distribution) framework predicts. Heavy tails in the return distribution produce heavy tails in the drawdown distribution. The maximum drawdown you observe in a finite sample is a single draw from this distribution. Because the distribution is right-skewed, a typical draw falls below its mean and far below its tail, so the observed maximum drawdown tends to understate the strategy’s true drawdown risk.

A 2017 paper on Maximum Drawdown at Risk (MDaR), published on ScienceDirect and validated across eight world stock indices using Monte Carlo methodology, concluded that MDaR-based estimates provide more accurate market risk control than historical maximum drawdown. The methodology it proposes, fitting a model to return dynamics and then simulating the drawdown distribution, is what Monte Carlo permutation of backtest trades approximates for algorithmic trading strategies.

The Sequence Problem: A Concrete Demonstration

Consider a strategy with 200 trades, a 55% win rate, an average win of $80, and an average loss of $60. This is a genuinely positive-expectancy system: expected profit per trade is 0.55 × $80 − 0.45 × $60 = $17, and the profit factor is (0.55 × 80) / (0.45 × 60) ≈ 1.63. In the specific historical sequence in which those trades occurred, the maximum drawdown was 12%.

Now reshuffle those 200 trades randomly 5,000 times and compute the maximum drawdown of each reshuffled sequence. What does the distribution look like?

The distribution will not be centered on 12%. It will be centered somewhere higher, and it will have a long right tail. Depending on the strategy’s specific return distribution, the 50th percentile (median) drawdown across the 5,000 simulations might be 14–18%. The 95th percentile — meaning 95% of all simulated sequences produce a drawdown below this level — might be 22–30%. The 99th percentile might be 35–45%.

The 12% observed in the backtest is not wrong. It happened. But it happened along one path. The question is not what happened — the question is what the strategy’s drawdown risk actually looks like across the full distribution of possible paths. That distribution, not the single observed point, is what determines whether your position sizing is appropriate for the risk you are actually taking.
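
The reshuffling experiment described above can be sketched in a few lines. The trade list here is a stylized stand-in (110 wins of exactly $80 and 90 losses of exactly $60) and the $10,000 starting balance is an assumption, so the percentages it prints will not match the illustrative figures quoted in this section. What it shows is the mechanics: same trades, 5,000 random orderings, one maximum drawdown per ordering.

```python
import numpy as np

def max_drawdown(pnl, start_equity=10_000.0):
    """Largest peak-to-trough decline as a fraction of the running peak equity."""
    equity = start_equity + np.concatenate(([0.0], np.cumsum(pnl)))
    peaks = np.maximum.accumulate(equity)
    return float(((peaks - equity) / peaks).max())

# Stylized trade list: 55% of 200 trades win $80, the rest lose $60.
trades = np.array([80.0] * 110 + [-60.0] * 90)

rng = np.random.default_rng(42)
sim_dd = np.array([max_drawdown(rng.permutation(trades)) for _ in range(5_000)])

p50, p95, p99 = np.percentile(sim_dd, [50, 95, 99])
print(f"median: {p50:.1%}   P95: {p95:.1%}   P99: {p99:.1%}")
```

Every reshuffled path ends at the same final equity, because the set of trades never changes. Only the route taken to get there differs, and the drawdown depends entirely on that route.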

Burghardt’s research established that this gap between observed maximum drawdown and the true drawdown distribution is not random noise — it follows predictable patterns. Longer track records produce larger expected maximum drawdowns even for identical strategies, because longer records give more opportunities for adverse sequences to materialize. Higher volatility of returns increases the entire distribution, not just the mean. And positive autocorrelation — where the strategy tends to win in clusters and lose in clusters — dramatically inflates the drawdown distribution relative to what the mean and variance alone would predict.

Why Backtest Maximum Drawdown Is Biased Downward

There is an additional source of downward bias beyond sequence dependency. Backtest maximum drawdown is typically measured on a trade-by-trade basis using closing prices, not on a tick-by-tick or bar-by-bar basis using intrabar extremes. A trade that opened at 1.1000, moved against the position to 1.0850 intrabar, then closed at 1.0950 for a small loss records a small loss in the backtest. The intrabar excursion — the additional 100 pips of drawdown that the actual account experienced before the close — is invisible.
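
The arithmetic of that example, written out as a small sketch. It assumes a long position and a standard four-decimal pip size of 0.0001, neither of which the example states explicitly, and the helper name is hypothetical.

```python
PIP = 0.0001  # pip size for a four-decimal FX quote (assumption)

def excursion_pips(entry, worst_price, close_price, pip=PIP):
    """For a long trade, return (closing loss, max adverse excursion, hidden gap) in pips."""
    closing_loss = round((entry - close_price) / pip)
    max_adverse = round((entry - worst_price) / pip)
    return closing_loss, max_adverse, max_adverse - closing_loss

closing_loss, max_adverse, hidden = excursion_pips(1.1000, 1.0850, 1.0950)
print(closing_loss, max_adverse, hidden)  # 50 150 100
```

The backtest records the 50-pip closing loss. The account lived through a 150-pip adverse move.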

For strategies tested on H1 or D1 data, this intrabar drawdown gap can be substantial. A daily bar with a 1% range covers an entire trading session, which in spot FX means close to 24 hours of price movement. The maximum adverse excursion during those hours might be double the closing loss. This means the maximum drawdown in the backtest report understates the drawdown the account actually experienced during similar historical conditions, independent of any sequence effects.

The combination of sequence dependency bias and intrabar excursion bias means that historical maximum drawdown has two systematic reasons to understate true risk. Finer-grained data can narrow the intrabar gap, but the sequence effect cannot be corrected with better data or more sophisticated software. It is a fundamental property of measuring drawdown on a finite historical sample from a stochastic process.

The Correct Framework: P95 Drawdown from Monte Carlo

The practical solution that emerges from the academic literature is to replace the single historical maximum drawdown with a quantile of the Monte Carlo drawdown distribution. The 95th percentile drawdown — P95 — means that 95% of simulated trade sequences produce a maximum drawdown below this level. It is the drawdown you should survive if things go reasonably badly, covering all but the worst 5% of plausible outcomes given your strategy’s return distribution.

P95 is not a worst-case scenario. It is a statistically grounded planning figure. It answers the question: given that my strategy has the win rate, average win, average loss, and trade distribution I have observed historically, what maximum drawdown should I expect to survive with 95% confidence across all possible orderings of those trades?

The ratio of P95 to historical maximum drawdown is one of the most informative single numbers in quantitative strategy evaluation. For a strategy with genuinely sequence-independent returns and a long historical sample, this ratio might be 1.5 to 2.0 — the P95 drawdown is 50% to 100% larger than what the backtest reported. For strategies with short track records, high volatility, or positive autocorrelation in returns, the ratio can exceed 3.0. When the P95 drawdown is three times the historical maximum, a trader who sized positions based on the historical figure is running three times the risk they believed they were running.
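
A minimal sketch of how the ratio could be computed from a list of trade results. The function name `p95_ratio`, the $10,000 starting balance, and the trade history are all hypothetical. The history is deliberately contrived so that wins and losses alternate and losses never cluster, which makes the historical figure unrealistically flattering and the ratio correspondingly large.

```python
import numpy as np

def max_drawdown(pnl, start_equity=10_000.0):
    """Largest peak-to-trough decline as a fraction of the running peak equity."""
    equity = start_equity + np.concatenate(([0.0], np.cumsum(pnl)))
    peaks = np.maximum.accumulate(equity)
    return float(((peaks - equity) / peaks).max())

def p95_ratio(trades, n_sims=2_000, seed=0):
    """Return (historical max DD, P95 of permuted max DDs, P95 / historical)."""
    rng = np.random.default_rng(seed)
    historical = max_drawdown(trades)  # must be > 0 for the ratio to be defined
    sims = np.array([max_drawdown(rng.permutation(trades)) for _ in range(n_sims)])
    p95 = float(np.percentile(sims, 95))
    return historical, p95, p95 / historical

# Contrived "lucky" history: strict win/loss alternation, then a run of wins.
lucky_history = np.array([80.0, -60.0] * 90 + [80.0] * 20)
historical, p95, ratio = p95_ratio(lucky_history)
print(f"historical: {historical:.2%}   P95: {p95:.2%}   ratio: {ratio:.1f}")
```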

The Goldberg and Mahmoud CED framework formalizes this intuition rigorously, with one difference worth keeping in mind. P95 is a quantile: the drawdown level that 95% of simulated paths stay below. CED at the 95% level is the expected maximum drawdown conditional on being in the worst 5% of outcomes, so it is the tail mean beyond P95 and is always at least as large. The two relate to each other the way Value at Risk and Expected Shortfall do for returns. Both point to the same practical answer: the tail of the drawdown distribution, not its single historical realization, is the relevant risk parameter.

Position Sizing Based on P95

The practical implication is direct. If you currently size positions based on historical maximum drawdown — either directly, by setting a fixed fractional risk relative to the reported drawdown, or indirectly, by scaling lot sizes to keep drawdown below some capital percentage threshold — you should replace the historical figure with P95.

The mechanics are straightforward. If your backtest reports a 10% maximum drawdown and P95 from Monte Carlo is 22%, and your maximum tolerable drawdown before stopping the strategy is 15% of capital, then:

Sizing based on historical max DD: 15 / 10 = 1.5x the backtest lot size seems acceptable. Sizing based on P95: 15 / 22 = 0.68x the backtest lot size is the correct figure. The historical-DD-based sizing is running more than double the intended risk.
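
The same arithmetic as a short sketch. It rests on the assumption implicit in the calculation above, namely that drawdown scales linearly with lot size. The function name is hypothetical.

```python
def size_multiplier(tolerable_dd, planning_dd):
    """Lot-size multiple that scales planning_dd up or down to tolerable_dd.

    Assumes drawdown is proportional to position size.
    """
    return tolerable_dd / planning_dd

tolerable = 0.15                                  # stop the strategy at 15% of capital
by_history = size_multiplier(tolerable, 0.10)     # backtest max drawdown of 10%
by_p95 = size_multiplier(tolerable, 0.22)         # Monte Carlo P95 of 22%
risk_overshoot = by_history / by_p95              # how oversized the naive sizing is

print(f"{by_history:.2f}x vs {by_p95:.2f}x, overshoot {risk_overshoot:.1f}x")
# 1.50x vs 0.68x, overshoot 2.2x
```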

This calculation is not conservative — it is accurate. The 22% P95 figure does not predict that you will experience a 22% drawdown. It predicts that, given your strategy’s return distribution, there is a 95% probability that your drawdown will stay below 22% across all plausible trade orderings. Sizing to survive a P95 drawdown is not pessimism; it is correctly calibrated risk management.

The CDaR framework proposed by Chekhlov et al. takes this further by incorporating drawdown constraints directly into the position sizing optimization. Rather than solving for position size as a function of some assumed worst-case drawdown, CDaR-optimized sizing explicitly penalizes strategies that spend significant time in drawdown, even if no single drawdown exceeds a hard limit. This captures something important: a strategy that is constantly in shallow drawdown — never catastrophically bad but always recovering — is more psychologically damaging and practically problematic than a strategy with occasional deep drawdowns and fast recoveries. The drawdown curve, not just its maximum, contains information about risk.

What the Ratio of P95 to Historical Max Drawdown Tells You

Beyond its use in position sizing, the P95-to-historical-MDD ratio is a diagnostic for the quality of the backtest itself. A ratio close to 1.0 would theoretically indicate perfect sequence independence — every ordering of trades produces approximately the same maximum drawdown. In practice, this never occurs. A very low ratio (below 1.3) is suspicious: it may indicate that the backtest is using a very long trade history that happens to include the true worst-case sequence, or that the strategy’s trades are unusually independent of one another.

A ratio above 3.0 raises different concerns. It typically indicates one or more of the following: the historical sample is short, giving few opportunities for adverse sequences to appear; the strategy’s returns are highly autocorrelated, meaning losses cluster; or the return distribution has heavy tails that the historical sample has not yet fully explored. Any of these conditions means the historical maximum drawdown is a particularly unreliable estimate of true risk.

The ratio is also sensitive to the shape of the loss distribution. Strategies that take small, frequent losses generate a different drawdown distribution than strategies that take rare, catastrophic losses. The P95 drawdown integrates across the full loss distribution. The historical maximum drawdown may be dominated by a single unusual trade that will not recur, or it may have avoided the conditions that would have produced the strategy’s genuinely worst-case sequence. Monte Carlo permutation exposes both possibilities.

The Specific Failure Mode: Oversized Positions from Overestimated Capital Efficiency

There is one practical failure mode that the P95 framework catches and historical maximum drawdown consistently misses. When a trader observes a low historical maximum drawdown, they may conclude that the strategy is highly capital efficient — that it can run at a large position size relative to account equity without risking meaningful drawdown. This conclusion is frequently wrong in exactly the worst situations.

Strategies with very low historical maximum drawdowns tend to have one of two characteristics. Either they genuinely have tight, sequence-independent return distributions — which a low ratio of P95 to historical MDD confirms — or they were backtested on a short or favorable historical period that happened to avoid the conditions that trigger the strategy’s worst sequences. The second case produces a strategy that looks excellent in backtest, gets allocated substantial capital based on the low drawdown figure, and then hits a P95-level drawdown in live trading that the trader had no framework to anticipate.

This is not a hypothetical. It is the most common mechanism by which algorithmically backtested strategies fail in live deployment. The strategy was real — the trades, the profit factor, the win rate were all genuine. What was not real was the maximum drawdown figure used to calibrate position sizing. The historical sequence happened to be favorable. The live sequence was not.

A Note on Methodology: Permutation vs Bootstrap

Monte Carlo drawdown analysis for trading strategies uses one of two resampling methods: permutation (shuffling the historical trades without replacement) or bootstrap (resampling with replacement, allowing individual trades to appear multiple times). The two methods answer subtly different questions.

Permutation tests the sensitivity of drawdown to trade sequence, holding the trade distribution fixed. Every simulated path contains exactly the same set of trades as the historical backtest, just in a different order. This directly addresses the sequence dependency question and is the appropriate method for asking: given that I would have taken these trades, what is my drawdown distribution?

Bootstrap allows trades to be repeated and others to be omitted. This tests robustness to sampling variation in the trade distribution itself — asking: if the future slightly resembles but does not exactly replicate the past, what does my drawdown distribution look like? Bootstrap methods tend to produce slightly wider drawdown distributions than permutation for the same underlying trade data, which is appropriate given that the future trade distribution is unknown.
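
The difference between the two resampling schemes is a single line of code. The sketch below uses a hypothetical backtest of 200 normally distributed trade results on an assumed $10,000 account. Permutation reorders the exact trade list, while bootstrap draws the same number of trades with replacement.

```python
import numpy as np

def max_drawdowns(pnl_paths, start_equity=10_000.0):
    """Max peak-to-trough decline per path, as a fraction of the running peak."""
    n_paths = pnl_paths.shape[0]
    equity = start_equity + np.cumsum(pnl_paths, axis=1)
    equity = np.hstack([np.full((n_paths, 1), start_equity), equity])
    peaks = np.maximum.accumulate(equity, axis=1)
    return ((peaks - equity) / peaks).max(axis=1)

def permutation_paths(trades, n_sims, rng):
    """Same trades in a new order: every path has the same final P&L."""
    return np.array([rng.permutation(trades) for _ in range(n_sims)])

def bootstrap_paths(trades, n_sims, rng):
    """Trades drawn with replacement: some repeat, some are left out."""
    return rng.choice(trades, size=(n_sims, trades.size), replace=True)

rng = np.random.default_rng(7)
trades = rng.normal(20.0, 100.0, size=200)  # hypothetical backtest P&L per trade

perm = permutation_paths(trades, 2_000, rng)
boot = bootstrap_paths(trades, 2_000, rng)

p95_perm = float(np.percentile(max_drawdowns(perm), 95))
p95_boot = float(np.percentile(max_drawdowns(boot), 95))
print(f"P95 permutation: {p95_perm:.1%}   P95 bootstrap: {p95_boot:.1%}")
print(f"final P&L spread, permutation: {perm.sum(axis=1).std():.2f}")
print(f"final P&L spread, bootstrap:   {boot.sum(axis=1).std():.2f}")
```

The final P&L spread makes the distinction visible: it is zero (up to rounding) under permutation and clearly positive under bootstrap, because bootstrap also varies which trades the strategy ends up taking.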

For the specific purpose of correcting the position sizing bias in historical maximum drawdown, permutation is the more direct method, because it isolates the sequence effect. Bootstrap is typically the more conservative of the two, since its drawdown distributions tend to be wider, and it provides additional information when stress-testing strategy robustness more broadly. Both are more informative than the historical maximum drawdown figure alone.

The Number to Use

Replace maximum drawdown with P95 in every position sizing and risk tolerance calculation you make. The historical maximum drawdown remains useful as a historical fact — it tells you what actually happened along the one path you tested. But it should never be used as a risk estimate for an unknown future path.

P95 from a Monte Carlo simulation of 1,000 or more permutations of your strategy’s historical trades is not a prediction. It is a distributional estimate: given your strategy’s return characteristics, this is the drawdown level that 95 out of 100 plausible future orderings of your trades will stay below. Setting your position size so that a P95-level drawdown is survivable within your risk tolerance is the minimum standard of quantitatively rigorous risk management.

The academic literature from Burghardt through Chekhlov through Goldberg and Mahmoud all arrives at the same conclusion through different mathematical routes: maximum drawdown as a single number from a single historical path is not a reliable risk measure. The tail of the drawdown distribution is. Monte Carlo simulation makes that tail visible. Using it is not optional for anyone serious about the relationship between their backtested strategy and the risk they are actually running.

The free Monte Carlo analyzer at ergodiclabs.co computes P95 drawdown from your MT4, MT5, or cTrader backtest report and shows it alongside the historical maximum drawdown. The ratio between the two is one of the first things worth examining in any backtest validation.