One of our most essential skills at Quantlane is differentiating between good and bad trading strategies. Running a loss-making trading strategy would be a very costly mistake, so we spend a lot of effort on assessing the expected performance of our strategies. This task gets harder when we have limited data for the evaluation, or when we experiment with a strategy for a long time and risk manually overfitting it to the same out-of-sample data.
But before we get into details, let's define the terms we are using here.
Introduction to modelling
A trading strategy is a set of rules for buying and selling market instruments, e.g. equities, based on market data. The set of rules can be handcrafted or implemented by a machine learning model trained on historical market data. An example of a handcrafted strategy could be 'Buy 10 shares when the price has dropped more than 5% in the last 24h, hold them for 48h, and then sell them.' Note that even though it is handcrafted, it has parameters that we can optimize automatically in a similar fashion to a machine learning model.
Optimization is the process of finding model parameters that maximize some measure of model performance. In our case, this measure is usually the profitability of the model in simulated trading on historical data. We call this simulation a backtest. Optimization can be done for any parametric model, be it a handcrafted ruleset of three parameters, like in the above example, or a neural network with a hundred thousand parameters.
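To make the terms concrete, here is a minimal sketch of a backtest and a grid-search optimization for the handcrafted rule above. Everything in it is an assumption made for illustration: the function names backtest and optimize, the synthetic random-walk prices, the parameter grid, and the simplifications (one price per hour, no fees, no overlapping positions). It is not how our production backtests work.

```python
import random
from itertools import product

def backtest(prices, drop_pct, lookback, hold, shares=10):
    """Simulated profit of the rule: buy `shares` when the price fell by more
    than `drop_pct` over the last `lookback` steps, sell `hold` steps later."""
    profit = 0.0
    t = lookback
    while t + hold < len(prices):
        change = prices[t] / prices[t - lookback] - 1.0
        if change < -drop_pct:
            profit += shares * (prices[t + hold] - prices[t])
            t += hold  # toy simplification: no overlapping positions
        else:
            t += 1
    return profit

def optimize(prices, grid):
    """Exhaustive grid search: return (best_profit, best_params)."""
    best = None
    for drop_pct, lookback, hold in product(*grid):
        profit = backtest(prices, drop_pct, lookback, hold)
        if best is None or profit > best[0]:
            best = (profit, (drop_pct, lookback, hold))
    return best

# Hypothetical data: a synthetic random walk standing in for hourly prices.
rng = random.Random(0)
prices = [100.0]
for _ in range(2000):
    prices.append(prices[-1] * (1.0 + rng.gauss(0.0, 0.01)))

# Candidate values for the three parameters: drop threshold, lookback, hold.
grid = ([0.02, 0.05, 0.08], [12, 24, 48], [24, 48, 96])
best_profit, best_params = optimize(prices, grid)
print(best_profit, best_params)
```

Note that the prices here are pure noise, so whatever profit the best parameters show is found by searching, not by any real market effect. That is exactly the situation the rest of this article is about.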
Overfitting occurs when a model performs much better on training data than on other data. We also say the training data are in-sample (IS) and the 'other' data are out-of-sample (OOS). An extreme example of overfitting is a model that simply memorizes the training data: it can reach even 100% accuracy on them, but it essentially makes random guesses on any other data. Note that it is normal for a model to have slightly better performance on training data.
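The extreme case can be reproduced in a few lines. The sketch below is a toy, with a hypothetical MemorizingModel class and random binary labels that carry no signal at all: the model stores every training pair and flips a coin on anything it has not seen.

```python
import random

class MemorizingModel:
    """Toy classifier: stores training pairs, guesses on unseen inputs."""

    def __init__(self, seed=0):
        self.table = {}
        self.rng = random.Random(seed)

    def fit(self, xs, ys):
        self.table = dict(zip(xs, ys))

    def predict(self, x):
        if x in self.table:
            return self.table[x]          # remembered from training
        return self.rng.choice([0, 1])    # random guess otherwise

def accuracy(model, xs, ys):
    hits = sum(model.predict(x) == y for x, y in zip(xs, ys))
    return hits / len(xs)

# Hypothetical data: distinct integer inputs with coin-flip labels.
data_rng = random.Random(1)
xs_is, xs_oos = list(range(1000)), list(range(1000, 2000))
ys_is = [data_rng.choice([0, 1]) for _ in xs_is]
ys_oos = [data_rng.choice([0, 1]) for _ in xs_oos]

model = MemorizingModel()
model.fit(xs_is, ys_is)
acc_is = accuracy(model, xs_is, ys_is)     # perfect: every answer is stored
acc_oos = accuracy(model, xs_oos, ys_oos)  # close to 0.5: pure guessing
print(acc_is, acc_oos)
```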
Overfitting issues are twofold. First, we need to evaluate whether a model overfits the training data. And second, we want to reduce the overfitting, ideally by reducing out-of-sample error (improving generalization) even at the cost of increasing in-sample error. In this article, we will focus on the first part: evaluation.
Overfitting evaluation metrics
The basic way to evaluate overfitting is comparing in- and out-of-sample performance in whatever metric you use. If we use an error metric such as MSE, we can introduce an overfitting ratio OR = MSE_OOS / MSE_IS. If the ratio is significantly greater than 1, the model seems to overfit. This is the most common metric to watch in your machine learning workflows.
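As a small illustration, the overfitting ratio could be computed as follows. The function names and the sample numbers are made up for this sketch. The handling of a zero in-sample error is also our assumption: a model that memorizes the training data has MSE_IS = 0, so the ratio is undefined and we return infinity (or 1 when both errors are zero).

```python
import math

def mse(y_true, y_pred):
    """Mean squared error of predictions against targets."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def overfitting_ratio(y_is, pred_is, y_oos, pred_oos):
    """OR = MSE_OOS / MSE_IS, with an assumed convention for MSE_IS == 0."""
    mse_is = mse(y_is, pred_is)
    mse_oos = mse(y_oos, pred_oos)
    if mse_is == 0.0:
        return math.inf if mse_oos > 0.0 else 1.0
    return mse_oos / mse_is

# Hypothetical targets and predictions.
y_is, pred_is = [1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5]   # MSE_IS = 0.25
y_oos, pred_oos = [5.0, 6.0], [4.0, 7.0]                     # MSE_OOS = 1.0
ratio = overfitting_ratio(y_is, pred_is, y_oos, pred_oos)
print(ratio)  # 4.0
```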
This method relies on one of the common machine learning assumptions: that the training data and the data the model will see in practice come from the same distribution. In other words, the system producing the data didn't change. However, in real-world scenarios, the system always changes. If we predict markets, the markets change; and if we predict customers and their behaviours, those change as well. So, it's easy to confuse model quality with systematic changes in the observed system. This is often addressed in data preprocessing and normalization, which aim to make the data as stationary as possible. Common preprocessing can involve, for example, removing long-term trends or seasonality, which are then handled outside the model.
As we usually work with time series data at Quantlane, we can also use time-series-specific overfitting measures. For example, if the model performance has been stable for a long time, it will most likely remain similar on near-term 'out-of-sample' data. Plotting performance over time is also very useful for examining changes in the behaviour of the observed system. Trading strategies, in particular, often 'deteriorate' and eventually stop working as more market participants discover them and, by trading them, eliminate the market inefficiencies that made the strategies work in the first place.
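One simple way to look at performance over time is a rolling average of daily profit. The sketch below uses synthetic data in which the strategy's edge decays linearly to zero. The decay shape, the noise level, and the 100-day window are all assumptions chosen only to make the deterioration visible.

```python
import random
from statistics import mean

def rolling_mean(values, window):
    """Mean of each consecutive run of `window` values."""
    return [mean(values[i - window:i]) for i in range(window, len(values) + 1)]

# Hypothetical daily PnL: an edge that fades from 1.0 to 0.0, plus noise.
rng = random.Random(7)
n_days = 500
daily_pnl = [(1.0 - day / n_days) + rng.gauss(0.0, 0.5) for day in range(n_days)]

smoothed = rolling_mean(daily_pnl, window=100)
# The first window averages the early, profitable days; the last window
# averages the late days, where the edge is almost gone.
print(smoothed[0], smoothed[-1])
```

Plotting smoothed against time would show the downward drift directly. A single aggregate number over the whole period would hide it.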
Out-of-sample testing caveats
The concept of out-of-sample testing is pretty hard to follow correctly. Strictly speaking, data stop being out-of-sample the moment you alter the model parameters or training parameters based on the out-of-sample evaluation results, even if you do so manually! By doing so, you can overfit the model to the out-of-sample data, which defeats their purpose for evaluation. So, in an ideal world, we would want to use new (and long enough) out-of-sample data for each serious evaluation round, which is usually impossible. Also, just running the training and evaluation several times can result in the last run showing significantly better performance than an average run purely by luck.
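The effect of luck is easy to simulate. In the sketch below, every 'run' is a strategy with zero true edge, so its out-of-sample score is pure noise. The number of runs, the length of the evaluation period, and the noise level are assumptions made for illustration.

```python
import random
from statistics import mean

def noise_strategy_score(rng, n_days=250):
    """Mean daily return of a strategy with no real edge (noise only)."""
    return mean([rng.gauss(0.0, 0.01) for _ in range(n_days)])

rng = random.Random(42)

# Evaluate 50 worthless variants on the same kind of out-of-sample period.
scores = [noise_strategy_score(rng) for _ in range(50)]
average_run = mean(scores)
best_run = max(scores)

# Re-evaluate the 'winner' on fresh data: it is still just noise.
fresh_score = noise_strategy_score(rng)
print(average_run, best_run, fresh_score)
```

Picking the best of many runs yields a positive-looking score even though no variant has any edge, and that advantage is not expected to carry over to fresh data.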
So far we have discussed how to detect overfitting so we can avoid it. But are overfit models always wrong? In practice, we are mostly interested in out-of-sample performance, regardless of whether the model is slightly better on in-sample data or completely flawless there, and thus totally overfit by definition. That means we should aim to maximize out-of-sample performance and not minimize a metric similar to the overfitting ratio we introduced earlier. For example, in time series prediction, an overfit model might include rules that worked in January, didn't work in February, and might work again in March.
In the end, evaluating models precisely is a fairly complex task that requires adequate understanding of the model, optimization, data, and the modelled system. In practice, we often have to get by with an imprecise evaluation done on a limited amount of data. That is usually sufficient for assessing a single model, but it is less reliable when comparing different versions of a model with similar performance. For more precise evaluation, we must employ advanced techniques such as k-fold cross-validation and also use more out-of-sample data.
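For reference, here is a minimal sketch of k-fold cross-validation. The helper names k_fold_splits and cross_validate, the 'predict the training mean' model, and the six data points are all hypothetical. The folds here are contiguous and unshuffled. With time series, folds that train on data later than the test fold leak future information, so forward-chaining variants (train only on the past) are a common alternative.

```python
from statistics import mean

def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_validate(data, k, fit, score):
    """Fit on each training part and score on the held-out fold."""
    scores = []
    for train, test in k_fold_splits(len(data), k):
        model = fit([data[i] for i in train])
        scores.append(score(model, [data[i] for i in test]))
    return scores

# Toy model: predict the training mean; score by mean squared error.
def fit_mean(train_values):
    return mean(train_values)

def mse_of_constant(prediction, test_values):
    return mean([(v - prediction) ** 2 for v in test_values])

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
fold_scores = cross_validate(data, 3, fit_mean, mse_of_constant)
print(fold_scores)  # [9.25, 0.25, 9.25]
```

Every data point serves as out-of-sample data exactly once, so the spread of fold_scores gives a rough idea of how much a single evaluation can vary.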