
Random Subspace Optimization (RSO)

October 6, 2013

[Figure: subspace]

In the last post, we discussed mean-variance optimization and its relationship to statistical theory. One of the challenges of optimization and creating trading models is trying to extract generalizable results when the search space is large. This is especially problematic when the quantity of data is small in relation to the number of features. This is known as the curse of dimensionality. A good slide (Gutierrez-Osuna, Intro to Pattern Recognition) depicts how adding features increases data sparsity in relation to sample size:

[Figure: curse of dimensionality]

 

The curse of dimensionality implies that there is an optimal number of features that can be selected in relation to sample size to maximize classifier performance. The challenge of reducing dimensionality is therefore critical:

[Figure: curse of dimensionality 2]

The same concepts apply to mean-variance or any other portfolio optimization method where there are a lot of inputs to estimate: a certain number of assets, and a calibration or lookback window used to estimate the inputs. In a typical portfolio optimization with "K" assets and a lookback window of "N" days, dimensionality can be expressed approximately as:

D=K/N

*Note: this excludes the number of different types of inputs required for the optimization (i.e. returns, covariances, variances).

A large K and a small N, or any other combination that produces a high ratio, will reduce the likelihood that an optimization is meaningful. If, for example, I perform mean-variance optimization on 500 stocks in the S&P 500 with 20 days of data, the resulting portfolio weights are unlikely to outperform a more naive equal weight benchmark. Ideally D would be as small as possible, while still preserving the recency of the data by using shorter lookback windows. Methods such as re-sampling/bagging and shrinkage are not very effective at reducing the dimensionality of the problem. I would like to introduce a simple method to address this problem directly.

Enter the "random subspace method" (RSM), which was introduced to improve classification on highly dimensional problems such as gene expression. The basic idea is that RSM randomly selects a smaller number of features from a much larger feature set and calibrates classifiers on the lower-dimensional "subspaces." These classifiers are then aggregated in an ensemble where their forecasts are averaged to produce a final forecast or classification. In RSO, this can be generalized as using a particular optimization method, selecting "k" assets randomly from the original pool of "K" assets (where k<K), and performing the optimization on each subspace. These weights are then aggregated in some manner (either weighted or equally-weighted) to find the final portfolio weights. This should produce a final portfolio that is more robust in most cases than running optimization on all assets simultaneously. To extend the example to the S&P 500, one might select 100 samples of 100 assets from the S&P 500, run optimization on each sample, and then aggregate the weights to determine the final portfolio. Here is the pseudo-code:

1) Select a value for “k”, and a number of random subspaces “s”

2) Choose k assets randomly from K assets in the universe and perform optimization

3) Repeat this procedure “s” times

4) Average the weights across the subspaces using some heuristic- this is the final set of portfolio weights
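To make the pseudo-code concrete, here is a minimal Python sketch of RSO. It is an illustration rather than a reference implementation: it equal-weights the subspace results, and the unconstrained minimum-variance routine is just a stand-in optimizer I made up for the example (so negative weights are possible).

```python
import numpy as np

def rso_weights(returns, k, s, optimizer, seed=None):
    """Random subspace optimization sketch: run `optimizer` on s random
    k-asset subsets of the K-asset universe and equal-weight the results."""
    rng = np.random.default_rng(seed)
    n_obs, n_assets = returns.shape
    weights = np.zeros(n_assets)
    for _ in range(s):
        idx = rng.choice(n_assets, size=k, replace=False)  # pick k of K assets
        weights[idx] += optimizer(returns[:, idx])          # optimize on the subspace
    return weights / weights.sum()                          # aggregate and renormalize

def min_var(sub_returns):
    """Stand-in optimizer: unconstrained minimum variance on the subspace."""
    cov = np.cov(sub_returns, rowvar=False)
    w = np.linalg.pinv(cov) @ np.ones(cov.shape[0])
    return w / w.sum()

# toy usage: 250 days of simulated returns for 50 assets, 100 subspaces of 10 assets
rets = np.random.default_rng(0).normal(0.0005, 0.01, size=(250, 50))
final_weights = rso_weights(rets, k=10, s=100, optimizer=min_var, seed=1)
```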

There are a lot of logical extensions to this idea, including varying the number of input factors used in the optimization: for example, one could assume that either the returns, covariances, or variances are all equivalent as a method of projecting subspaces. One could also use both assets and input factors at the same time to project subspaces. Furthermore, RSO is completely compatible with re-sampling/bagging and shrinkage, and these can be performed simultaneously in the same process. RSO can be employed with virtually any type of optimization procedure, with the caveat that it is slower and less compatible with stochastic methods.

Mean-Variance Optimization and Statistical Theory

October 3, 2013

[Figure: dimensions and subspace]

Mean-variance optimization (MVO) was introduced by Markowitz as a means of compressing forecasts into an expression of portfolio weights for asset allocation. The theory and mathematical concepts have become central to modern finance. Views on MVO are highly polarized: some feel that it is worthless, while others think it is the holy grail. Somewhere in the middle lies the truth. In reality, the strengths and weaknesses of MVO are rooted in statistical theory. MVO suffers from a lot of different weaknesses, many of which will not be addressed in this post; for example, MVO assumes that the data is normally distributed (model bias). In this post, the assumption is that MVO will use historical data rather than expected return/factor models. In this case, MVO requires the estimation of a large number of inputs for the return vector and the variance-covariance matrix. The number of different variables to estimate is:

2N + (N^2 - N)/2

where N is the number of assets in the optimization. The first term, 2N, counts the return and risk (variance) estimates, and the second term, (N^2 - N)/2, counts the covariances to estimate.
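As a quick sanity check of the formula, here is a trivial helper written for this post (not from the original):

```python
def mvo_estimate_count(n):
    """Inputs for mean-variance optimization with n assets:
    n returns + n variances + n*(n-1)/2 covariances."""
    return 2 * n + (n * n - n) // 2

print(mvo_estimate_count(10))   # 65
print(mvo_estimate_count(500))  # 125750
```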

For a modest-sized portfolio of 10 assets, this is 65 different estimates. For the S&P 500 it is 125,750 different estimates! If one is employing a dynamic optimization model with small amounts of data, this presents a major problem: the curse of dimensionality is a function of "combinatorial explosion," where the volume of the space increases so fast that the available data becomes sparse. In this context it becomes practically difficult to estimate variables; the amount of data required grows exponentially with the number of dimensions. This is a major problem since the biggest burden of computation lies with the estimation of the variance-covariance matrix. In this situation, a simple equal weight portfolio is expected to perform comparatively well since it does not require any estimates. Research in fact demonstrates that an equal weight portfolio often outperforms mean-variance, minimum-variance and other heuristic methods, especially on large data sets, and this is especially true when factoring in transaction costs. Another factor is the use of historical returns, which are notoriously difficult to estimate (while risk/volatility remains fairly predictable). Historical returns represent the most unbiased estimator (which is good), but they also tend to be a simplistic/poor representation of time series data, which tends to contain a combination of trend, mean-reversion, and random noise. This leads to under-fitting, since historical returns tend to dominate the weights of MVO. In statistical theory, this would place MVO as a "high-bias"/"low-variance" classifier.

There have been many different attempts to address these problems. The most frequent complaint about MVO is that small changes in the inputs lead to large changes in the output weights. Michaud, a pioneer in post-modern MVO, introduced re-sampling as a means of bagging/bootstrapping to improve upon out-of-sample performance and reduce estimation error. Re-sampled portfolios have greater diversification (their "transition maps" look better), but their out-of-sample performance is roughly equal to, if not sometimes inferior to, traditional MVO. Where re-sampling seems to produce more benefit is with minimum variance portfolios, and the potential reason will be addressed shortly. Bayesian methods that employ regularization attempt to reduce dimensionality explicitly by penalizing complexity. They "shrink" the observed data toward a logical or well-structured prior estimate to prevent over-fitting. Ledoit and Wolf introduced a pioneering method to shrink the variance-covariance matrix with this type of approach. Research shows that Ledoit-Wolf shrinkage often outperforms re-sampling, but it tends to add value primarily in minimum variance portfolios rather than MVO.
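For readers who want to experiment with shrinkage, one common off-the-shelf implementation is scikit-learn's LedoitWolf estimator. A small sketch on simulated data might look like the following (the dimensions and return parameters are arbitrary):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(0)
returns = rng.normal(0.0005, 0.01, size=(60, 25))   # 60 days of returns for 25 assets

sample_cov = np.cov(returns, rowvar=False)           # raw sample estimate
lw = LedoitWolf().fit(returns)
shrunk_cov = lw.covariance_                          # estimate shrunk toward a structured prior
print("shrinkage intensity:", lw.shrinkage_)
```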

In statistics, prediction error is a function of three factors (a good overview here): 1) bias, or the accuracy of the predictor/model (high bias corresponds to an overly simplistic predictor/under-fitting, while low bias corresponds to a very close match/over-fitting); 2) variance, or the sensitivity of the estimate to different data (low variance means that a change in the data does not change the predictor very much, while high variance means that a change in the data changes the predictor substantially); and 3) noise, which is self-explanatory. High bias is associated with low variance, and low bias is associated with high variance. Essentially, a model that is over-fit (low bias, high variance) is fragile if the future is unlike the past. In contrast, a model that is under-fit (high bias, low variance) is less fragile, but also fails to maximize predictive accuracy. Various methods such as bagging, boosting and regularization have been created to address these problems. All of these methods, with the exception of boosting, tend to reduce variance at the expense of raising bias. Variance reduction methods will improve out-of-sample performance when a predictor has high variance and low bias, but they will have much less impact when a predictor has low variance and high bias.

In MVO, the historical return estimates can be considered high-bias/low-variance, the risk/volatility estimates are in the middle (since risk is a good linear predictor), and the covariances are low-bias/high-variance since there are so many to estimate. In terms of estimation error impact on the weights, the historical returns are the most important, followed by risk and then by the covariances. This implies that MVO is primarily a high-bias, low-variance problem. Re-sampling is analogous to bagging/bootstrapping, which is primarily a means of reducing variance; it is less direct at trading off bias for lower variance than regularization/shrinkage, and that is why it is less successful much of the time. Since minimum variance optimization is partially a high-variance/low-bias problem, these methods have shown success in improving its out-of-sample performance. However, both re-sampling and shrinkage are ill-equipped to adequately address MVO, since the biggest driver is the return estimates, which are high-bias/low-variance. Reducing variance at the expense of bias is the opposite of what needs to be done to improve MVO if returns are still being considered. This explains the failure of both methods to improve out-of-sample performance when used within MVO. Both re-sampling and shrinkage are adequate for only one component of MVO: the VCV (variance-covariance matrix). A more important direction is to improve the return estimates themselves, using more complex models or more accurate predictors, and then reduce their variance with the same methods. The same is true for risk/volatility estimates, although they are already so good using historical data that the margin for improvement is small.

The Jack Welch PSO Algorithm (JW-PSO)

September 12, 2013

[Figure: Jack Welch]

Jack Welch is a corporate icon, and arguably one of the greatest CEOs of all time. One of the unique ideas he employed was a method (the 20-70-10 system) to optimize his workforce by identifying three different segments: the top 20%, or high-value employees; the middle 70%, who were mediocre but vital to the operation; and the bottom 10%, who were considered unproductive. In a previous post I talked about the "Jack Welch Portfolio Algorithm" concept in greater detail, and it is worthwhile reading for additional background on the subject. The 20-70-10 system has some interesting components that inspired me to create a new hybrid heuristic optimization algorithm.

Recently, I introduced readers to Particle Swarm Optimization (PSO) and Micro Genetic Algorithms (mGA). For a very interesting and detailed analysis of these approaches, I suggest readers look at a newcomer to the blogroll, Butler's Math. Andrew Butler is a mathematician with a background in optimization, and he does an interesting comparison of how these algorithms perform in solving different mathematical functions. PSO happens to be a very practical and intuitive algorithm for solving portfolio problems and different equations for financial modelling. The major drawbacks to using PSO are that it can be slow to converge (relative to a deterministic or gradient optimizer) and that it can get trapped in local optima more easily than Genetic Algorithms (GA), which tend to do a better job of searching the optimization space. In traditional parlance, GA is better at exploration but PSO is better at exploitation, and both methods employ a lot of iterations, and hence are slow to run on a computer versus gradient solvers. PSO starts with a fixed number of particles that are randomly generated. This initial swarm is used to move particles together to find the optimum. Unfortunately, as in chaos theory, PSO is sensitive to initial conditions: the location of the initial particles can affect the probability of finding the optimum and also the speed of reaching a solution. I have suggested that hybrid algorithms be considered specifically to address the weaknesses of different algorithms. In this case, Jack's 20-70-10 system presents some good ideas for improving upon the original PSO concept.

In JW-PSO, the first key change to the original algorithm is to "fire" the 10% of worst-performing particles, or the "C-players." These particles are replaced with brand new particles. This change has two benefits: 1) it enables potentially superior exploration by allowing new particles to search the space, and 2) it enables potentially superior exploitation and faster convergence by re-allocating resources away from potentially useless areas of the search space. The second key change is to move particles toward the "A-players," the top 20% in terms of the objective function. This also helps exploitation and potentially faster convergence by shifting particles toward the potentially more fruitful areas of the search space. The last change is to make the global best an "elite" global best that borrows the concept of elitism from mGA. The elite global best is the best position found since inception in terms of the objective function, rather than the best within the swarm of particles at the current moment. This helps to ensure that the optimal position is not lost through successive iterations, causing the particles to waste computational resources. The new equations for the JW-PSO are presented below:

[Figure: JW-PSO equations]
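To make the idea concrete, below is a minimal Python sketch of how these three modifications might be wired into a standard PSO loop. This is not the exact JW-PSO specification from the equations above: the inertia/cognitive/social coefficients, the 0.5 weight on the pull toward the A-player centroid, and the bounds are all arbitrary assumptions for illustration.

```python
import numpy as np

def jw_pso(objective, dim, n_particles=40, iters=200, w=0.7, c1=1.5, c2=1.5,
           bounds=(-5.0, 5.0), seed=0):
    """Sketch of a 'Jack Welch' PSO variant (minimization):
    - replace the worst 10% of particles each iteration with fresh random ones
    - pull all particles toward the centroid of the top 20% ('A-players')
    - keep an elite global best across all iterations."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))
    v = np.zeros((n_particles, dim))
    pbest = x.copy()
    pbest_f = np.apply_along_axis(objective, 1, x)
    gbest = pbest[np.argmin(pbest_f)].copy()
    gbest_f = pbest_f.min()

    for _ in range(iters):
        f = np.apply_along_axis(objective, 1, x)
        better = f < pbest_f                              # update personal bests
        pbest[better], pbest_f[better] = x[better], f[better]
        if pbest_f.min() < gbest_f:                       # elite global best since inception
            gbest_f = pbest_f.min()
            gbest = pbest[np.argmin(pbest_f)].copy()
        order = np.argsort(f)
        n_fire = max(1, n_particles // 10)                # "fire" the C-players
        fired = order[-n_fire:]
        x[fired] = rng.uniform(lo, hi, (n_fire, dim))
        v[fired] = 0.0
        n_top = max(1, n_particles // 5)                  # centroid of the A-players
        a_centroid = x[order[:n_top]].mean(axis=0)
        r1, r2, r3 = rng.random((3, n_particles, dim))
        v = (w * v
             + c1 * r1 * (pbest - x)
             + c2 * r2 * (gbest - x)
             + 0.5 * r3 * (a_centroid - x))               # extra pull toward A-players
        x = np.clip(x + v, lo, hi)
    return gbest, gbest_f

# usage: minimize the sphere function in 5 dimensions
best_x, best_f = jw_pso(lambda z: np.sum(z**2), dim=5)
```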


I would encourage readers to experiment with this concept and to think of new and novel ways to extend hybrid PSO algorithms to make them more practical and efficient.

Social Learning Algorithms: Particle Swarm Optimization (PSO)

September 6, 2013

[Figure: swarm1]

The picture above is not a Sharknado. It is a school of fish that are travelling together in a swarm demonstrating the properties of intelligent social behavior. This observation along with other examples in nature helped to inspire the creation of Particle Swarm Optimization (PSO).

PSO is a robust stochastic optimization method based upon the behavior of swarms observed in nature. The method captures the concept of social intelligence and co-operation. PSO was developed in 1995 by a social psychologist, James Kennedy, and an engineer, Russell Eberhart. The PSO method employs particles that exhibit individual and group behavior when looking for a solution within a search space. These particles also have a memory of where they have been. PSO is ideal for complex optimization problems with continuous variables, like finding portfolio or equation weights. I have created a simplified breakdown of the PSO methodology, which is actually much simpler and less esoteric than it appears. Essentially, PSO uses the familiar concept of momentum to move particles towards a solution.

[Figure: PSO equations]
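For readers who cannot see the image, one common textbook form of the PSO update rules (with inertia weight w, cognitive coefficient c1, social coefficient c2, and uniform random numbers r1 and r2) is:

v_i(t+1) = w * v_i(t) + c1 * r1 * (pbest_i - x_i(t)) + c2 * r2 * (gbest - x_i(t))

x_i(t+1) = x_i(t) + v_i(t+1)

where pbest_i is the best position particle i has found so far and gbest is the best position found by the swarm.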

A graphic depiction of the tension between the global best and the particle best in determining the new particle position is presented below:

[Figure: PSO graphic]

 

PSO is a very simple and useful methodology, and it has the added benefit of being much faster than Genetic Algorithms. However, its relative superiority is highly dependent on the type of problem. To make a broad and sweeping statement: the more continuous the variables, and the less noisy and more continuous the search space, the better PSO will perform versus Genetic Algorithms. But like anything else that is practical in the real world, it is best to consider hybrid methodologies that utilize the strengths of each method to create an even more flexible and efficient approach.


Mating and Uniform Crossover

September 5, 2013

[Figure: mating process]

 

In the last post, we showed a basic blueprint for Micro Genetic Algorithms. It is worthwhile clarifying the process of mating to produce offspring. The picture to the left demonstrates a basic blueprint for mating within genetic algorithms, but it does not describe the process used to select which features to swap or how to perform the selection. It always helps to have a concrete example. The simplest method to exchange features between two parents to produce offspring is a uniform crossover. In this method, a uniform random number is used to determine which features to swap. The purpose of producing offspring is to hopefully produce a better objective function value than the best individuals/chromosomes within the original population. In the example below, we are trying to predict the success rate of being accepted to business school by examining different criteria across candidates. After performing tournament selection, we find two parents and now wish to produce new offspring:

[Figure: uniform crossover example in Excel]
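A bare-bones Python version of uniform crossover is shown below. The candidate features and quintile values are hypothetical, loosely mirroring the business-school example above.

```python
import random

def uniform_crossover(parent_a, parent_b, p_take_a=0.5, rng=random):
    """For each feature, draw a uniform random number: the first child takes
    parent A's gene with probability p_take_a (otherwise parent B's), and the
    second child takes the other gene."""
    child1, child2 = {}, {}
    for feature in parent_a:
        if rng.random() < p_take_a:
            child1[feature], child2[feature] = parent_a[feature], parent_b[feature]
        else:
            child1[feature], child2[feature] = parent_b[feature], parent_a[feature]
    return child1, child2

# hypothetical parents: quintile ranks for each admissions criterion (1 = best)
parent_a = {"GMAT": 1, "GPA": 2, "work_experience": 3, "essays": 1}
parent_b = {"GMAT": 2, "GPA": 1, "work_experience": 1, "essays": 4}
offspring1, offspring2 = uniform_crossover(parent_a, parent_b)
```

Moving p_take_a away from 0.5 gives the biased variant discussed next.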

This is a very simple but effective way to produce offspring. There are more complex variations, such as weighting the probability of switching features based upon the expected value of a particular feature: in other words, we may wish to choose a specific breakpoint for our uniform random number to reflect the probability of selecting an inferior feature. For example, if we believe that quintile order is predictive, we may say that quintile 1 GMAT scores are better than quintile 2 scores. For this feature we may use the following rule:

if RAND()> .25 then quintile 1 GMAT,  if not then quintile 2

where RAND() is a uniform random number between 0 and 1

This reflects a probability of 75% that quintile 1 will be selected, and 25% that quintile 2 will be selected. It is possible to extend this concept in multiple ways, but the caveat is that you may prevent the search from finding a unique or counter-intuitive combination. A little a priori knowledge is useful, but too much can leave you trapped in a local optimum.
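In code, the biased rule above is just a thresholded uniform draw. This is a trivial sketch: the 0.25 breakpoint is the one from the example, and the function name is made up.

```python
import random

def biased_gmat_quintile(breakpoint=0.25, rng=random):
    """Return quintile 1 with probability 1 - breakpoint (75% here), else quintile 2."""
    return 1 if rng.random() > breakpoint else 2
```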

 

Micro Genetic Algorithm Schematic

September 3, 2013

[Figure: micro genetic algorithm schematic]

The Mighty (but humble) Micro-Genetic Algorithm (mGA)

August 30, 2013

[Figure: DNA]

A Gentle Introduction to Optimization

The field of optimization has evolved significantly over the past few decades. Several new theoretical, algorithmic, and computational methods for optimization have been proposed to solve complex problems in diverse fields such as finance, engineering and molecular biology. In finance, optimization is required to solve portfolio problems, model/predict time series data, create trading systems, and for implementing trade execution.

Optimization methods can be divided into two primary categories: deterministic and heuristic approaches. Both are essential for quantitative finance applications. Deterministic optimization is a highly mathematical and gradient-based form of optimization that is most suitable for well-defined problems with a smooth and continuous search space. Some examples of deterministic optimization are conjugate gradient, simplex, gradient descent, quadratic programming (used for Markowitz-type portfolio problems), non-linear solvers, and quasi-Newton or Newton methods. These methods are greedy and highly efficient, and if used for the appropriate application they are guaranteed to find the global optimum. Deterministic methods are analogous to Formula One race cars: they are very fast and precise and will find the finish line as long as they are used on a race track in good weather conditions. You wouldn't want to drive them through the forest without a map. Below is an example of the ideal type of optimization search space, a convex function:

[Figure: convex function]

Heuristic approaches, in contrast, are either algorithmic (i.e. a closed-form computation that approximates a feasible solution, such as the "Minimum Correlation Algorithm") or stochastic (relying on clever manipulations of random number generation to find a solution), and they can be used for highly complex problems without the need for expert knowledge. Traditional examples of heuristic methods are Monte Carlo (MC), particle swarm optimization (PSO), genetic algorithms (GA), simulated annealing (SA), and differential evolution. The class of algorithmic heuristic approaches is typically domain-specific (or even specific-purpose) and less generalizable than stochastic methods. Most scientists and engineers find heuristic methods to be more flexible and efficient than deterministic approaches for complex real-world problems, but finding the global optimum cannot be guaranteed. However, in a typical large-scale application a true exhaustive search is either not possible or not practical from a computation standpoint. Heuristic methods are like a team of Hummers that can communicate or compete with one another to find the end goal. This provides greater chances of success through the complex and unknown terrain that is often encountered when dealing with noisy time series data. An example of a search space that is only tractable with a heuristic approach (typically a GA) is the Rastrigin function:

[Figure: Rastrigin function]
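For reference, the Rastrigin function in n dimensions is usually written as f(x) = A*n + sum_i (x_i^2 - A*cos(2*pi*x_i)) with A = 10. In Python:

```python
import numpy as np

def rastrigin(x, A=10.0):
    """Highly multimodal test function; global minimum f(0, ..., 0) = 0."""
    x = np.asarray(x, dtype=float)
    return A * x.size + np.sum(x**2 - A * np.cos(2 * np.pi * x))
```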

The Swiss Army Knife: The Micro-Genetic Algorithm (mGA)

The Micro Genetic Algorithm is a very simple but powerful approach that can solve the most complex problems with greater speed than most pure (non-hybrid) heuristic methods. Virtually any problem in finance can be solved using this method as a building block. mGA is much faster than the standard Genetic Algorithm (SGA) and produces superior solutions without the need to estimate several additional parameter inputs such as the mutation rate. mGAs are also ideal for parallel processing and can dramatically reduce programming and processing time as well as memory requirements. In the next post, I will break down the steps to create an mGA application in more detail. For now, here is a flow chart of how the mGA works:

[Figure: micro genetic algorithm flow chart]
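Ahead of that breakdown, here is a rough Python sketch of the textbook micro-GA recipe (tiny population, tournament selection, uniform crossover, elitism, and a restart whenever the population converges). The population size, convergence tolerance, and bounds are illustrative assumptions, not a specification.

```python
import numpy as np

def micro_ga(objective, dim, pop_size=5, generations=200, bounds=(0.0, 1.0), seed=0):
    """Micro-GA sketch for minimization: no mutation operator; diversity is
    restored by restarting the population whenever it converges."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pop = rng.uniform(lo, hi, (pop_size, dim))
    best, best_f = None, np.inf
    for _ in range(generations):
        fit = np.apply_along_axis(objective, 1, pop)
        if fit.min() < best_f:
            best_f = fit.min()
            best = pop[np.argmin(fit)].copy()
        # restart: if the population has converged, reseed it around the elite
        if np.ptp(pop, axis=0).max() < 1e-3:
            pop = rng.uniform(lo, hi, (pop_size, dim))
            pop[0] = best                           # elitism: keep the best found so far
            continue
        def tournament():                           # binary tournament selection
            i, j = rng.integers(0, pop_size, 2)
            return pop[i] if fit[i] < fit[j] else pop[j]
        children = [best]                           # elitism
        while len(children) < pop_size:
            a, b = tournament(), tournament()
            mask = rng.random(dim) < 0.5            # uniform crossover
            children.append(np.where(mask, a, b))
        pop = np.array(children)
    return best, best_f

# usage: minimize the rastrigin sketch from earlier in this post
# best_x, best_val = micro_ga(rastrigin, dim=5, bounds=(-5.12, 5.12))
```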

Filtering White Noise

April 23, 2013


Most asset return processes can be characterized as containing a primary trend, along with mean-reversion around that trend, as well as a certain amount of random noise. Econometricians classify these elements using the Hurst exponent as either: 1) black noise (trending/positive autocorrelations, Hurst > 0.5); 2) pink noise (mean-reverting/negative autocorrelations, Hurst < 0.5); or 3) white noise (no trend/mean-reversion, low/insignificant autocorrelations, Hurst = 0.5). Intuitively, traders wish to capitalize on either the trend or the mean-reverting behaviour, often at different time frames, since they are part of the same unified process (trends tend to occur at longer time frames, and mean-reversion around that trend at shorter time frames). The key obstacle for both styles is to eliminate or minimize the impact of white noise on the indicators used to measure either trending or mean-reverting behavior. The failure to do so results in poor trading results due to false/random signals.
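As a rough illustration of how such a classification can be made (this is a quick diffusion-scaling estimate, not the method used in the research referenced below), the Hurst exponent can be approximated from how the dispersion of lagged differences of log prices scales with the lag:

```python
import numpy as np

def hurst_estimate(log_prices, max_lag=20):
    """Rough Hurst exponent: std of (p[t+lag] - p[t]) scales roughly as lag**H,
    so H is the slope of log(std) versus log(lag)."""
    log_prices = np.asarray(log_prices, dtype=float)
    lags = np.arange(2, max_lag)
    tau = [np.std(log_prices[lag:] - log_prices[:-lag]) for lag in lags]
    hurst, _ = np.polyfit(np.log(lags), np.log(tau), 1)
    return hurst

# a pure random walk (white noise increments) should come out near 0.5
walk = np.cumsum(np.random.default_rng(0).normal(0, 0.01, 2000))
print(hurst_estimate(walk))
```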

Consider two charts of the same time series (from Jonathan Kinlay's good blog): one is a black noise process, while the other contains a white noise process:

[Figure: black noise fractal random walk]

[Figure: white noise random walk]

In the first chart, a black noise process, it is easy to see how profitable it might be to use a simple moving average to trade the underlying: there is very little noise to speak of that is not self-reinforcing (trending). In the second chart, a white noise process, you can see the similarity to real financial time series. There appears to be a fair amount of random noise, and it would be more difficult to trade, for example, with a moving average. The chart below shows a pink noise process, which will look familiar to those who trade pairs as the log of the ratio of two cointegrated asset prices (i.e. one sector ETF versus the same sector from a different ETF provider).

[Figure: pink noise]

Notice that this process appears to have a stationary mean and predictable negative autocorrelation. It would be impossible to trade this series using a moving-average-based trend strategy. However, it would be an ideal dataset to trade using runs (i.e. buy on a down day, short on an up day). In practice, time series data contains elements of all three types of noise, and thus what we want to do is filter out the white noise, which is less predictable and obscures otherwise predictable asset behavior.

A recent paper by a colleague, George Yang, sheds light on how to go about filtering random/white noise elements and also shows the practical impact on trading system profitability. The paper recently won a prize in the prestigious Wagner Award competition, which is run through NAAIM. Mr. Yang shows that one can filter out "insignificant" data using a rolling or historical standard deviation threshold and extend indicators to use only "significant" data. For example, if one were to use a 200-day moving average on the S&P 500, you might stipulate that market moves between 0.25% and -0.25% are too small to be considered significant in defining the trend. That is, a small up or down day (or series of small days) may cause a trade that does not signal a true change in the underlying trend. This threshold can also be expressed, for example, as a fraction of a rolling standard deviation unit. To calculate the true 200-day moving average in the first case, one would eliminate all insignificant days from the data set and count back in time until there were 200 days of significant data to calculate the moving average. The results in the paper demonstrate that this type of filtering is effective at increasing the signal-to-noise ratio and improving trading results across a wide range of parameters. The paper also shows that the same technique is effective at improving a short-term mean-reversion system using runs. This highlights the potential of applications that can filter white noise from the data.
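As a sketch of the basic mechanic described above (using the 0.25% threshold from the example; this is a simplification for illustration, not George Yang's actual implementation):

```python
import numpy as np

def filtered_sma(prices, n=200, threshold=0.0025):
    """Noise-filtered moving average: discard days whose absolute return is
    below the threshold, then average the last n 'significant' closes."""
    prices = np.asarray(prices, dtype=float)
    returns = np.diff(prices) / prices[:-1]
    significant = np.abs(returns) >= threshold    # keep only meaningful moves
    sig_closes = prices[1:][significant]
    if sig_closes.size < n:
        return np.nan                             # not enough significant history yet
    return sig_closes[-n:].mean()
```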

There are multiple extensions to improve this concept, many beyond the scope of this post. However, one seemingly obvious method would be to require that insignificant days also have insignificant trading volume: presumably volume that is below average would signify a lack of conclusive agreement on the current market price. On the flip side, a seemingly small market move accompanied by very heavy trading volume could be a warning sign. Another method could look (on George's suggestion) at the high-to-low range for the day in relation to the past (i.e. like DV2). Presumably a tight daily range implies insignificant movement, while a wider range is more informative. One can picture using multiple filters to enhance the ability to separate truly significant from insignificant trading days. This would in turn significantly improve trading signal performance or forecasting ability.

Minimum Variance Algorithm Comparison Snapshot

April 19, 2013

The Minimum Variance Algorithm was compared to several standard optimization methods and algorithms in a recent set of tests done by Michael Kapler of Systematic Investor. Michael created a webpage for MVA to review some details of these tests and also to summarize some of the background information. We plan to release a whitepaper on MVA with some additional material in the coming weeks. Below is a summary of testing done across multiple data sets contained in the MCA paper. We used a standardized score (the NORMSDIST of the z-score) of the performance of each method versus the other methods using three metrics: 1) Sharpe ratio (higher is better), 2) volatility (lower is better), and 3) portfolio turnover (lower is better). These factors were weighted equally to create a composite score. We tested across a wide range of data sets: stocks, ETFs and futures. The Minimum Variance Algorithm (MVE in the chart below) scored the highest of all methods across datasets, outperforming standard minimum variance and also the Minimum Correlation Algorithm.

[Figure: MVA comparison summary]
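For reference, the composite scoring described above can be reproduced with a few lines of Python. The metric values below are made-up placeholders for three methods, not the actual test results.

```python
import numpy as np
from scipy.stats import norm

def composite_score(metric_values, higher_is_better):
    """Z-score each metric across methods, flip the sign where lower is better,
    map through the normal CDF (Excel's NORMSDIST), and average equally."""
    scores = []
    for values, better_high in zip(metric_values, higher_is_better):
        v = np.asarray(values, dtype=float)
        z = (v - v.mean()) / v.std()
        scores.append(norm.cdf(z if better_high else -z))
    return np.mean(scores, axis=0)

# hypothetical Sharpe, volatility and turnover for three methods
sharpe     = [1.10, 0.90, 0.80]
volatility = [0.10, 0.11, 0.14]
turnover   = [2.00, 3.50, 0.50]
print(composite_score([sharpe, volatility, turnover], [True, False, False]))
```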

 

The acronyms used in the chart are defined below.

MVE: Minimum Variance Algorithm (MVA), Excel version

MCE: Minimum Correlation Algorithm (MCA), Excel version

MC: Minimum Correlation Algorithm (MCA), whitepaper/R version

MC2: Minimum Correlation Algorithm 2 (MCA)

MV: Minimum Variance, standard minimum variance using a quadratic optimizer (long only)

MD: Maximum Diversification, standard maximum diversification using a quadratic optimizer (long only)

EW: Equal Weight

RP: Risk Parity, basic version using inverse volatility weighting

 

Minimum Variance Algorithm (MVA) Excel Sheet

April 5, 2013

The link below contains a spreadsheet example for computing MVA weights from multiple time series. Next week, I will try to show some different applications.

Minimum Variance Algorithm

 
