<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Patrick David on Medium]]></title>
        <description><![CDATA[Stories by Patrick David on Medium]]></description>
        <link>https://medium.com/@pdquant?source=rss-55da8ebe8a7f------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*Ir1bzCuraZZV4KJjOxUyWA@2x.jpeg</url>
            <title>Stories by Patrick David on Medium</title>
            <link>https://medium.com/@pdquant?source=rss-55da8ebe8a7f------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 11 Apr 2026 12:06:24 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@pdquant/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Stop using ChatGPT]]></title>
            <link>https://medium.com/@pdquant/stop-using-chatgpt-bdb059c806b3?source=rss-55da8ebe8a7f------2</link>
            <guid isPermaLink="false">https://medium.com/p/bdb059c806b3</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[chatgpt]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[productivity]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Patrick David]]></dc:creator>
            <pubDate>Thu, 19 Jun 2025 16:56:35 GMT</pubDate>
            <atom:updated>2025-06-19T16:56:35.603Z</atom:updated>
            <content:encoded><![CDATA[<p>… There is a better way …</p><h3>Welcome to GPT Island 🏝️</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ud7RzNyf18yp6o_QwmW1ng.jpeg" /><figcaption><a href="http://gptisland.com">gptisland.com</a></figcaption></figure><h3>Stop Using ChatGPT - Start Using GPT Island 🏝️.</h3><p>Let’s be honest, ChatGPT is an amazing AI but a not-so-amazing UI — tab switching, clunky interface, waiting around for answers, no control over context memory, unpredictable token limits, increasingly cluttered UI, breaking your flow with constant context shift …</p><h3><em>… there is a better way</em></h3><p>It’s called <strong>GPT Island </strong>🏝️ it keeps all the best of ChatGPT and solves all the aforementioned problems.</p><p><strong>You will never use ChatGPT.com again.</strong></p><h3>So… what is GPT Island?</h3><p>GPT Island 🏝️ is a beautiful Chrome extension that brings the power of ChatGPT (and DeepSeek) directly to <strong>every webpage</strong>. It places a beautiful chat UI at the bottom of every page with a set of unique features to make interacting with ChatGPT a joy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*w5cS5kFYKnwNIUPaLgO6zA.jpeg" /><figcaption>GPT Island in action</figcaption></figure><h3>Why GPT Island 🏝️ is Better Than ChatGPT</h3><h3>🖼️ 1. Persistent UI — Wherever You Are</h3><p>GPT Island 🏝️ adds a beautifully designed, intuitive chat interface right onto every webpage. Reading an article? studying for exams? Doing research? Just start typing — GPT Island🏝 is <em>always there</em>, without interrupting your flow.</p><h3>🔄 2. Tab Sync That Actually Works</h3><p>Stop switching back and forth from your work to you ChatGPT tab, GPT Island 🏝️syncs your conversations <strong>across tabs</strong>. Jump between tasks without losing your place, your chat, or your thoughts.</p><h3>🧠 3. Selective Memory — You’re in Charge</h3><p>GPT Island 🏝️ introduces a <strong>unique memory control</strong> feature. You choose what gets remembered – Have you ever got frustrated with ChatGPT not remembering that first prompt you gave? Ever wanted ChatGPT to ‘forget’ a particular prompt or response? GPT Island solves this problem – simply click the 🏝️ icon on any message and it gets added to your context memory. How many tokens are in memory? we got you! GPT Island shows your context memory token count in the memory button. want to clear context memory? Just click the memory button!</p><h3>💾 4. Local Chat Storage</h3><p>Your conversations are stored <strong>locally</strong> — not in the cloud, not on someone else’s server. That means full control over your data and peace of mind.</p><h3>🚀 5. DeepSeek Access</h3><p>GPT Island 🏝️ doesn’t limit you to ChatGPT. You also get access to <strong>DeepSeek</strong>, a powerful, high-quality reasoning model for nuanced, accurate reasoning — perfect for research, writing, and ideation.</p><h3>💬 6. Huge 5 Million Tokens Per Month</h3><p>That’s right. You get up to <strong>5 million tokens per month</strong>. More context. More memory. More capability. No more worrying about running out of tokens during complex tasks. You can get started with our free option and 5000 tokens.</p><h3>💬 7. Bring your own API key option — BYOAK</h3><p>Already have an openai chat completions <strong>API key</strong> ? 
with GPT Island you can have the best of both worlds — ChatGPT’s powerful language models and GPT Islands 🏝️beautiful UI.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*R5QEY63NAC0KCcJXUi6sjg.png" /><figcaption>GPT Island</figcaption></figure><h3>Who is GPT Island 🏝️ For?</h3><ul><li><strong>Writers</strong> looking for in-context assistance while drafting</li><li><strong>Developers</strong> needing fast AI help while coding</li><li><strong>Researchers</strong> analyzing articles and data</li><li><strong>Anyone</strong> who wants ChatGPT — but <em>beautiful</em>, more flexible, more powerful and more embedded into daily workflows.</li></ul><h3>The Bottom Line</h3><p>GPT Island 🏝️ turns ChatGPT from a tool you <em>visit</em> into a tool that <em>lives with you online</em>. It’s faster, more beautiful , more powerful — and it gives you the control you’ve always wanted. you’ll never go back 😎</p><p>So, stop using ChatGPT like it’s 2022.<br> <strong>Start using GPT Island </strong>🏝️<strong>.</strong></p><p>visit <a href="http://gptisland.com">gptisland.com</a> for more details.</p><p><em>Ready to try it?</em><br> 👉 <a href="https://chromewebstore.google.com/detail/ooohpemfdkijlakokkfliajibiiiagnm">Install GPT Island </a>🏝️ <a href="https://chromewebstore.google.com/detail/ooohpemfdkijlakokkfliajibiiiagnm">from the Chrome Web Store</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bdb059c806b3" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Build a BitCoin(tegration) Backtester]]></title>
            <link>https://medium.com/@pdquant/build-a-bitcoin-tegration-backtester-83e2b19125fd?source=rss-55da8ebe8a7f------2</link>
            <guid isPermaLink="false">https://medium.com/p/83e2b19125fd</guid>
            <category><![CDATA[cryptocurrency]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[blockchain]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[Patrick David]]></dc:creator>
            <pubDate>Tue, 19 Feb 2019 11:02:30 GMT</pubDate>
            <atom:updated>2019-02-19T11:02:30.722Z</atom:updated>
            <content:encoded><![CDATA[<h4>Learn the statistical technique of Cointegration and build your own crypto backtester to create and test a quantitative trading strategy.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/896/1*VVPMMaZUg77wb_Qnwb3rdA.gif" /></figure><h4>This tutorial is in 2 parts — (<strong>you can run the backtester as a separate standalone module</strong>) :</h4><ol><li>Learn the Statistical technique of Cointegration.</li><li>Build a Bitcoin Backtesting engine using Python to analyze the performance of a Cointegration based trading strategy.</li></ol><p><strong>Just want the code? </strong><a href="https://github.com/Patrick-David/BitcoinBacktester">click here</a>.</p><h4>What are we building</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/1*KBf6VUiI0tvfIU5l_ow98Q@2x.jpeg" /></figure><p>We are going to build a python based event-driven backtester that pulls 2 crypto securities <strong>Bitcoin </strong>(<strong>BTC</strong>)and <strong>Bitcoin Cash</strong> (<strong>BCH</strong>) from an API, passes it through a trading strategy that uses the mean reverting cointegration spread between the 2 securities and generates buy/sell signals when the spread hits ± 1 stdev. We then send these signals to the <em>Portfolio </em>class which handles the logic of the backtester. One time stamp will be pulled and processed at a time, allowing us to see what would have happened tick-by-tick. Finally we print the results to console (or jupyter notebook) and print out the PnL (profit and loss).</p><h3>Cointegration</h3><p>roadmap:</p><ul><li>what is time series</li><li>model assumptions</li><li>why does this happen</li><li>what is stationarity</li><li>orders of integration</li><li>cointegration</li></ul><h4><strong>Time Series</strong></h4><p>To understand Cointegration we first need to look at time series.</p><p>In cross sectional (non time series) regression models, if we are trying to predict some output value ‘Y’ we would have (one or more) corresponding input feature values ‘X’, we would learn some mapping between X and Y using something like least squares regression on a training set, then assess the performance on a test set. All very straight forward.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/470/1*BeS9sPwTuZFWucJpV4qhSw.png" /><figcaption>Regular (cross sectional) multiple regression model</figcaption></figure><p>In the case of time series, instead of using exogenous features ‘X’ we can use lags of the target output ‘Y’ in the form:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/371/1*JIXbWTkAmUolp9mH_pb9Rw.png" /><figcaption>Basic AR (autoregressive) time-series model</figcaption></figure><p>In simple terms, our predictor for today&#39;s observation is yesterdays observation. This is known as an Autoregressive model or AR model.</p><h4><strong>Model Assumptions</strong></h4><p>Most statistical models and techniques make the (often unrealistic) assumption of the input data being iid (independent and identically distributed); each data point being independent and all drawn from the same probability distribution.</p><p>For regression models we make the following assumptions:</p><ul><li><strong>Independence</strong> —<strong>Pr [ rank ⁡ ( X ) = p ] = 1.</strong> The input features (X) are statistically independent from each other. 
A full rank feature matrix.</li><li><strong>Linearity </strong>— <strong><em>y=b0 +b1x1…btXt.</em></strong> The relationship between dependent and independent variables is linear.</li><li><strong>Homoscedasticity </strong>— <strong>E[ <em>εi</em>² | <em>X</em> ] = <em>σ</em>². </strong>The variance of the errors is constant.</li><li><strong>Normality </strong>— <strong>ε ∣ X ∼ N ( 0 , σ²I n ). </strong>The error terms are normally distributed.</li><li><strong>No Autocorrelation</strong> — <strong>E[ <em>εiεj</em> | <em>X</em> ] = 0 for <em>i</em> ≠ <em>j</em>. </strong>The error terms are uncorrelated.</li><li><strong>Strict Exogeneity </strong>—<strong> E ⁡ [ ε ∣ …Xt-1, Xt, Xt+1… ] = 0. </strong>This assumes the errors are mean zero <strong>E[<em>ε</em>] = 0</strong>, and the errors are uncorrelated with the input features <strong>E[<em>X.ε</em>] = 0</strong>. Crucially this means each error term must be uncorrelated with <em>every value of X </em>past present and future.</li><li><strong>Weak Exogeneity </strong>(Optional — we’ll come back to this) — <strong>E ⁡ [ ε ∣ …Xt-1, Xt] = 0. </strong>Similar to ‘strict’ form except expectation only applies to <em>current </em>and <em>past</em> values (not future values of X).</li></ul><p>Failure to meet these any or all of these assumptions, can cause our models, whether inference or prediction, to be inefficient, inaccurate, incorrectly significant or harder to interpret than necessary. <a href="http://people.duke.edu/~rnau/testing.htm">See here for more information.</a></p><p>However, when we try to extend these regression assumptions from the cross-sectional domain to time-series, the two assumptions that are hardest to meet are the last two: Lack of <strong>Autocorrelation</strong> and <strong>Strict Exogeneity</strong> . In particular if these conditions do not hold then our regression coefficient estimate is biased as is the variance of the coefficient estimate.</p><h4><strong>Why does this happen?</strong></h4><p>The Law of Large Numbers (LLN) states that if our variables are iid (and have a finite expected value), then the sample means, variances and covariances will tend towards the true, population moments.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/456/1*NpJww2fI48aQJiQy1BnJsA.png" /><figcaption>Sample moments ==&gt; Population moments</figcaption></figure><p>When we have this nice asymptotic property of sample moments tending towards population moments, our regression estimators are efficient and unbiased, our standard errors and confidence intervals are trustworthy.</p><p>However, if we modify the above formula to make <em>xi </em>a function of time, x<em>t, </em>then the sample mean no longer converges to the population mean, it diverges to infinity!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/399/1*hHP7zVZINn-KCXQ1G-vHCg.png" /><figcaption>The mean diverges to infinity!</figcaption></figure><p>This example also extends to the variance and covariance diverging too. It should be obvious now that if we were to run regression or any statistical analysis on this time trend data, that our results would be unreliable. 
Furthermore, in some situations, as we increase the sample size, this can make our model even worse!</p><blockquote>This is an example of non stationary data.</blockquote><p>A more subtle example shows a time series that we call ‘cyclo-stationary’:</p><p><em>y</em>(<em>t</em>)=<em>μ </em>+ <em>A</em>cos(2<em>πt</em>) + e</p><p>in this example if we calculate the full sample mean, <em>y/n</em> does indeed converge to the population mean <em>μ</em>. However if we choose a fixed time window as our sample, say <em>t` </em>(<em>tee prime</em>), then we converge to a <em>different </em>mean: <em>μ</em>+<em>A</em>cos(2<em>πt</em>′). Note, this is still a constant mean, but the two means clearly are not equal:</p><p><em>μ </em>!= <em>μ</em>+<em>A</em>cos(2<em>πt</em>′)</p><p>This is an example of a time series that is (cyclo) stationary but not ergodic.</p><blockquote>Only if we have both a stationary <em>and </em>ergodic time series, can we loosen our assumption of <strong>Strict Exogeneity</strong> to <strong>Weak Exogeneity</strong>. This allows us to meet the model assumptions and have a reliable model.</blockquote><h4>What is stationarity?</h4><p>So we’ve learnt something about stationarity, but what <em>is </em>a stationarity?</p><p>Definition: If we take a time series {Xt} (or any sequence of random variables) and define the joint distribution of a consecutive sub sequence as Fx(Xt1…Xtn), then define a 2nd joint distribution from a 2nd sequence Fx(Xt1+ 𝜏…Xtn+ 𝜏) then for all 𝜏,t,n, if Fx== Fx, we have a stationary series in {Xt}.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*s8tt_R7ArUj7QlZVI0ARzQ.png" /><figcaption>If 2 random vectors have same distribution they are Stationary.</figcaption></figure><blockquote>In words, a stationary process is one whose joint probability distribution doesn&#39;t change with a shift in time.</blockquote><p>The above definition implies <strong>Strict Stationarity</strong>. If instead of the <em>whole</em> distribution being the same, we just have the mean and covariance consistent throughout the time series, then we have <strong>Weak Stationarity</strong>.</p><p><strong>Note</strong>: iid is a stronger assumption than stationarity because stationarity makes no assumption about the data being independent, just that they are identically distributed.</p><blockquote>All iid sequences are stationary but the reverse does not hold true.</blockquote><h4>Integration</h4><p>The penultimate step before we get to Cointegration, is the concept of Integration denoted as <strong><em>I</em></strong>(i). Lets define a simple time series where the regressors are the error terms, defined as <em>Y-Ŷ = ε, </em>the true value minus the predicted value. The <em>bj </em>terms are the ‘weights’, denoting how much each error term influences Yt.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/135/1*BSElMLqURasEXeN5gEkZVw.png" /><figcaption>Moving Average (MA) series</figcaption></figure><p>Given this moving average series, if the following condition holds, then we call the series <strong><em>I</em></strong>(0). This conditions states that the autocorrelation (<em>influence of the error terms on Yt’s</em>) decays such that the variance of <em>bk </em>doesn&#39;t blow up to infinity<em>. 
</em>The mathy term is ‘<em>square summable</em>’.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/133/1*Xlx9M7KTWcdVP7oquFriAw.png" /><figcaption>Weights decay</figcaption></figure><p><strong>Note</strong>:<strong><em> I</em></strong>(0) is necessary for stationarity but not sufficient, So all stationary series are <strong><em>I</em></strong>(0), but not all <strong><em>I</em></strong>(0) are stationary.</p><p>If we cumulatively sum an <strong><em>I</em></strong>(0) series we get an <strong><em>I</em></strong>(1) series. The following python code generates an <strong><em>I</em></strong>(0) series sampling from a standard normal, then cumulatively sums those values to get an <strong><em>I</em></strong>(1). Note how the <strong><em>I</em></strong>(1) looks remarkably like a stock price chart! We could reverse the <strong><em>I</em></strong>(1) by taking the 1st difference of the series, by taking each price minus the previous price:</p><pre>x = pd.Series(index=range(1000))</pre><pre>#generate samples from standard normal</pre><pre>for i  in range(1000):</pre><pre>x[i] = (np.random.normal(0,1))</pre><pre>x_i_zero = x</pre><pre>#cumulatively sum the I(0) series to make it I(1)</pre><pre>x_i_one = np.cumsum(x)</pre><pre>plt.plot(x_i_zero, label = &#39;I(0)&#39;)</pre><pre>plt.plot(x_i_one, label = &#39;I(1)&#39;)</pre><pre>plt.legend()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*9gG8YUl31HfDUWQl2_8HgA.png" /><figcaption><strong><em>I</em></strong>(0) cumulatively summed = <strong><em>I</em></strong>(1)</figcaption></figure><h4>Cointegration</h4><p>At a high level, if a linear combination of two or more non-stationary time series is stationary, then the entire set of time series is considered cointegrated.</p><blockquote>Definition:</blockquote><blockquote>Given a set of time series (or any sequence of RV’s) <em>{X</em>1, <em>X</em>2, …, <em>Xk</em>}, if all series are <strong><em>I</em></strong>(1) as is usually the case with financial data, then if some linear combination of them evaluates to an <strong><em>I</em></strong>(0) series, we call the set of time series Cointegrated.</blockquote><p>Formally, we are building a linear model (which we will see later can be done with regression) where the X’s are individually <strong><em>I</em></strong>(1) and therefore non stationary, that gives us a new singular time series Y, that is <strong><em>I</em></strong>(0) and stationary.</p><blockquote><em>Y</em>=<em>b</em>1<em>X</em>1+<em>b</em>2<em>X</em>2+⋯+<em>bkXk</em></blockquote><p>For example<em>, if X</em>1, <em>X</em>2, and <em>X</em>3 are all <strong><em>I</em></strong>(1), and the linear combination of 5<em>X</em>1+3<em>X</em>2+0<em>X</em>3=5<em>X</em>1+3<em>X</em>2 is <strong><em>I</em></strong>(0). Then in this case the time series set (X1, X2, X3) are cointegrated.</p><p>So how does this help us build a trading strategy? Well,</p><blockquote>if we can find 2 or more time series that are cointegrated, then that cointegrated time series, by definition would be <strong><em>I</em></strong>(0 ) and mean reverting. So we could generate signals whenever the series moved far away from its mean on the expectation that it will move back to the mean over time.</blockquote><h3>Building a BackTester</h3><blockquote>Lets build a BitCointegration BackTester! 
(Now it makes sense right?)</blockquote><p>roadmap:</p><ul><li>hypothesis testing</li><li>the strategy</li><li>building the backtester</li><li>backtesting pitfalls</li></ul><h4>Our hypothesis</h4><p>Before we begin our analysis and building of the backtester, we need to start with a hypothesis as to <em>why </em>two or more securities might be cointegrated.</p><p>This is an important starting point. If we simply scanned every tradable instrument over all time periods, then we would undoubtedly find a pair of instruments that showed cointegration. This is the curse of multiple comparison bias. Put simply, if you look at enough data, you will eventually find a result which matches your desired outcome, regardless of its statistical significance. Its crucial to understand this before we start and <a href="https://medium.com/@pdquant/stocks-significance-testing-p-hacking-how-volatile-is-volatile-1a0da3064b8a">I have an entire research piece on this topic here.</a></p><p><a href="https://medium.com/@pdquant/stocks-significance-testing-p-hacking-how-volatile-is-volatile-1a0da3064b8a">Stocks, Significance Testing &amp; p-Hacking: How volatile is volatile?</a></p><p>For our research, we will be using <strong>Bitcoin </strong>(<strong>BTC</strong>) and <strong>Bitcoin Cash </strong>(<strong>BCH</strong>). The base economic rationale for this is simple: BCH is a fork of BTC, therefore our hypothesis is that the 2 instruments *might* be cointegrated. We keep it as simple as that and then test this hypothesis.</p><p><strong>Note</strong>: <em>my post today is not about bitcoin or blockchain or the relative merits of one instrument over the other. There are plenty of other forums for that!. My aim is to teach the concept of cointegration and how to test for it statistically and how to build a backtester from scratch along with the many pitfalls. To get a quick overview of the difference between BTC and BCH </em><a href="https://coinsutra.com/btc-vs-bch-bitcoin-cash/"><em>read here</em></a><em>.</em></p><h4>The strategy</h4><p>Assuming that the analysis that we are about to do, finds that BTC and BCH are cointegrated, then based on what we have learned so far, we know that a linear combination of BTC and BCH, if cointegrated, will be stationary and therefore mean reverting. We will use this property to build a trading strategy, specifically as we’ll see in the next section, we will <strong>short</strong> the <strong>spread </strong>between <strong>BTC</strong> and <strong>b*BCH </strong>(b is a weight, that we will calculate) when it rises above 1 standard deviation (upper blue line) and <strong>long </strong>the<strong> spread </strong>if it moves below the lower blue line. Crucially we will take profit and <strong>close out</strong> the positition when the spread hits the mean (read line). Note, we dont have any stop loss implemented in this strategy, but it could easily be added.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/835/1*hQFdIXXSvK1izeLmh2g1Cw.png" /><figcaption>buy/sell signals generated at ±1std</figcaption></figure><p>When we go <strong>long </strong>the spread, this means we <strong>buy BTC</strong> and <strong>sell b*BCH</strong>. When we generate a <strong>short</strong> signal we <strong>sell BTC</strong> and <strong>buy b*BCH</strong>. 
Because this type of pairs trading strategy can get quite complicated we will restrict each buy/sell amount to $1000 each time we execute a trade.</p><blockquote>Example: if we generate a <strong>short </strong>signal, we would <strong>sell $1000</strong> of <strong>BTC</strong> and <strong>buy $1000</strong> of <strong>b*BCH</strong></blockquote><p>Th eagle eyed reader will notice that we are actually not buying <strong>$1000</strong> of <strong>BCH</strong> but <strong>$1000</strong> times some multiplier ‘<strong>b</strong>’ times <strong>BCH</strong>. Otherwise we would simply be trading the raw spread which is not cointegrated! We will account for this when we code the <em>Portfolio </em>class.</p><p>Here’s an overview of the Classes and methods we’ll be building:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/1*KBf6VUiI0tvfIU5l_ow98Q@2x.jpeg" /></figure><ul><li><strong>DataPuller </strong>— to pull, align and organize the data from API</li><li><strong>Portfolio </strong>— handles the logic of the cointegration pairs trade</li><li><strong>Strategy </strong>— runs event-driven backtester with ‘Portfolio’ as base class</li></ul><p>Lets get started with the usual imports:</p><pre><strong>#standard imports</strong></pre><pre>import requests<br>import numpy as np<br>import pandas as pd<br>import seaborn as sns<br>from scipy import stats<br>import matplotlib.pyplot as plt<br>%matplotlib inline</pre><pre><strong>#nice trick to make plots full width</strong></pre><pre>plt.rcParams[&#39;figure.figsize&#39;] = [15,5]</pre><p>We are going to use the <a href="https://min-api.cryptocompare.com/documentation?key=Price&amp;cat=SingleSymbolPriceEndpoint">CryptoCompare</a> API to get the price data:</p><pre><strong>#fetch daily OHLC prices for BTC, BCH</strong></pre><pre>btc = requests.get(&quot;<a href="https://min-api.cryptocompare.com/data/histoday?fsym=BTC&amp;tsym=USD&amp;limit=500">https://min-api.cryptocompare.com/data/histoday?fsym=BTC&amp;tsym=USD&amp;limit=500</a>&quot;).json()[&#39;Data&#39;]<br>bch = requests.get(&quot;<a href="https://min-api.cryptocompare.com/data/histoday?fsym=BCH&amp;tsym=USD&amp;limit=500">https://min-api.cryptocompare.com/data/histoday?fsym=BCH&amp;tsym=USD&amp;limit=500</a>&quot;).json()[&#39;Data&#39;]</pre><p>Next we put the data into a Pandas Dataframe and change the time column to a proper Pandas <em>DateTime </em>object and use the <em>rename </em>function to change the duplicate columns of ‘close’ to unique names, ‘btc’ and ‘bch’. 
We also select our starting date as 2017–12–12 (more on this later).</p><pre><strong>#put into dataframe</strong></pre><pre>btc_df = pd.DataFrame(btc)<br>bch_df = pd.DataFrame(bch)</pre><pre><strong>#use pandas datetime feature to convert timestamp into a datatime object with units = seconds</strong></pre><pre>btc_df[&#39;time&#39;] = pd.to_datetime(btc_df[&#39;time&#39;], unit=&#39;s&#39;)<br>bch_df[&#39;time&#39;] = pd.to_datetime(bch_df[&#39;time&#39;], unit=&#39;s&#39;)</pre><pre><strong>#use the newly created datetime object as index</strong></pre><pre>btc_df.set_index(&#39;time&#39;, inplace=True)<br>bch_df.set_index(&#39;time&#39;, inplace=True)</pre><pre><strong>#rename &#39;close&#39; for each instrument so they have unique names</strong></pre><pre>btc_df.rename({&#39;close&#39;:&#39;btc&#39;}, axis=1, inplace=True)<br>bch_df.rename({&#39;close&#39;:&#39;bch&#39;}, axis=1, inplace=True)</pre><pre><strong>#select our desired stating data</strong><br>btc_df = btc_df.loc[&#39;2017-12-12&#39;:]<br>bch_df = bch_df.loc[&#39;2017-12-12&#39;:]</pre><p>So here’s one of our cryptocurrency dataframes:</p><pre>btc_df.head()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/755/1*Og4Shsm6KYT7dbF5Ws-_Tw.png" /></figure><p>For our purpose we just want the closing prices of both BTC and BCH, so we will use the <em>concat</em> function in pandas to merge just the closing price columns:</p><pre><strong>#we&#39;ll work with just the closing pries for this project, so concatenate the 2 columns together.</strong></pre><pre>df = pd.concat([btc_df[&#39;btc&#39;], bch_df[&#39;bch&#39;]],axis=1)</pre><pre><strong>#we&#39;ll also add the raw spread as a column</strong></pre><pre><strong>#calculate the spread between the 2 prices, for reference only.<br>#We will be trading the &#39;cointegration spread&#39; instead.</strong></pre><pre>df[&#39;spread&#39;] = df[&#39;btc&#39;] - df[&#39;bch&#39;]<br>df.head()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/391/1*l_opb8vPVHqOkcCca1cZdg.png" /></figure><p>So that&#39;s our dataframe sorted, now we need to test the 2 time series, <strong>BTC </strong>and <strong>BCH </strong>to see if they are cointegrated. To do this we will import the <em>adfuller </em>and <em>coint </em>modules from <em>statsmodels </em>and select a training sample from both of our cryptocurrencies. We choose a 5 month window from the beginning of 2018. We also create a ‘spread’ series showing the difference between BTC and BCH just for reference.</p><p>The function <em>coint</em> basically fits a regression model, like we have already discussed and tests the <strong>null hypothesis</strong> that there is <strong>no cointegration</strong>, meaning we want to see a small p-value, so we can reject the null.</p><p><em>adfuller</em> tests for a ‘<a href="https://en.wikipedia.org/wiki/Unit_root">unit root</a>’ which would indicate the series is non stationary. 
Again we want a small p-value.</p><p><strong><em>adfuller </em></strong>implements the Augmented Dickey Fuller test for stationarity.</p><p><strong><em>coint</em></strong> implements the <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=3&amp;cad=rja&amp;uact=8&amp;ved=2ahUKEwiE8c-hjr7gAhXlSxUIHQrEC8EQFjACegQIBhAK&amp;url=https%3A%2F%2Fwarwick.ac.uk%2Ffac%2Fsoc%2Feconomics%2Fstaff%2Fgboero%2Fpersonal%2Fhand2_cointeg.pdf&amp;usg=AOvVaw1LWNNYdqFz_wZT4xiMmxh8">Engle Granger 2-Step method</a> for cointegration testing.</p><pre><strong>#test for cointegration</strong></pre><pre>from statsmodels.tsa.stattools import coint, adfuller<br>import statsmodels.api as sm</pre><pre><strong>#select a training sample</strong></pre><pre>btc_train, bch_train = df[&#39;btc&#39;].loc[&#39;2017-12-12&#39;:&#39;2018-4-30&#39;], df[&#39;bch&#39;].loc[&#39;2017-12-12&#39;:&#39;2018-4-30&#39;]<br>spread_train = btc_train - bch_train</pre><p>Lets throw our training set into the <em>coint </em>function and what we’re looking for is a p-value (<a href="https://medium.com/@pdquant/stocks-significance-testing-p-hacking-how-volatile-is-volatile-1a0da3064b8a">see here for more on p-values</a>) below a 5% significance level:</p><blockquote>This will imply <strong>BTC </strong>and <strong>BCH</strong> are cointegrated over the training period:</blockquote><pre><strong>#return p value<br>#coint returns 3 values t stat, p-value and critical value<br>#in python we can unpack all three on one line</strong></pre><pre>t,p,crit = coint(btc_train,bch_train)</pre><pre><strong>#test for significance</strong><br>print(p)<br>if p &lt;0.05:<br>    print(&#39;Cointegrated!&#39;)<br>else:<br>    print(&#39;NOT Cointegrated&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/247/1*G8Rrek9e4VPfi0Ne3iC-iA.png" /></figure><p>Great! <strong>Bitcoin </strong>and <strong>Bitcoin Cash</strong> appear to be <strong>cointegrated</strong>. Well if they’re cointegrated the spread between them must also be stationary right?</p><pre><strong>#use adf to test for stationarity</strong></pre><pre>pval_spread = adfuller(spread_train)[1]<br>if pval_spread &lt;0.05:<br>    print(pval_spread,&#39;Data is Stationary!&#39;)<br>else:<br>    print(pval_spread, &#39;Data is NOT Stationary!&#39;)</pre><pre><strong>#note the spread itself is Not stationary as it assumes a &#39;Beta&#39; value of 1<br>#so we need to construct a linear model to find the optimal Beta value...</strong></pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/510/1*XNzwd7So_Cwi0SuvDF-n8g.png" /><figcaption>Oops! The spread’s not stationary</figcaption></figure><p>So whats happened here? remember we defined cointegration as being a linear combination of the the time series that are stationary, not simply the raw 1-to-1 spread. But how do we find this linear combination? 
Well one technique we already know for finding a linear combination is linear regression!</p><p>If we have 2 time series <strong>X1</strong>,<strong>X2 </strong>then if we can define a linear model as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/205/1*EFuQMLVgJZKfZY91D2eSvA.png" /></figure><p>which is therefore cointegrated and if we rearrange the algebra we can get:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/193/1*p4TCf7rkOXx4rMWmh4zm9Q.png" /></figure><p>In other words, if there is a linear combination of <strong>X1</strong> and <strong>X2, </strong>that gives a spread which is <strong><em>I</em></strong>(0),<strong> </strong>then by definition the spread is stationary and mean reverting.</p><p>So what we need to do is build a simple linear model between <strong>BTC</strong> and <strong>BCH </strong>and use the slope coefficient ‘beta’ from that equation to build our stationary spread series defined as ‘z’:</p><pre><strong>#build linear model to find beta that gives I(0) combination of pair</strong></pre><pre>X = sm.add_constant(bch_train)<br>result = sm.OLS(btc_train,X).fit()</pre><pre><strong>#result.params returns the intercept (const) and slope of the model. #We can ignorethe intercept and use &#39;b&#39; to build our cointegrated #series!?</strong></pre><pre><br>print(result.params)</pre><pre><strong>#define new stationary spread as &#39;z&#39;<br>#&#39;b&#39; value gives the parameter of our linear model</strong></pre><pre>b = result.params[&#39;bch&#39;]</pre><pre><strong>#simply define our new cointegrated series as z = btc - b*bch</strong><br>z = btc_train - b*bch_train</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/242/1*mblfmk_I_LDcUT3oNxBaUA.png" /><figcaption>intercept and slope coefficients of linear model</figcaption></figure><p>Now if we run the augmented dickey fuller test on this new linear combination of <strong>BTC</strong> and <strong>b*BCH</strong>:</p><pre><strong>#run adf again, this time on linear combination &#39;Z&#39;</strong></pre><pre>plt.plot(z)<br>z_pval = adfuller(z)[1]<br>if z_pval&lt;0.01:<br>    print(z_pval,&quot;Huzzah!, it&#39;s Stationary&quot;)<br>else:<br>    print(z_pval,&quot;:Not stationary&quot;)<br>plt.axhline(z.mean())</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/983/1*kVWCW6887NUm5Vmi7imPjA.png" /><figcaption>Huzzah it’s stationary!</figcaption></figure><p>It’s stationary! This means we have found a linear combination of BTC and BCH which is stationary (at least over the training period).</p><p>Lets think about what the series ‘z’ actually is and how we can construct a trading strategy from it. ‘z’ represents the difference (spread) between <strong>1</strong> unit of <strong>BTC</strong> and <strong>3.99</strong> units of <strong>BCH</strong>,<strong> </strong>which we have shown to be stationary. Without going too deep into the inner workings of ADF (Augmented Dickey Fuller) it checks for a ‘unit root’ which is a fancy way of saying the moments (mean, variance etc) depend on time ‘t’ and are therefore non-stationary. We want to reject the null hypothesis that a unit root exists.</p><p>Lets produce some plots to show visually what we have done. 
The following code plots 3 charts</p><ol><li>The raw spread between BTC and BCH (not cointegrated), we use this as a reference.</li><li>THIS IS IMPORTANT — we plot the full time series of BTC — b*BCH (training + test set) ASSUMING that stationarity hold not just for the training set but ALSO the test set (we will see later the consequences of this).</li><li>Shows a plot of the daily returns, we don&#39;t actually use this series but we include it as a potential alternative predictive feature that we might want to test.</li></ol><p>This final point (3) highlights the fact that I have arbitrarily selected ± 1 standard deviation on the cointegrated spread, as our trading strategy. We could just as easily use daily returns breaching 1.2345 stdev as our signal, or anything else!</p><p>Marked in green is the end of training set|begining of test set. We will see later why our assumption of stationarity in the training set holding true for the test set, is a bad idea!</p><pre><strong>#calculate cointegrated series &#39;full_z&#39; for the whole (train + test) dataset</strong></pre><pre>spread = df[&#39;spread&#39;]</pre><pre>full_z = df[&#39;btc&#39;] - b*df[&#39;bch&#39;]</pre><pre><strong>#lets plot the raw spread, the stationary spread and for reference the &#39;spread daily percent change&#39; or &#39;returns&#39;<br>#the green vertical line shows the end of the training set period.</strong></pre><pre>fig,ax = plt.subplots(3,1,sharex=True)</pre><pre>plt.tight_layout()<br>ax[0].set_title(&#39;Spread&#39;)<br>ax[0].plot(spread)<br>ax[0].axhline(spread.mean(),color=&#39;r&#39;)</pre><pre><strong>#stationary series &#39;z&#39; plotted with 1 standard deviation horizontal bars shown<br>#note standard dev bars are arbitrary and could be anything</strong></pre><pre>ax[1].set_title(&quot;Linear model &#39;z&#39;&quot;)<br>#plot inverse so its same as &#39;Spread&#39;<br>full_z_mu = full_z.mean()<br>ax[1].plot(full_z)<br>ax[1].axhline(full_z_mu+full_z.std(),ls =&#39;--&#39;)<br>ax[1].axhline(full_z.mean(),color=&#39;r&#39;)<br>ax[1].axhline(full_z_mu-full_z.std(),ls =&#39;--&#39;)</pre><pre><strong>#spread pct change  / returns with 1 standard deviation horizontal bars shown<br></strong><br>spread_pct = spread.pct_change(1)<br>#print(new_diff.head())<br>#print(new_df.head())<br>ax[2].set_title(&#39;Spread daily % change&#39;)<br>ax[2].plot(spread_pct)<br>ax[2].axhline(spread_pct.std(),ls=&#39;--&#39;)<br>ax[2].axhline(spread_pct.mean(),color=&#39;r&#39;)<br>ax[2].axhline(-spread_pct.std(),ls=&#39;--&#39;)</pre><pre><strong>#mark end of training sample in green</strong><br>for i in range(3):<br>    ax[i].axvline(&#39;2018-4-30&#39;,color=&#39;g&#39;)<br>#new_diff.rolling(20).mean().plot(style=&#39;r+&#39;)<br>#plt.axhline(color=&#39;r&#39;)<br>#plt.text(390,0,&#39;ZERO&#39;)<br>#new_diff.rolling(10).mean().plot(style=&#39;--&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sGvKWRBpUQpN9lFVtrhSfw.png" /><figcaption>spread v ‘stationary spread’ v daily returns</figcaption></figure><p>By looking at the raw spread between BTC and BCH compared to the stationary spread, we can get a visual confirmation that the linear modeled spread ‘z’ appears to be reasonably bounded between ±1 std (blue lines) and centered around a constant mean (red line). This property seem to hold beyond the training set period (green line) too, but more on this later. 
The daily returns shown in the lower of the 3 plots, is just for reference and to show how volatility seems to correlate with major changes in the spread.</p><h3>The Backtester</h3><p>And finally we get to the actually backtester!</p><p>The first component we will build is the <em>Data_Puller</em> class.</p><p>we begin with the __init__() function by defining some variables we will use throughout the backtester; ticker1, ticker2 for holding the crypto pairs and a pandas dataframe named df3 to store the final results.</p><p>we define 2 functions (<em>actually they’re methods because they are within a class</em>):</p><p><strong>get_data()</strong>- to pull and merge the data from the API</p><p><strong>fetch_data()</strong>- to return the final dataframe so we can pass it to the next component.</p><pre><strong>#Data_Puller fetches crypto data, cleans then passes to container df3</strong></pre><pre><strong>#Class to store data for any pairs, crypto or otherwise</strong><br>class Data_Puller:<br>    def __init__(self,ticker1,ticker2,freq,periods):<br>        self.ticker1 = ticker1<br>        self.ticker2 = ticker2<br>        self.freq = freq<br>        self.periods = periods<br>        self.df3 = pd.DataFrame()<br>        <br>        <br>        <br><strong>    #method to pull, munge, store crypto pairs data</strong><br>    def get_data(self):<br>        #replace this in final merge<br>        b = 3.995977<br>        _data1 = requests.get(f&quot;<a href="https://min-api.cryptocompare.com/data/histo{self.freq}?fsym={self.ticker1}&amp;tsym=USD&amp;limit={self.periods">https://min-api.cryptocompare.com/data/histo{self.freq}?fsym={self.ticker1}&amp;tsym=USD&amp;limit={self.periods</a>}&quot;).json()[&#39;Data&#39;]<br>        _data2 = requests.get(f&quot;<a href="https://min-api.cryptocompare.com/data/histo{self.freq}?fsym={self.ticker2}&amp;tsym=USD&amp;limit={self.periods">https://min-api.cryptocompare.com/data/histo{self.freq}?fsym={self.ticker2}&amp;tsym=USD&amp;limit={self.periods</a>}&quot;).json()[&#39;Data&#39;]<br>        df1 = pd.DataFrame(_data1)<br>        df1_close = df1[&#39;close&#39;]<br>        df2 = pd.DataFrame(_data2)<br>        df2_close = df2[&#39;close&#39;]<br>        <br>        df1[&#39;time&#39;] = pd.to_datetime(df1[&#39;time&#39;],unit=&#39;s&#39;)<br>        df1.set_index(df1[&#39;time&#39;], inplace = True)<br>        <br>        df2[&#39;time&#39;] = pd.to_datetime(df2[&#39;time&#39;],unit=&#39;s&#39;)<br>        df2.set_index(df2[&#39;time&#39;], inplace = True)<br>        df1 = df1.drop([&#39;high&#39;,&#39;low&#39;,&#39;open&#39;,&#39;volumefrom&#39;,&#39;volumeto&#39;,&#39;time&#39;] ,axis=1)<br>        df2 = df2.drop([&#39;high&#39;,&#39;low&#39;,&#39;open&#39;,&#39;volumefrom&#39;,&#39;volumeto&#39;] ,axis=1)<br>        df1.rename(columns={&#39;close&#39;: &#39;BTC&#39;}, inplace=True)<br>        df2.rename(columns={&#39;close&#39;: &#39;BCH&#39;}, inplace=True)<br>        #print(df1.head())<br>        #print(df2.head())<br>        self.df3 = pd.concat([df1,df2],axis=1)<br>        #self.df3[&#39;spread&#39;] = self.df3[self.ticker1] - self.df3[self.ticker2]<br>        #self.df3[&#39;spread_pct_change&#39;] = self.df3[&#39;spread&#39;].pct_change()<br>        #add cointegration model X1 - X2 = should be stationary<br>        self.df3[&#39;full_z_coint&#39;] = self.df3[&#39;BTC&#39;] - b*self.df3[&#39;BCH&#39;]<br>        self.df3[&#39;b_x_bch&#39;] = b*self.df3[&#39;BCH&#39;]<br>        <br>        #prints df to check data<br>        print(self.df3)<br>        
<br><strong>    #returns the final dataframe, with 1st element dropped as its nan for spread_pct_change    </strong><br>    def fetch_df(self):<br>        return self.df3.loc[&#39;2017-12-12&#39;:]</pre><p>To show the output of this class, lets instantiate the class and pass in the arguments <strong>(‘BTC’,’BCH’,’day’,500) , </strong>day is the frequency and 500 is the number of days.</p><p>to display all the results we can use pd.set_option(‘display.max_rows’, 400) to show the first 400 entries.</p><pre>x = Data_Puller(&#39;BTC&#39;,&#39;BCH&#39;,&#39;day&#39;,500)<br><strong>#pd.set_option(&#39;display.max_rows&#39;, 400)</strong><br>x.get_data()</pre><pre><strong>#instantiate Data_Puller class then fetch_data</strong><br>q = x.fetch_df()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/657/1*wZIcsag6PnboIWE9B0DU8g.png" /></figure><p>At this stage its a bit ugly, but is has all the data we want. The <strong>full_z_coint </strong>column shows the spread between <strong>BTC</strong> and <strong>b*BCH </strong>(its the same as the ‘z’ variable we defined earlier in the post), this is effectively the ‘instrument’ we are going to trade as its stationary and mean reverting. Of course to trade this ‘spread’ we actually need to take a long position and a short position in BTC or b*BCH.</p><p>The variable b*BCH shows the value of BCH multiplied by the learned parameter ‘b’ which is approx 3.99, as we derived earlier.</p><p>So that’s <em>Data_Puller</em>. Next up is the<em> Portfolio </em>class. This is the most complex component as it does all of the heavy lifting in terms of trading logic and execution.</p><p>Lets walk through it.</p><p>The __init__() function defines a <em>dataframe </em>called _port() and we pass in the column names;</p><p><strong>ts</strong>=time stamp, <strong>signal</strong>= buy sell or hold (this logic will be built in the <em>Strategy </em>class next), <strong>action</strong>=indicates what action we took given the signal(bought,already bought, closed out etc),<strong> sold/bought</strong> value=dollar value of trade, <strong>U_pnl</strong>=the unrealized profit/loss showing the running pnl, <strong>R_pnl</strong>=realized profit/loss once we have actually closed out a position.</p><p>Next we initialize a few variables that will track what position we currently hold, running pnl etc.</p><p>Next is the <strong>close_out()</strong> function. This is important as it will be used throughout the backtester logic to close any open position when the trigger is met. We define the close out trigger event as being when the price hits or crosses the mean value.</p><p>What follows is the logic to handle each of the possible trade signals, which will be generated in the <em>Strategy </em>class; <strong>Hold, Long, Short</strong>. 
In each of these situations we do the following:</p><ul><li>check <strong>current_pos</strong> to see if we already have a position</li><li>calculate new s<strong>ell/buy units</strong> by taking our $1000 initial value and adjusting to today&#39;s price and quantity</li><li>if we have a position and the new price has crossed the mean value ‘close out’ threshold, then we run the <strong>close_out()</strong> function</li><li>update current position to reflect any changes</li><li>print out any actions to console</li><li>pass the new current values down to the end of the class to be added to our <strong>_port</strong> portfolio <em>dataframe</em></li></ul><pre><strong>#Portfolio class handles trade logic</strong><br>class Portfolio:<br>    def __init__(self):<br><strong>        </strong>#self.orders = pd.DataFrame(columns=[&#39;TS&#39;,&#39;Order&#39;,&#39;tick1&#39;,&#39;tick2&#39;])<br>        self._port = pd.DataFrame(columns=[&#39;ts&#39;,&#39;signal&#39;,&#39;action&#39;,&#39;sold_value&#39;,&#39;bought_value&#39;,&#39;U_pnl&#39;,&#39;R_pnl&#39;])<br>#        self.current_budget = 1000000<br>        self.signal = None<br>        self.prev = None<br>        #bought / sold<br>        self.current_pos= &quot;empty&quot;<br>        #self.pnl = pd.DataFrame(columns = [&#39;pnl&#39;])<br>        self.bought_sold_price = 0<br>        self.stamp = 0<br>        #self.sold_value = 0<br>        #self.bot_value = 0<br>        self.sell_units = 0<br>        self.buy_units = 0<br>        self.value_2 = 0<br>        self.value_1 = 0<br>        self.rpl = 0<br>    <br>    def close_out(self):<br>        self.rpl += (1000 - self.value_2) + (self.value_1 - 1000)<br>        self.current_pos =&#39;empty&#39;</pre><pre>print(&quot;close out position&quot;)<br>        <br>          <br>    def position(self,ts,tick1,tick2,price,tot_trade_amount=2000):<br>        print()<br>        print(self.stamp)<br>        print(&#39;current pos:&#39;,self.current_pos)<br>        print(&quot;bought / sold price: &quot;,self.bought_sold_price)<br>        print(&#39;this is prev:&#39;, self.prev)<br>        print(&#39;this is the signal:&#39;,self.signal)<br>        single_trade_amount = tot_trade_amount/2<br>        action = None<br>        <br><strong>#logic for Hold signal</strong>        <br>        <br>        if self.signal ==&quot;Hold&quot;:<br>            <br>            if self.current_pos ==&#39;sold&#39;:<br>                print(&quot;sold tick&quot;)<br>                self.value_2 = self.sell_units * tick2<br>                self.value_1 = self.buy_units * tick1<br>            elif self.current_pos == &#39;bought&#39;:<br>                self.value_2 = self.sell_units * tick1<br>                self.value_1 = self.buy_units * tick2<br>            else:<br>                print(&quot;Hold neither bought nor sold&quot;)<br>                self.value_2 = 0<br>                self.value_1 = 0<br>                <br>                <br>                <br>          <br>            print(&quot;hold 1&quot;)<br>            print(&quot;caputured by Hold&quot;)<br>            <br>            if self.current_pos == &#39;bought&#39; and price &gt; self.mu:<br>                print(&quot;hold 2&quot;)<br>                self.close_out()<br>                action = &quot;Closed out Long&quot;<br>                #self.current_pos =&#39;empty&#39;<br>                <br>            elif self.current_pos ==&#39;sold&#39; and price &lt; self.mu:<br>                print(&quot;hold 3&quot;)<br>                
self.close_out()<br>                action = &quot;Closed out Short&quot;<br>                #self.current_pos =&#39;empty&#39;<br>            else:<br>                print(&quot;hold 4&quot;)<br>                print(&quot;&quot;&quot;take no action -&gt; Hold&quot;&quot;&quot;)<br>                action = &quot;Held&quot;<br>        <br><strong>#logic for Short signal</strong>        <br>        <br>        elif self.signal ==&#39;Short&#39;:<br>            <br>            print(&quot;caputrd by Short&quot;)<br>            sell_units = single_trade_amount/tick2<br>            buy_units = single_trade_amount/tick1<br>            <br>            if self.signal == &#39;Short&#39; and  self.signal != self.prev:<br>                print(&quot;short 1&quot;)<br>                if self.current_pos == &#39;bought&#39;:<br>                    self.value_2 = self.sell_units * tick2<br>                    self.value_1 = self.buy_units * tick1<br>                    self.close_out()<br>                elif self.current_pos == &#39;empty&#39;:<br>                    <br>                    print(&quot;short 2&quot;)</pre><pre><br>                    print(&quot;Went short: sold&quot;,sell_units,&quot;units of BTC&quot;,&quot;at a price of&quot;,tick2, &quot;and bought&quot;,buy_units,&quot;of b*BCH at a price of&quot;,tick1)<br>                    #self.sold_value = sell_units*tick2<br>                    #self.bot_value = buy_units*tick1<br>                    self.sell_units = sell_units<br>                    self.buy_units = buy_units<br>                    self.value_2 = self.sell_units * tick2<br>                    self.value_1 = self.buy_units * tick1</pre><pre>self.bought_sold_price = tick2 - tick1<br>                    self.current_pos = &#39;sold&#39;<br>                    action = &quot;Went Short!&quot;</pre><pre>else:<br>                    print(&quot;short 5&quot;)<br>                    print(&quot;current pos must be already sold - check!&quot;)<br>                    action = &quot;Already Short!&quot;<br>                    self.value_2 = self.sell_units * tick2<br>                    self.value_1 = self.buy_units * tick1</pre><pre>else:<br>                print(&quot;short 6&quot;)<br>                print(&quot;prev signal must be Short - check!&quot;)<br>                action = &quot;Already Short!&quot;<br>                self.value_2 = self.sell_units * tick2<br>                self.value_1 = self.buy_units * tick1</pre><pre><br>            <br><strong>#logic for Long signal</strong>            <br>        <br>        elif self.signal ==&#39;Long&#39;:<br>            <br>            print(&quot;captured by Long&quot;)<br>            sell_units = single_trade_amount/tick1<br>            buy_units = single_trade_amount/tick2<br>            <br>            if self.signal == &#39;Long&#39; and self.signal != self.prev:<br>                print(&quot;long 1&quot;)<br>                if self.current_pos == &#39;sold&#39;:<br>                    self.value_2 = self.sell_units * tick1<br>                    self.value_1 = self.buy_units * tick2<br>                    self.close_out() <br>                    action = &quot;short =&gt; close out&quot;<br>                elif self.current_pos == &quot;empty&quot;:<br>                    <br>                    print(&quot;long 2&quot;)</pre><pre><br>                    print(&quot;Went Long: sold&quot;,sell_units,&quot;units of b*BCH&quot;,&quot;at a price of&quot;,tick1, &quot;and bought&quot;,buy_units,&quot;of BTC at a price 
of&quot;,tick2)<br>                    #self.sold_value = sell_units*tick1<br>                    #self.bot_value = buy_units*tick2<br>                    self.sell_units = sell_units<br>                    self.buy_units = buy_units<br>                    self.value_2 = self.sell_units * tick1<br>                    self.value_1 = self.buy_units * tick2</pre><pre>self.bought_sold_price = tick2 - tick1<br>                    self.current_pos = &#39;bought&#39;<br>                    action = &quot;Went Long!&quot;<br>                    print(&quot;should be 1000&quot;, single_trade_amount)<br>                    print(&quot;tot trade amount&quot;, tot_trade_amount)</pre><pre>else:<br>                    print(&quot;long 5&quot;)<br>                    print(&quot;current pos must be already long - check&quot;)<br>                    action = &quot;Already Long!&quot;<br>                    self.value_2 = self.sell_units * tick1<br>                    self.value_1 = self.buy_units * tick2<br>            else:<br>                print(&quot;long 6&quot;)<br>                print(&quot;prev signal must be long - check!&quot;)<br>                action = &quot;Already Long!&quot;<br>                self.value_2 = self.sell_units * tick1<br>                self.value_1 = self.buy_units * tick2</pre><pre>else:<br>            print(&quot;not captured 1&quot;)<br>            print(&quot;not captured by buy sell or hold need to fix!&quot;)<br>            <br>            <br>        print(self.sell_units)<br>        print(self.buy_units)<br>        print(ts)<br>        #print(&quot;tick1: &quot;, tick1, &quot;tick2: &quot;, tick2)</pre><pre><strong>#calculate unrealized pnl and finally update _port()</strong></pre><pre>        urpl = (1000 - self.value_2) + (self.value_1 - 1000)<br>        self._port.loc[len(self._port)] = [ts,self.signal,action,self.value_2,self.value_1,urpl,self.rpl]</pre><pre>self.prev = self.signal<br>        self.stamp+=1</pre><p>Lets remind ourselves of the ‘spread’ that we are trading. Any time the price moves above the top blue line, we short the spread by selling <strong>BTC </strong>and buying <strong>b*BCH</strong>. If the price goes below the lower blue line we do the opposite; buy <strong>BCT </strong>and sell<strong> b*BCH. </strong>When the spread value crosses the mean (red line) we close out any position we may have.</p><p>Remember there is one thing missing from this trading strategy-stop loss!</p><p>This code generates a plot to show the ‘spread’ that we are trading. I’ve added the index number of the trades for reference:</p><pre><strong>#shows the spread we are trading with mean (red line) and +- 1std (blue lines)</strong></pre><pre>_mu = np.mean(q.full_z_coint)<br>plt.plot(q.full_z_coint)<br>plt.axhline(np.mean(q.full_z_coint),color=&#39;r&#39;)<br>plt.axhline(_mu+np.std(q.full_z_coint),color=&#39;b&#39;)<br>plt.axhline(_mu-np.std(q.full_z_coint),color=&#39;b&#39;)</pre><pre><strong>#plot every 5th index for debugging and reference</strong><br>for i ,txt in enumerate([x for x in range(len(q))]):<br>    if i%5==0:<br>        plt.annotate(txt,(q.index[i],q.full_z_coint[i]))<br>    <br>print(&#39;mu&#39;,np.mean(q.full_z_coint))<br>print(&#39;std&#39;,np.std(q.full_z_coint))</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/980/1*XKlUfAom9q7Z4aFP8fS0iA.png" /></figure><p>The final component is the <em>Strategy </em>class. 
This executes the event driven backtester by pulling one row of data at a time from the <em>Data_Puller</em> class, passing it through the backtester logic handled by the <em>Portfolio </em>class via the position() method. This is the core of an event driven backtester. Rather that simply vectorizing the whole time series and applying the trading rules to all points at the same time, we simulate what would happen with our strategy tick-by-tick. This gives us a much more realistic simulation and an expandable framework whereby we could add in functionality to account for transaction costs, slippage, liquidity, microstructure events etc, into our backtest.</p><pre><strong>#create strategy to perform on any pair.</strong><br>class Strategy(Portfolio):<br>    <br>    def __init__(self):<br>        #use Super to get Portfolio attrs<br>        Portfolio.__init__(self)<br>        #price_feed = Data_Puller().fetch_df()<br>        self.sdev = np.std(q.full_z_coint)<br>        self.mu = np.mean(q.full_z_coint)<br>        <br>    <br>    <strong>#go long / short if +- 1 std, sell when hit mean</strong><br>    def strat(self):<br>        <br>        while q.empty==False:<br>        <br>            <br>            <strong>#print(&#39;running...&#39;)<br>            #pop .loc and drop it...</strong><br>            btc,bch,ts,z_coint,b_x_bch = q.iloc[0]<br>            q.drop(q.head(1).index,inplace=True)<br>                        <br>            #compare to plus / minus 1 stdev -&gt; generate signal<br>            if z_coint &gt; self.mu + self.sdev:<br>                #self.orders.loc[len(self.orders)] = [ts,&#39;Short&#39;,btc,bch]<br>                self.signal = &#39;Short&#39;<br>                self.position(ts,b_x_bch,btc,z_coint)<br>                <br>            elif z_coint &lt; self.mu - self.sdev:<br>                #self.orders.loc[len(self.orders)] = [ts,&#39;Long&#39;,btc,bch]<br>                self.signal = &#39;Long&#39;<br>                self.position(ts,b_x_bch,btc,z_coint)<br>                <br>                            <br>            <br>            else:<br>                #self.orders.loc[len(self.orders)] = [ts,&#39;Hold&#39;,btc,bch]<br>                self.signal = &#39;Hold&#39;<br>                self.position(ts,b_x_bch,btc,z_coint)<br>                <br>            <br>            <br>            <br>            <br>            #print(self.current_position)<br>        <br>        print(&#39;Finished!&#39;)<br>        <br>            <br><strong>#function to return tick by tick printout and R_pnl chart</strong>               <br>    def get_portfolio(self):<br>        self._port.set_index(&#39;ts&#39;,inplace=True)<br>        plt.plot(self._port.R_pnl)<br>        plt.show()<br>        pd.set_option(&#39;display.max_rows&#39;, 400)<br>        return self._port.head(360)<br>        #return self._port</pre><p>Lets run the backtester and see what happens. First we instantiate the <em>Strategy </em>class and run the strat() method. 
<p>Let’s run the backtester and see what happens. First we instantiate the <em>Strategy </em>class and run the strat() method. This will start printing out a real-time, tick-by-tick display of various bits of information to show what the backtesting engine is doing:</p><pre>p = Strategy()<br><br>p.strat()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oy8oYqBNX8Tk-i6rp8YnqA.gif" /><figcaption>backtester prints out tick-by-tick events</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/977/1*RmNDAIjW5YEdJOMRIwMsJQ.png" /><figcaption>sample of print out as backtester runs</figcaption></figure><p>Here’s a selection of what’s included in the print out:</p><ul><li><em>index</em></li><li><em>previous signal</em> (for debugging)</li><li><em>captured by</em>: shows which logic is triggered (short/long/hold), again for debugging</li><li><em>short1, short2</em>: references which part of the ‘short’ logic is activated</li><li>a string printout of what we’ve bought and sold</li><li>the two floating point numbers are the ‘sell and buy units’, i.e. the amounts of BTC and BCH that we buy, given that each new position is always $1000 long and $1000 short.</li></ul><p>Finally we print out a plot showing the realized pnl and a pandas dataframe showing all the tick-by-tick events that have happened:</p><pre>p.get_portfolio()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/960/1*roxO04mJYQ6Jds8xZMNGNQ.png" /><figcaption>plot of Realized pnl in $</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/699/1*wWh4VD-P1o3EBZz9uIk61Q.png" /><figcaption>final printout of all events in backtester</figcaption></figure><p>We made a profit! But remember what we said earlier: the way this algo is structured means that there are no stop losses, only take-profit signals (assuming that stationarity holds).</p><h4>Backtesting pitfalls</h4><blockquote>The purpose of a backtester is to provide a historical simulation of how a strategy would have performed.</blockquote><p>Backtesting is one of the most misunderstood concepts in finance. In the financial literature it is often done badly, with many authors committing structural and statistical errors in their backtesters. Below is a list of the “7 sins of quantitative investing” <a href="http://newyork.qwafafew.org/wp-content/uploads/sites/4/2015/10/Luo_20150128.pdf">by Luo et al [2014]</a>.</p><ul><li><strong>Survivorship bias — </strong>using an investment universe that doesn’t include companies that went bankrupt / delisted. The S&amp;P500 today is different from the S&amp;P500 of 10 years ago.</li><li><strong>Look-ahead bias — </strong>using data that was not available at the time of the simulation.</li><li><strong>Storytelling — </strong>justifying the results after the event or simply selecting the data that fits your predetermined ‘story’.</li><li><strong>Data snooping — </strong>incorporating test data in training data.</li><li><strong>Transaction costs — </strong>simple backtesters don’t account for slippage, costs, fees etc.</li><li><strong>Outliers — </strong>using extreme results with a low probability of ever occurring again.</li><li><strong>Shorting — </strong>related to transaction costs, the cost of selling short is unknown unless you actually made the trade.</li></ul><blockquote>Glancing through this list you may notice something… We have committed most of these sins!</blockquote><p>Not only did we fall prey to some of these pitfalls, we also fell for many more! As an example, remember at the beginning of our analysis we arbitrarily split the data into training | test sets? 
Well, look what happens when we shift our <strong>training </strong>set window forward by just 2 weeks… <strong>it’s no longer cointegrated!</strong></p><pre><strong>#shift training set window forward by 2 weeks</strong><br>shifted_train_btc, shifted_train_bch = df[&#39;btc&#39;].loc[&#39;2017-12-26&#39;:&#39;2018-5-13&#39;], df[&#39;bch&#39;].loc[&#39;2017-12-26&#39;:&#39;2018-5-13&#39;]</pre><pre>t,p,crit = coint(shifted_train_btc,shifted_train_bch)</pre><pre><strong>#test for significance</strong><br>print(p)<br>if p &lt;0.05:<br>    print(&#39;Cointegrated!&#39;)<br>else:<br>    print(&#39;NOT Cointegrated&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/227/1*TPVTozqHY0P11nLrYmedvw.png" /></figure><p>As another example, we know our original linear model, fitted on the training set, is stationary, but what about the test set? After all, this is what counts when running a backtester:</p><p><strong>The test set is not stationary!</strong></p><pre><strong>#run adf again, this time on Test set</strong></pre><pre>plt.plot(z)<br>z_pval = adfuller(z)[1]<br>if z_pval&lt;0.01:<br>    print(z_pval,&quot;Huzzah!, it&#39;s Stationary&quot;)<br>else:<br>    print(z_pval,&quot;:Not stationary&quot;)<br>plt.axhline(z.mean(),color=&#39;r&#39;)<br>plt.axhline(z.mean() + z.std(),color=&#39;b&#39;)<br>plt.axhline(z.mean() - z.std(),color=&#39;b&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/407/1*FrAhnX5kxv4jQ-sV8eF6zw.png" /><figcaption>test set model is NOT stationary!</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/900/1*kaLapW2mI0Ndrp7k3TbfIA.png" /><figcaption>test set model is NOT cointegrated!</figcaption></figure><p>Clearly we have committed a number of mistakes when it comes to building a statistically robust backtester. The main issues in our case are based around the statistical properties of time-series data. In particular, one window of data can have a particular distribution, while another (close by) window can be completely different!</p><p>There is a labyrinthine rabbit hole we could go down here with backtesting pitfalls, but this post is long enough already.</p>
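<p>One way to see this window-dependence directly is to slide a fixed-length window along the data and re-run the cointegration test at each step. The following is only a rough sketch (it assumes the same <em>df</em> with ‘btc’ and ‘bch’ columns used earlier, and the 120-day window length is an arbitrary choice), but it makes the point that the p-value drifts in and out of significance as the window moves:</p><pre><strong>#rough sketch: rolling cointegration p-values (window length is an arbitrary choice)</strong><br>from statsmodels.tsa.stattools import coint<br><br>window = 120<br>p_vals = []<br>for start in range(len(df) - window):<br>    win = df.iloc[start:start + window]<br>    _, p_val, _ = coint(win[&#39;btc&#39;], win[&#39;bch&#39;])<br>    p_vals.append(p_val)<br><br>plt.plot(df.index[window:], p_vals)<br>plt.axhline(0.05, color=&#39;r&#39;)<br>plt.title(&#39;Rolling cointegration p-value (red line = 0.05)&#39;)<br>plt.show()</pre>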
<h4>Next steps</h4><p>We have learnt the statistical technique of cointegration, along with stationarity and time series analysis. We then learned how to test for it using the Python statsmodels functions <em>coint</em> and <em>adfuller.</em></p><p>We constructed a hypothesis for why BTC and BCH might be cointegrated, tested for it, and built a non-trivial, event-driven backtester using Python.</p><p>Finally we looked at what we did wrong and the potential pitfalls when conducting backtesting.</p><ul><li>Now you should run the backtester using either the Jupyter notebook or directly in the terminal using the .py file, <a href="https://github.com/Patrick-David/BitcoinBacktester">both of which can be found here</a>.</li></ul><p><a href="https://github.com/Patrick-David/BitcoinBacktester">Patrick-David/BitcoinBacktester</a></p><ul><li>For further exploration, try using some of the other crypto pairs available on the <a href="https://min-api.cryptocompare.com/documentation?key=Price&amp;cat=SingleSymbolPriceEndpoint">cryptocompare API</a> to see if they are cointegrated.</li><li>You could extend the logic of the <em>Portfolio </em>class to account for transaction costs.</li><li>You could try a rolling window when testing for stationarity, to ensure the whole series is stationary, not just a select window.</li><li>Currently the logic executes a new buy/sell trade at the same time as the signal is generated; you may want to try generating a buy/sell signal at close of play on one day, then executing the trade the next day (this isn’t necessary for crypto, which trades 24/7, but equities have a fixed trading day).</li><li>I’d love to hear your thoughts on any of the topics discussed!</li><li>Read my other blog posts on statistical testing…</li></ul><p><a href="https://medium.com/@pdquant/stocks-significance-testing-p-hacking-how-volatile-is-volatile-1a0da3064b8a">Stocks, Significance Testing &amp; p-Hacking: How volatile is volatile?</a></p><ul><li>If you are interested in Deep Learning and want to learn how backpropagation works, check out this tutorial…</li></ul><p><a href="https://medium.com/@pdquant/all-the-backpropagation-derivatives-d5275f727f60">All the Backpropagation derivatives</a></p><ul><li>Follow me for more on quant finance, deep learning and more!</li><li>Say hi on twitter at <a href="https://twitter.com/pdquant">twitter.com/pdquant</a></li></ul><p><a href="https://twitter.com/pdquant">Patrick David (@pdquant) | Twitter</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=83e2b19125fd" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Stocks, Significance Testing & p-Hacking: How volatile is volatile?]]></title>
            <link>https://medium.com/@pdquant/stocks-significance-testing-p-hacking-how-volatile-is-volatile-1a0da3064b8a?source=rss-55da8ebe8a7f------2</link>
            <guid isPermaLink="false">https://medium.com/p/1a0da3064b8a</guid>
            <category><![CDATA[finance]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-visualization]]></category>
            <dc:creator><![CDATA[Patrick David]]></dc:creator>
            <pubDate>Fri, 19 Oct 2018 21:39:53 GMT</pubDate>
            <atom:updated>2018-12-11T15:35:58.962Z</atom:updated>
            <content:encoded><![CDATA[<h4>October is historically the most volatile month for stocks, but is this a persistent signal or just noise in the data?</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i0X1f5ZwDis4212KzH361g.png" /><figcaption>Ave Monthly Vol Ranking</figcaption></figure><blockquote>Stocks, Significance Testing &amp; p-Hacking.</blockquote><blockquote>Over the past 32 years, October has been the most volatile month on average for the S&amp;P500 and December the least, in this article we will use simulation to assess the statistical significance of this observation and to what extent this observation could occur by chance. All code included!</blockquote><p>This post is now live on DataCamp — <a href="https://www.datacamp.com/community/tutorials/stocks-significance-testing-p-hacking">click here!</a></p><h4>Our goal:</h4><ul><li>Demonstrate how to use Pandas to analyze Time Series</li><li>Understand how to construct a hypothesis test</li><li>Use simulation to perform hypothesis testing</li><li>Show the importance of accounting for multiple comparison bias</li></ul><h4>Our data:</h4><p>We will be using <a href="https://github.com/Patrick-David/Stocks_Significance_PHacking">Daily S&amp;P500 data</a> for this analysis, in particular we will use the raw daily closing prices from 1986 to 2018 (which is surprisingly hard to find so I’ve made it <a href="https://github.com/Patrick-David/Stocks_Significance_PHacking">publicly available</a>).</p><p>The inspiration for this post came from <a href="https://www.winton.com/research/seasonal-volatility-and-the-multiplicity-effect">Winton</a>, which we will be reproducing here, albeit with 32 years of data vs their 87 years.</p><p>For those on Kaggle I’ve created an <a href="https://www.kaggle.com/pdquant/stocks-significance-testing-p-hacking/notebook">interactive Kernel</a> — give it an upvote!</p><h4>Wrangle with Pandas:</h4><p>To answer the question of whether the extreme volatility seen in certain months really is significant, we need to transform our 32yrs of price data into a format that shows the phenomena we are investigating.</p><p>Our format of choice will be <strong>average monthly volatility rankings (AMVR).</strong></p><p>The following code shows how we get our raw price data into this format:</p><pre><strong>#standard imports</strong><br>import numpy as np<br>import pandas as pd<br>import matplotlib.pyplot as plt<br>import matplotlib.patches as mpatches<br>%matplotlib inline<br>import seaborn as sns</pre><p>A nice trick to make charts display full width in Jupyter notebooks:</p><pre><strong>#resize charts to fit screen</strong><br>plt.rcParams[‘figure.figsize’]=[15,5]</pre><p>import our data and convert into daily returns using pct_change</p><pre><strong>#Daily S&amp;P500 data from 1986==&gt;2018</strong><br>url = &quot;<a href="https://raw.githubusercontent.com/Patrick-David/AMVR/master/spx.csv">https://raw.githubusercontent.com/Patrick-David/AMVR/master/spx.csv</a>&quot;<br>df = pd.read_csv(url,index_col=&#39;date&#39;, parse_dates=True)</pre><pre><strong>#To model returns we will use daily % change</strong><br>daily_ret = df[&#39;close&#39;].pct_change()</pre><pre><strong>#drop the 1st value which is a NaN</strong><br>daily_ret.dropna(inplace=True)</pre><pre><strong>#daily %change</strong><br>daily_ret.head()<br></pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/229/1*W0Tr-QeIzFWfy5gXrvcFlQ.png" /></figure><p>Now we use one of the more powerful tools in Pandas: Resample. 
This allows us to change the frequency of our data from daily to monthly and to use standard deviation as a measure of volatility. These are the design choices we have when constructing an analysis.</p><pre><strong>#use pandas to resample returns per month and take standard deviation as measure of Volatility</strong><br><strong>#then annualize by multiplying by sqrt of number of periods (12)</strong><br>mnthly_annu = daily_ret.resample(&#39;M&#39;).std() * np.sqrt(12)</pre><pre>print(mnthly_annu.head())</pre><pre><strong>#we can see major market events show up in the volatility</strong><br>plt.plot(mnthly_annu)<br>plt.axvspan(&#39;1987&#39;,&#39;1989&#39;,color=&#39;r&#39;,alpha=.5)<br>plt.axvspan(&#39;2008&#39;,&#39;2010&#39;,color=&#39;r&#39;,alpha=.5)<br>plt.title(&#39;Monthly Annualized vol - Black Monday and 2008 Financial Crisis highlighted&#39;)<br>labs = mpatches.Patch(color=&#39;red&#39;,alpha=.5, label=&quot;Black Monday &amp; &#39;08 Crash&quot;)<br>plt.legend(handles=[labs])</pre><p>A quick look at the annualized monthly vol shows major market events clearly, such as Black Monday and the 2008 Financial Crisis.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yFj116_LblV96ijsBxR5-w.png" /></figure><p>Here’s where we can use the power of pandas to group our volatility by year and create a ranking for each of the 12 months over all 32 years of data.</p><pre><strong>#for each year, rank each month based on volatility lowest=1 Highest=12</strong><br>ranked = mnthly_annu.groupby(mnthly_annu.index.year).rank()</pre><pre><strong>#average the ranks over all years for each month</strong><br>final = ranked.groupby(ranked.index.month).mean()</pre><pre>final.describe()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/222/1*Tx5LU0ojM7gS_9yZTS0oFQ.png" /></figure><p>This gives our final <strong>Average Monthly Volatility Rankings. </strong>Numerically we can see that month 10 (October) is the highest and 12 (December) is the lowest.</p><pre><strong>#the final average results over 32 years </strong><br>final</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/223/1*bnCcn2uSrBbAXoEIQYVu2A.png" /></figure><p>By plotting our AMVR we can clearly see the most volatile month has been October and the least volatile, December.</p><pre><strong>#plot results for S&amp;P AMVR: clearly October has the highest ave vol rank and December has the lowest. Mean of 6.45 is plotted</strong></pre><pre>b_plot = plt.bar(x=final.index,height=final)<br>b_plot[9].set_color(&#39;g&#39;)<br>b_plot[11].set_color(&#39;r&#39;)<br>for i,v in enumerate(round(final,2)):<br>    plt.text(i+.8,1,str(v), color=&#39;black&#39;, fontweight=&#39;bold&#39;)<br>plt.axhline(final.mean(),ls=&#39;--&#39;,color=&#39;k&#39;,label=round(final.mean(),2))<br>plt.title(&#39;Average Monthly Volatility Ranking S&amp;P500 since 1986&#39;)</pre><pre>plt.legend()<br>plt.show()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i0X1f5ZwDis4212KzH361g.png" /><figcaption>Ave Monthly Vol Ranking</figcaption></figure><p>So that’s our data, now onto Hypothesis testing…</p><h4>Hypothesis Testing: What’s the question?</h4><p>Hypothesis testing is one of the most fundamental techniques of data science, yet it is one of the most intimidating and misunderstood. 
The basis for this fear is the way it is taught in Stats 101, where we are told to:</p><p><em>perform a t-test, is it a one-sided or two-sided test?, choose a suitable test-statistic such as Welch’s t-test, calculate degrees of freedom, calculate the t score, look up the critical value in a table, compare critical value to t statistic ……</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_yxe_kGGtT9m4qd6ahGohQ.png" /></figure><p>Understandably, this leads to confusion over what test to conduct and how to conduct it. However, all of these classical statistical techniques for performing a hypothesis test were developed at a time when we had very little computing power, and they are simply closed-form analytical solutions for calculating a p-value, that&#39;s it! The added complication is that you need to pick the right formula for the given situation, due to their restrictive and sometimes opaque assumptions.</p><p>But rejoice!</p><p>There is a better way. Simulation.</p><p>To understand how simulation can help us, let’s remind ourselves what a hypothesis test is:</p><p>We wish to test<strong> “whether the observed effect in our data is real or whether it could happen simply by chance”.</strong></p><p>And to perform this test we do the following:</p><ol><li>Choose an appropriate ‘test statistic’: this is simply a number that measures the observed effect. In our case we will choose the<strong> absolute deviation in AMVR from the mean.</strong></li><li>Construct a Null Hypothesis: this is simply a version of the data where the observed effect is not present. In our case we will shuffle the labels of the data repeatedly (<a href="https://speakerdeck.com/jakevdp/statistics-for-hackers"><strong>permutation</strong></a>). The justification for this is detailed below.</li><li>Compute a p-value: this is the probability of seeing the observed effect amongst the null data, in other words, by chance. <strong>We do this through repeated simulation</strong> of the null data. In our case, we shuffle the ‘date’ labels of the data many times and simply count the occurrence of our test statistic as it appears through multiple simulations. A generic sketch of this recipe is shown below.</li></ol><blockquote>That’s hypothesis testing in 3 steps! No matter what phenomena we are testing, the question is always the same: “<strong>is the observed effect real, or is it due to chance?”</strong></blockquote><p><a href="http://allendowney.blogspot.com/2011/05/there-is-only-one-test.html">There is only one test! This great blog by Allen Downey has more details on hypothesis testing</a></p><p><strong>The real power of simulation is that we have to make explicit what our model assumptions are through code</strong>, whereas classical techniques can be a ‘black-box’ when it comes to their assumptions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UW9OJHyVcl0qWWikSv0kpQ.png" /><figcaption>Example: The left plot shows the true data and the observed effect with a certain probability (green). The right plot is our simulated null data with a recording of when the observed effect was seen by chance (red). This is the basis of hypothesis testing: what is the probability of seeing the observed effect in our null data?</figcaption></figure>
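<p>Before we apply this to the AMVR data, here is a minimal, generic sketch of the 3-step recipe above. The numbers and group labels are made up purely for illustration; the real test for our data follows below:</p><pre><strong>#toy permutation test: is the difference in group means real or just chance?</strong><br>import numpy as np<br><br>group_a = np.array([2.1, 2.5, 2.8, 3.0, 2.6])<br>group_b = np.array([2.0, 2.2, 2.4, 2.3, 2.1])<br><br><strong>#1. test statistic: the observed difference in means</strong><br>observed = group_a.mean() - group_b.mean()<br><br><strong>#2. null model: pool the data and shuffle the labels</strong><br>pooled = np.concatenate([group_a, group_b])<br>count = 0<br>n = 10000<br>for _ in range(n):<br>    np.random.shuffle(pooled)<br>    diff = pooled[:5].mean() - pooled[5:].mean()<br>    <strong>#3. count how often the shuffled (null) data looks at least as extreme as the observed effect</strong><br>    if abs(diff) &gt;= abs(observed):<br>        count += 1<br><br>print(&#39;p-value:&#39;, count / n)</pre>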
<p>The most important part of hypothesis testing is being clear about what question we are trying to answer. In our case we are asking:</p><p><strong>“Could the most extreme value happen by chance?”</strong></p><p>We define the most extreme value as the <strong>greatest absolute AMVR deviation from the mean</strong>. This question forms our null hypothesis.</p><p>In our data the most extreme value is the December value (1.23), not the October value (1.08), because we are looking at the biggest absolute deviation from the mean, not simply the highest volatility:</p><pre><strong>#take absolute AMVR from the mean, we see Dec and Oct are the biggest absolute moves (Oct to the upside, Dec to the downside), with Dec being the greatest.</strong></pre><pre>fin = abs(final - final.mean())<br>print(fin.sort_values())<br>Oct_value = fin[10]<br>Dec_value = fin[12]<br>print(&#39;Extreme Dec value:&#39;, Dec_value)<br>print(&#39;Extreme Oct value:&#39;, Oct_value)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/296/1*Rxiz4OXTV9adhrneGyw6tA.png" /></figure><h4>Simulation</h4><p>Now that we know what question we are asking, we need to construct our ‘Null Model’.</p><p>There are a number of options here:</p><ol><li><strong>Parametric models</strong>. If we had a good idea of the data distribution, or simply made assumptions about it, we could use ‘classical’ hypothesis testing techniques, <em>t-test, X², one-way ANOVA etc. </em>These models can be restrictive and something of a blackbox if their assumptions aren’t fully understood by the researcher.</li><li><strong>Direct Simulation. </strong>We could make assumptions about the <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=4&amp;cad=rja&amp;uact=8&amp;ved=2ahUKEwj0qMn97IDeAhViBsAKHUEJDsQQFjADegQIBRAC&amp;url=http%3A%2F%2Fwww.rimini.unibo.it%2Ffanelli%2Feconometric_models2_2012.pdf&amp;usg=AOvVaw1IL6i7LNyZ16Yp_mz7-Ek_">data generating process</a> and simulate directly. For example we could specify an ARMA time series model for the financial data we are modeling and deliberately engineer it to have no seasonality. This could be a reasonable choice for our problem; however, if we knew the data generating process for the S&amp;P500 we would be rich already!</li><li><strong>Simulation through Resampling. </strong>This is the approach we will take. By repeatedly sampling at random from the existing dataset and shuffling the labels, we can make the observed effect equally likely amongst all labels in our data (<em>in our case the labels are the dates</em>), thus giving the desired null dataset.</li></ol><p>Sampling is a big topic but we will focus on one particular technique<strong>, permutation </strong>or shuffling<strong>.</strong></p><p>To get the desired null model, we need to construct a dataset that has <strong>no seasonality</strong> present. If the null is true, <em>that there is no seasonality in the data and the observed effect was simply by chance</em>, then the labels for each month (Jan, Feb etc) are meaningless and therefore we can shuffle the data repeatedly to build up what classical statistics would call the <a href="https://www.youtube.com/watch?time_continue=441&amp;v=5Dnw46eC-0o">‘<em>sampling distribution of the test statistic under the null hypothesis</em>’</a>. 
This has the desired effect of making the observed phenomena (<em>the extreme December value</em>) <strong>equally likely</strong> for all months, which is exactly what our null model requires.</p><p>To prove how powerful simulation techniques are with modern computing power, the code in this example will actually permute the <em>daily </em>price data, which requires lots more processing power, yet still completes in seconds on a modern CPU. Note:<em> shuffling either the daily or the monthly labels will give us the desired null dataset in our case.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-rzHR7JrlTjcXAYp4zKCUA.gif" /><figcaption>Shuffle ‘date’ labels to create null dataset</figcaption></figure><p>A great resource for learning about sampling is by <a href="http://www.resample.com/intro-text-online/">Julian Simon.</a></p><p>Note: <em>The way our test is constructed is the equivalent of a two sided test using ‘classical’ methods, such as Welch’s t-test or ANOVA etc, because we are interested in the most extreme value, either above or below the mean.</em></p><p>These decisions are design choices and we have this freedom because <strong>the Null Model is just that, it’s a Model!</strong> This means we can specify its parameters as we choose, the key is to really be clear what question we are trying to answer.</p><p>Below is the code to simulate 1000 sets of 12 AMVR, permuting the date labels each time to build up the sampling distribution. The output from this code is included below in the p-hacking section…</p><pre><strong>#as our Null is that no seasonality exists or alternatively that the month / day does not matter in terms of AMVR, we can shuffle &#39;date&#39; labels.<br>#for simplicity, we will shuffle the &#39;daily&#39; return data, which has the same effect as shuffling &#39;month&#39; labels</strong></pre><pre>#generate null data</pre><pre>new_df_sim = pd.DataFrame()<br>highest_only = []</pre><pre>count=0<br>n=1000<br>for i in range(n):<br><strong>    #sample same size as dataset, drop timestamp</strong><br>    daily_ret_shuffle = daily_ret.sample(8191).reset_index(drop=True)<br><strong>    #add new timestamp to shuffled data</strong><br>    daily_ret_shuffle.index = (pd.bdate_range(start=&#39;1986-1-3&#39;,periods=8191))<br>    <br>    <strong>#then follow same data wrangling as before...</strong><br>    mnthly_annu = daily_ret_shuffle.resample(&#39;M&#39;).std()* np.sqrt(12)<br>    <br>    ranked = mnthly_annu.groupby(mnthly_annu.index.year).rank()<br>    sim_final = ranked.groupby(ranked.index.month).mean()<br>    <strong>#add each of 1000 sims into df</strong><br>    new_df_sim = pd.concat([new_df_sim,sim_final],axis=1)<br>    <br>    <strong>#also record just highest AMVR for each year (we will use this later for p-hacking explanation)</strong><br>    maxi_month = max(sim_final)<br>    highest_only.append(maxi_month)</pre><pre><strong>#calculate absolute deviation in AMVR from the mean</strong><br>all_months = new_df_sim.values.flatten()<br>mu_all_months = all_months.mean()<br>abs_all_months = abs(all_months-mu_all_months)</pre><pre><strong>#calculate absolute deviation in highest only AMVR from the mean</strong><br>mu_highest = np.mean(highest_only)<br>abs_highest = [abs(x - mu_all_months) for x in highest_only]</pre><h4>p-Hacking</h4><p>Here’s the interesting bit. 
We’ve constructed a hypothesis to test and we’ve generated simulated data by shuffling the ‘date’ labels of the data; now we need to perform our hypothesis test to find the probability of observing a result as significant as the December result, given that the null hypothesis (<em>no seasonality</em>) is true.</p><p>Before we perform the test, let’s set our expectations.</p><p><strong>What’s the probability of seeing <em>at least one</em> significant result given a 5% significance level?</strong></p><p>= 1-p(not significant)</p><p>= 1-(1-0.05)¹²</p><p>= 0.46</p><p>So there’s a <strong>46%</strong> chance of seeing <strong>at least one </strong>month with a significant result, given our null is true.</p><p>Now let’s ask, <strong>for each individual test</strong> (<em>comparing each of the 12 months’ absolute AMVR to the mean</em>)<strong> how many significant values should we expect to see amongst our random, non-seasonal data?</strong></p><p>12 x 0.05 = 0.6</p><p>So with a 0.05 significance level we should expect a false positive rate of<strong> 0.6</strong>. In other words, for each test (with the null data) comparing all 12 months’ AMVR to the mean, 0.6 months will show a significant result (<em>obviously we can’t have less than 1 month showing a result, but under repeated testing the math should tend towards this number</em>).</p><p>All the way through this work, we have stressed the importance of being really clear about the question we are trying to answer. The problem with the expectations we’ve just calculated is that <strong>we have assumed we are testing for a significant result against all 12 months! </strong>That’s why the probability of seeing at least one false positive is so high, at 46%.</p><p>This is an example of <a href="https://www.quantopian.com/lectures/p-hacking-and-multiple-comparisons-bias"><strong>multiple comparison bias</strong></a><strong>, </strong>where we have expanded our search space and increased the likelihood of finding a significant result.<strong> </strong>This is a problem because we could abuse this phenomenon to cherry-pick the parameters of our model which give us the ‘desired’ p-value.</p><blockquote>This is the essence of p-hacking.</blockquote><p>This xkcd nicely highlights the issue with multiple comparison bias and its subsequent abuse, p-hacking.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/540/0*zZgRnBSbwUN3b6D5.png" /><figcaption>Whoa! Green Jelly beans are significant</figcaption></figure><p>To illustrate the effect of p-hacking and how to reduce multiplicity, we need to understand the subtle but significant difference between the following 2 questions:</p><ol><li>“What’s the probability that <strong>December </strong>would appear this extreme by chance?”</li><li>“What’s the probability <strong>any month</strong> would appear this extreme by chance?”</li></ol><p>The beauty of simulation lies in its simplicity. The following code is all we need to compute the p-value to answer the 1st question. We simply count how many values in our dataset of all 12000 AMVR deviations (12 months x 1000 trials) are greater than the observed December value. 
We get a <strong>p-value of 4.4%, </strong>close to our somewhat arbitrary 5% cut off, but <strong>significant </strong>none the less.</p><pre><strong>#count number of months in sim data where abs AMVR is &gt; Dec<br>#Here we are comparing against ALL months</strong><br>count=0<br>for i in abs_all_months:<br>    if i&gt; Dec_value:<br>        count+=1<br>ans = count/len(abs_all_months)        <br>print(&#39;p-value:&#39;, ans )</pre><p>p-value: 0.044</p><p>To answer the 2nd question and to avoid multiplicity, instead of comparing our result to the distribution made with all 12000 AMVR deviations, we only consider the <strong>highest </strong>value from each of the absolute AMVR 1000 trials. This gives a <strong>p-value of 23%</strong>, very much <strong>not significant!</strong></p><pre><strong>#same again but just considering highest abs AMVR for each of 1000 trials</strong><br>count=0<br>for i in abs_highest:<br>    if i&gt; Dec_value:<br>        count+=1<br>ans = count/len(abs_highest)        <br>print(&#39;p-value:&#39;, ans )</pre><p>p-value: 0.234</p><p>Now lets plot these distributions.</p><pre><strong>#calculate 5% significance </strong><br>abs_all_months_95 = np.quantile(abs_all_months,.95)<br>abs_highest_95 = np.quantile(abs_highest,.95)</pre><pre><strong>#plot the answer to Q1 in left column and Q2 in right column</strong></pre><pre>fig, ((ax1,ax2),(ax3,ax4)) = plt.subplots(2,2,sharex=&#39;col&#39;,figsize=(12,12))</pre><pre><strong>#plot 1</strong><br>ax1.hist(abs_all_months,histtype=&#39;bar&#39;)<br>ax1.set_title(&#39;AMVR all months&#39;)<br>ax1.set_ylabel(&#39;Frequency&#39;)<br>n,bins,patches = ax3.hist(abs_all_months,density=1,histtype=&#39;bar&#39;,cumulative=True,bins=30)<br>ax3.set_ylabel(&#39;Cumulative probability&#39;)<br>ax1.axvline(Dec_value,color=&#39;b&#39;,label=&#39;Dec Result&#39;)<br>ax3.axvline(Dec_value,color=&#39;b&#39;)<br>ax3.axvline(abs_all_months_95,color=&#39;r&#39;,ls=&#39;--&#39;,label=&#39;5% Sig level&#39;)</pre><pre><strong>#plot2</strong><br>ax2.hist(abs_highest,histtype=&#39;bar&#39;)<br>ax2.set_title(&#39;AMVR highest only&#39;)<br>ax2.axvline(Dec_value,color=&#39;b&#39;)<br>n,bins,patches = ax4.hist(abs_highest,density=1,histtype=&#39;bar&#39;,cumulative=True,bins=30)<br>ax4.axvline(Dec_value,color=&#39;b&#39;)<br>ax4.axvline(abs_highest_95,color=&#39;r&#39;,ls=&#39;--&#39;)</pre><pre>ax1.legend()<br>ax3.legend()</pre><p>The left column is the data that answers question 1 and the right column, question 2. The top row are the probability distributions and the bottom row are the CDF. The red dashed line is the 5% significance level that we arbitrarily decided upon. The blue line is the original extreme December AMVR value of 1.23.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QkvI0N643QJETMrYSk8frg.png" /><figcaption>probability distributions for Q1 and Q2</figcaption></figure><p>The left side plot shows that the original December value is significant at a 5% level, but only just! 
However, when we account for multiple comparison bias, in the right hand plot, the threshold for significance moves up from around 1.2 (abs AMVR) to around 1.6 (see the red line).</p><blockquote><strong>By accounting for multiple comparison bias, our December value of 1.23 is no longer significant!</strong></blockquote><p>By taking into consideration the specific question we are trying to answer and avoiding multiple comparison bias, we have avoided p-hacking our model and avoided showing a significant result when there isn’t one.</p><p>To further explore p-hacking and how it can be abused to tell a particular story about our data, see this great interactive app from FiveThirtyEight…</p><p><a href="https://fivethirtyeight.com/features/science-isnt-broken/#part2">Science Isn&#39;t Broken</a></p><h4>Conclusions</h4><p>We have learnt that hypothesis testing is not the big scary beast we thought it was. Simply follow the 3 steps above to construct your model for any kind of data or test statistic.</p><p>We’ve shown that asking the right question is vital for scientific analysis. A slight change in the wording can lead to a very different model with very different results.</p><p>We discussed the importance of recognizing and correcting for multiple comparison bias and avoiding the pitfalls of p-hacking, and showed how a seemingly significant result can become non-significant.</p><p>With more and more ‘big data’, along with academic pressure to produce a research paper with ‘novel’ findings or political pressure to show a result as being ‘significant’, the temptation for p-hacking is ever increasing. By learning to recognise when we are guilty of it and correcting for it accordingly, we can become better researchers and ultimately produce more accurate and therefore actionable scientific results!</p><p>Author’s Notes: <em>Our results differ slightly from the original </em><a href="https://www.winton.com/research/seasonal-volatility-and-the-multiplicity-effect"><em>Winton research;</em></a><em> this is due in part to having a slightly different data set (32yrs vs 87yrs), and they have October as the month of interest whereas we have December. Also, they used an undisclosed method for their ‘simulated data’, whereas we have made our methodology for creating that data explicit through code. We have made certain modeling assumptions throughout this work; again, these have been made explicit and can be seen in the code. These design and modeling choices are part of the scientific process; so long as they are made explicit, the analysis has merit.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*D0_bUZNtw_3aq_CyMqj0dw.gif" /></figure><p>Hope you found this useful and interesting. Follow to be notified of my latest posts!</p><p>Follow <a href="http://twitter.com/pdquant">twitter.com/pdquant</a> for more!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1a0da3064b8a" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[All the Backpropagation derivatives]]></title>
            <link>https://medium.com/@pdquant/all-the-backpropagation-derivatives-d5275f727f60?source=rss-55da8ebe8a7f------2</link>
            <guid isPermaLink="false">https://medium.com/p/d5275f727f60</guid>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[calculus]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Patrick David]]></dc:creator>
            <pubDate>Thu, 07 Jun 2018 19:06:13 GMT</pubDate>
            <atom:updated>2019-03-14T15:57:46.891Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/781/1*BbnX1QQ5Qisdyg8ufSVfsg.gif" /></figure><p>So you’ve completed Andrew Ng’s <a href="http://www.coursera.org/specializations/deep-learning">Deep Learning course</a> on Coursera,</p><p>You know that <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=1&amp;cad=rja&amp;uact=8&amp;ved=0ahUKEwi4ycLk-b_bAhWCDcAKHfdfCN0QtwIIJzAA&amp;url=https%3A%2F%2Fwww.coursera.org%2Flearn%2Fneural-networks-deep-learning%2Flecture%2FMijzH%2Fforward-propagation-in-a-deep-network&amp;usg=AOvVaw0u4bgTyJGhdxdZlr4EVT8x">ForwardProp </a>looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/205/1*PjrDe-iJAGOJ3Lo0WnDZcQ.png" /><figcaption>Forwardpropagation Equations</figcaption></figure><p>And you know that <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=3&amp;cad=rja&amp;uact=8&amp;ved=0ahUKEwjE5oD4-b_bAhWHI8AKHbeVCxkQFggvMAI&amp;url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FBackpropagation&amp;usg=AOvVaw3VzcK_f9TmfUd9PbbMS5tb">Backprop </a>looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/476/1*ohfvtAN6AOV8mhBwr4PApQ.png" /><figcaption>Backprop Equations</figcaption></figure><p>But do you know how to derive these formulas?</p><h3>TL;DR</h3><p><strong><em>Full derivations of all Backpropagation derivatives used in Coursera Deep Learning, using both chain rule and direct computation.</em></strong></p><p>If you’ve been through backpropagation and not understood how results such as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/86/1*UMDs_vXLrbQiM78yZGav9w.png" /><figcaption>The derivative of our linear function - dz</figcaption></figure><p>and</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/320/0*O5wMJE1HH8orsug4" /><figcaption>derivative of Cost w.r.t activation ‘a’</figcaption></figure><p>are derived, if you want to understand the direct computation as well as simply using chain rule, then read on…</p><h3>Our Neural Network</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/601/1*1AgQfh5ueWaV0-htUFQrOg.png" /><figcaption>Neural Net taken from Coursera Deep Learning.</figcaption></figure><p>This is the simple <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=2&amp;cad=rja&amp;uact=8&amp;ved=0ahUKEwjK2pKigcDbAhUpIMAKHRUQDywQFgg1MAE&amp;url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FArtificial_neural_network&amp;usg=AOvVaw0CMek9gs4Tdr7H_YcH9z1H">Neural Net</a> we will be working with, where x,W and b are our inputs, the “z’s” are the linear function of our inputs, the “a’s” are the (<a href="https://en.wikipedia.org/wiki/Sigmoid_function">sigmoid</a>) activation functions and the final</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/102/1*4-WXVWPxMbkS5kVMlnZ2rg.png" /><figcaption>Cross Entropy cost function</figcaption></figure><p>is our <a href="https://math.stackexchange.com/questions/1074276/how-is-logistic-loss-and-cross-entropy-related">Cross Entropy or Negative Log Likelihood </a>cost function.</p><p>So here’s the plan, we will work backwards from our cost function</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/602/1*tIuzCy1NspdXVJkyl7FsYA.png" /><figcaption>Our cost function</figcaption></figure><p>and compute directly, the derivative of</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/102/1*4-WXVWPxMbkS5kVMlnZ2rg.png" /><figcaption>Cross Entropy 
cost function</figcaption></figure><p>with respect to (<em>w.r.t</em>) each of the preceding elements in our Neural Network:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/212/1*9nTTDAgPgYsuXO4DGAKyFA.png" /><figcaption>The derivatives of L(a,y) w.r.t each element in our NN</figcaption></figure><p>As well as computing these values <em>directly</em>, we will also show the <em>chain rule </em>derivation as well.</p><p><strong># Note: we don’t differentiate our input ‘X’ because these are fixed values that we are given and therefore don’t optimize over.</strong></p><h3>[1] Derivative w.r.t activation function</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/44/1*DltYsgL0XYevOzZssqi5jw.png" /><figcaption>[1] derivative of our activation function</figcaption></figure><p>So to start we will take the derivative of our cost function</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/102/1*4-WXVWPxMbkS5kVMlnZ2rg.png" /></figure><p>w.r.t the activation function</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/53/1*PX_yILBXYhgkMn21D0-Gew.png" /><figcaption>Activation function 2</figcaption></figure><p>So we are taking the derivative of the Negative log likelihood function (Cross Entropy) , which when expanded looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/503/1*b7PplEufrAWKj33dYZSQug.png" /><figcaption>Taking derivative of our cost function</figcaption></figure><p>First lets move the minus sign on the left of the brackets and distribute it inside the brackets, so we get:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/447/0*SB1t4U8rMugokzYH" /><figcaption>distribute minus sign</figcaption></figure><p>Next we differentiate the left hand side:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/77/0*aoHxqwUu420XKTPc" /><figcaption>l.h.s</figcaption></figure><p>The right hand side is more complex as the derivative of ln(1-a) is not simply 1/(1-a), we must use chain rule to multiply the derivative of the inner function by the outer.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/352/1*v8AiMXAwJTfjKbkuYyQvug.png" /><figcaption>the derivative of a log</figcaption></figure><p>The derivative of (1-a) = -1, this gives the final result:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/382/0*Wu1IzeBhz9qHxOTC" /><figcaption>derivative of L w.r.t activation ‘a’</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/320/0*QjgVAf3bbE3Jr5t6" /><figcaption>final result</figcaption></figure><p>And the proof of the derivative of a log being the inverse is as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/479/1*MLpNREFGLiLZO-jNsW1BrA.png" /><figcaption>proof for the derivative of a log</figcaption></figure><h3>[2] Derivative of sigmoid</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/44/1*H5S49BM3FLX6bM-kdRE5xw.png" /><figcaption>[2] derivative of sigmoid</figcaption></figure><p>It is useful at this stage to compute the derivative of the sigmoid activation function, as we will need it later on.</p><p>our logistic function (sigmoid) is given as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/64/1*jgS1pQwkbqhzQO_pIa8nCw.png" /><figcaption>Sigmoid (Logistic) function</figcaption></figure><p>First is is convenient to rearrange this function to the following form, as it allows us to use the chain rule to differentiate:</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/143/1*aYP4xhbyl5_9xexleubXFw.png" /><figcaption>Rearranged sigmoid function</figcaption></figure><p>Now using chain rule: multiplying the outer derivative by the inner, gives</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/330/1*kWMWFPIcQ_jysiojAZQdLA.png" /><figcaption>outer derivative x inner derivative</figcaption></figure><p>which rearranged gives</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/89/1*Qsz4EhQsd_2-oVJUqIXpiQ.png" /><figcaption>put RHS over LHS</figcaption></figure><p>Here’s the clever part. We can then separate this into the product of two fractions and with a bit of algebraic magic, we add a ‘1’ to the second numerator and immediately take it away again:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/196/1*EBgQhSoMbHgddm3toweXNw.png" /><figcaption>add a ‘1’ and subtract a ‘1’ on RHS</figcaption></figure><p>The RHS then simplifies to</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/137/1*HwKjqROO67Kd9alEYJBsAQ.png" /></figure><p>Which is nothing more than</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/155/1*uAjvhR9RqGvxzWadbboiZg.png" /><figcaption>1 minus our sigmoid</figcaption></figure><p>Which gives a final result of</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/260/1*ZDqKPbX4r3CDOg5sQHIF8A.png" /></figure><p>Or alternatively:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/109/1*vMnOX25lIxz0HhBmtQKaxw.png" /><figcaption>This notation will be easier</figcaption></figure><h3>[3] Derivative w.r.t linear function</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/44/1*aflWeoRQggpkanuV8Ll_vA.png" /><figcaption>[3] derivative of our linear function (z = wX + b)</figcaption></figure><p>To get this result we can use chain rule by multiplying the two results we’ve already calculated [1] and [2]</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/200/1*3SVjbCwFogIv2r8tVQOlEw.png" /><figcaption>multiply derivative [1] by derivative [2]</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/390/1*-s_gqPJoTbIBS8HVkrgBgw.png" /><figcaption>der[1] x der[2]</figcaption></figure><p>So if we can get a common denominator in the left hand of the equation, then we can simplify the equation, so lets add ‘(1-a)’ to the first fraction and ‘a’ to the second fraction</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/565/1*fgDN5qv4O-Aah_e7S9oCBg.png" /><figcaption>add ‘(1–a)’ and ‘a’ to get common denominator</figcaption></figure><p>with a common denominator we can simplify to</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/550/1*Fxnz9dbvaQIgtoxKEG4DQQ.png" /><figcaption>common denominator</figcaption></figure><p>now we multiply LHS by RHS, the a(1-a) terms cancel out and we are left with just the numerator from the LHS!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/361/1*F_zz8QMZKnPklTb4M35jCw.png" /><figcaption>the remaining numerator</figcaption></figure><p>which if we expand out gives:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/277/1*sikT7LfnDaY6ugSRoJItYA.png" /><figcaption>expanded out</figcaption></figure><p>note that ‘ya’ is the same as ‘ay’, so they cancel to give</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/110/1*Ka9uzbmSgBCY9SYata0uMw.png" /></figure><p>which rearranges to give our final result of the derivative</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/44/1*aflWeoRQggpkanuV8Ll_vA.png" /><figcaption>[3]</figcaption></figure><p>our final result is</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/89/1*x1CdzcXePWPbe_UDFesVJQ.png" /><figcaption>derivative of our linear function (z = wX +b)</figcaption></figure><h3>[4] Derivative w.r.t weights</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/53/1*QRtUbub_wbV50A6MIUr5zQ.png" /><figcaption>[4] derivative of linear func ‘z’ w.r.t weights ‘w’</figcaption></figure><p>This derivative is trivial to compute, as z is simply</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/151/1*SzgaoykS7unrnn4_DcEGXg.png" /><figcaption>linear function ‘z’</figcaption></figure><p>and the derivative simply evaluates to</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/22/1*nLzvPtvLt3ousFBXhOBo-w.png" /><figcaption>derivative of ‘z’ w.r.t ‘w’</figcaption></figure><h3>[5] Derivative w.r.t weights (2)</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/53/1*ZldlktnFepaqaKngd3mbXg.png" /><figcaption>[5] derivative of cost func w.r.t weights ‘w’</figcaption></figure><p>This derivative can be computed <strong>two different ways!</strong> We can use <strong>chain rule </strong>or <strong>compute directly</strong>. We will do both as it provides a great intuition behind backprop calculation.</p><p>To use chain rule to get derivative [5] we note that we have already computed the following</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/181/1*_PD-ma_eeMW9F7Myfw1WwA.png" /><figcaption>previously computed</figcaption></figure><p>Noting that the product of the first two equations gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/44/1*aflWeoRQggpkanuV8Ll_vA.png" /></figure><p>if we then continue using the chain rule and multiply this result by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/53/1*QRtUbub_wbV50A6MIUr5zQ.png" /></figure><p>then we get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/98/1*a1HWkDKLTPvplVFKruszyg.png" /></figure><p>which is nothing more than</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/135/1*8kFYcl9B92Q8sSzLl3rt7A.png" /><figcaption>The final result for ‘dw’</figcaption></figure><p>or written out long hand</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/288/1*RjvusiVI271tcVOG3r2iVQ.png" /><figcaption>chain rule result for ‘dw’</figcaption></figure><p>So that’s the ‘<em>chain rule way</em>’. 
Now lets compute ‘dw’ <em>directly</em>:</p><p>To compute <a href="https://math.stackexchange.com/questions/2200339/why-are-terms-flipped-in-partial-derivative-of-logistic-regression-cost-function?noredirect=1&amp;lq=1"><strong>directly</strong></a>, we first take our cost function</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/500/1*yastBIEP77h5PxFhuLyyjA.png" /><figcaption>Cross Entropy cost function</figcaption></figure><p>We can notice that the first log term ‘ln(a)’ can be expanded to</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/215/1*rnCRIgDtgF9ff0qbfaS4zQ.png" /><figcaption>expanding ‘ln(a)’</figcaption></figure><p>Which simplifies to:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/205/1*23svBASU1mfPyRo5FohY6g.png" /></figure><p>And if we take the second log function ‘ln(1-a)’ which can be shown as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/181/1*VaNR87Ss6UKTYwzyTVtXBg.png" /><figcaption>ln(1-a)</figcaption></figure><p>taking the log of the numerator ( we will leave the denominator) we get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/299/1*u-0aAseYLJaB3zDeGqcQ-A.png" /><figcaption>log of the numerator</figcaption></figure><p>This result comes from the <a href="https://people.richland.edu/james/lecture/m116/logs/properties.html">rule of logs</a>, which states: log(p/q) = log(p) — log(q).</p><p>Plugging these formula back into our original cost function we get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/698/1*od0TgKCRaIB_rbL0rmqkUA.png" /><figcaption>plugged back into cost function</figcaption></figure><p>Expanding the term in the square brackets we get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/698/1*Ni3nw777S10JOOcuRz-FQA.png" /><figcaption>terms inside bracket expanded</figcaption></figure><p>The first and last terms ‘yln(1+e^-z)’ cancel out leaving:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/290/1*d2XoNY3_WQaqkabOGjrp3w.png" /></figure><p>Which we can rearrange by pulling the ‘yz’ term to the outside to give</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/288/1*eErHdOn075ketIbWaDfDYw.png" /></figure><p><a href="https://math.stackexchange.com/questions/477207/derivative-of-cost-function-for-logistic-regression">Here’s where it gets interesting</a>, by adding an exp term to the ‘z’ inside the square brackets and then immediately taking its log</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/348/1*INdtRwDXab6DGWAaxcUHmw.png" /><figcaption>we exponentiate ‘e^z’ then take its log</figcaption></figure><p>next we can take advantage of the rule of sum of logs: ln(a) + ln(b) = ln(a.b) combined with <a href="https://www.rapidtables.com/math/number/exponent.html">rule of exp</a> products:e^a * e^b = e^(a+b) to get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/312/0*vibld17a0decPWpl" /><figcaption>ln(a) + ln(b) = ln(a.b)</figcaption></figure><p>followed by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/216/1*b5e7hQQhaUmyJ5nhge5ycQ.png" /><figcaption>add e^(z +-z)</figcaption></figure><p>Pulling the ‘yz’ term inside the brackets we get :</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/214/1*4Ssc6ZC5jIJ6frK_paHjSA.png" /></figure><p>Finally we note that z = Wx+b therefore taking the derivative w.r.t W:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/254/1*QY9qWlmyHomXLFHM5JjuDA.png" /><figcaption>take derivative w.r.t 
W</figcaption></figure><p>The first term ‘yz ’becomes ‘yx ’and the second term becomes :</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/138/1*Esa8tUE6m8OopzwqZnAoMQ.png" /><figcaption>taking derivative of logs again</figcaption></figure><p>Note that the 2nd term is nothing but</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/158/0*jSx2tE0QrSiegCZt" /></figure><p>Which gives a final result of</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/240/1*HLj5aJvLStxxQ6hKmdmuhA.png" /></figure><p>We can rearrange by pulling ‘x’ out to give</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/202/1*F0VbwuBH2BnbEWen2c-cng.png" /></figure><p>which gives</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/202/1*idsOS879a975Ph6lmRJ0Hw.png" /><figcaption>final result</figcaption></figure><h3>[6] derivative w.r.t bias</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/45/1*EzAUoYXl4s3WBzweRoPH5w.png" /><figcaption>[6] derivative w.r.t bias b</figcaption></figure><p>Again we could use <strong>chain rule</strong> which would be</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/124/1*3PzykxuCGOJtSKXxdaPc6A.png" /><figcaption>chain rule for ‘db’</figcaption></figure><p>This is easy to solve as we already computed ‘dz’ and the second term is simply the derivative of ‘z’ which is ‘wX +b’ w.r.t ‘b’ which is simply 1!</p><p>so the derivative w.r.t b is simply</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/44/1*aflWeoRQggpkanuV8Ll_vA.png" /></figure><p>which we already calculated earlier as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/89/1*x1CdzcXePWPbe_UDFesVJQ.png" /><figcaption>derivative of our linear function (z = wX +b)</figcaption></figure><p>For completeness we will also show how to calculate ‘db’ <strong>directly</strong>. To calculate this we will take a step from the above calculation for ‘dw’, (from just before we did the differentiation)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/214/1*4Ssc6ZC5jIJ6frK_paHjSA.png" /><figcaption>note: z = wX + b</figcaption></figure><p>remembering that z = wX +b and we are trying to find derivative of the function w.r.t b, if we take the derivative w.r.t b from both terms ‘yz’ and ‘ln(1+e^z)’ we get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/573/0*VgWLH92SBmRefdWY" /><figcaption>note the parenthesis</figcaption></figure><p>its important to note the parenthesis here, as it clarifies how we get our derivative.</p><p>Taking the LHS first, the derivative of ‘wX’ w.r.t ‘b’ is zero as it doesn’t contain b! 
The derivative of ‘b’ is simply 1, so we are just left with the ‘y’ outside the parenthesis.</p><p>For the RHS, we do the same as we did when calculating ‘dw’, except this time when taking the derivative of the inner function ‘e^wX+b’ we take it w.r.t ‘b’ (instead of ‘w’), which gives the following result (this is because the derivative of the exponent w.r.t ‘b’ evaluates to 1):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/183/0*cb4NJfkuwH4sTYHT" /><figcaption>derivative of ln(1+e^wX+b)</figcaption></figure><p>This term is simply our original</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/22/0*26FBFOlqMNtPj859" /></figure><p>so putting the whole thing together we get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/117/0*gKj2h5-ZDBFlawt1" /><figcaption>final result</figcaption></figure><p>which we have already shown is simply ‘dz’!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/44/1*aflWeoRQggpkanuV8Ll_vA.png" /><figcaption>‘db’ = ‘dz’</figcaption></figure><p>So that concludes all the derivatives of our Neural Network. <strong>We have calculated all of the following:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/212/1*9nTTDAgPgYsuXO4DGAKyFA.png" /><figcaption>The derivatives of L(a,y) w.r.t each element in our NN</figcaption></figure><h3>Wrapping up</h3><p>And what about the result:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/467/1*qGRGLI_8rT8gz9gIGWLyaQ.png" /></figure><p>Well, we can unpack the chain rule to explain:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/358/1*XfxKWSUDWtEtpRMkkUfaFQ.png" /><figcaption>‘dz’ using chain rule</figcaption></figure><p>Note that the term</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/125/1*Kwvases1XeQ7LbO4LrQGig.png" /></figure><p>is simply ‘dz’, the term we calculated earlier:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/44/1*aflWeoRQggpkanuV8Ll_vA.png" /></figure><p>and the term</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/73/1*86v4ry31gCoIC1KAIoX4LQ.png" /></figure><p>evaluates to W[l], or in other words, the derivative of our linear function Z = ‘Wa + b’ w.r.t ‘a’ equals ‘W’.</p><p>and finally the term in blue</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/75/1*8XKwzs1j6A6ySKPuKWcS-g.png" /></figure><p>is simply</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/44/1*H5S49BM3FLX6bM-kdRE5xw.png" /><figcaption>[2] derivative of sigmoid</figcaption></figure><p>‘da/dz’, the derivative of the sigmoid function that we calculated earlier!</p><p>As a final note on the notation used in the Coursera Deep Learning course: in the result</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/458/1*3FhcaGe98ZyPCGHMbUXF8Q.png" /></figure><p>we perform element-wise multiplication between DZ and g’(Z); this is to ensure that all the dimensions of our matrix multiplications match up as expected.</p><h3>So there we have it…</h3><p>… all the derivatives required for backprop as shown in Andrew Ng’s Deep Learning course.</p>
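<p>As a quick sanity check on the results above, here is a small numpy sketch. It is a single logistic unit with made-up numbers (not the full Coursera network): it computes da, dz, dw and db from the formulas we derived, and compares dw against a finite-difference estimate of the cost:</p><pre>import numpy as np<br><br>def sigmoid(z):<br>    return 1 / (1 + np.exp(-z))<br><br>def loss(a, y):<br>    return -(y * np.log(a) + (1 - y) * np.log(1 - a))<br><br><strong>#toy values for a single training example (illustration only)</strong><br>x, y = 1.5, 1.0<br>w, b = 0.3, -0.1<br><br>z = w * x + b<br>a = sigmoid(z)<br><br><strong>#derivatives from the derivations above</strong><br>da = -y / a + (1 - y) / (1 - a)   #result [1]<br>dz = a - y                        #result [3]<br>dw = dz * x                       #result [5]<br>db = dz                           #result [6]<br><br><strong>#finite-difference check on dw</strong><br>eps = 1e-6<br>a_plus = sigmoid((w + eps) * x + b)<br>a_minus = sigmoid((w - eps) * x + b)<br>dw_numeric = (loss(a_plus, y) - loss(a_minus, y)) / (2 * eps)<br><br>print(&#39;analytic dw:&#39;, dw, &#39;numeric dw:&#39;, dw_numeric)</pre>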
<p>Simply reading through these calculus calculations (<em>or any others for that matter</em>) won’t be enough to make it stick in your mind. The best way to learn is to lock yourself in a room and <strong>practice, practice, practice!</strong></p><h3>What next?</h3><blockquote>If you got something out of this post, please share it with others who may benefit, follow me <a href="https://medium.com/u/55da8ebe8a7f">Patrick David</a> for more ML posts or on twitter @pdquant, and give it a cynical/pity/genuine round of <strong>applause</strong>!</blockquote><h4>Stocks Significance Testing &amp; p-Hacking</h4><p><a href="https://medium.com/@pdquant/stocks-significance-testing-p-hacking-how-volatile-is-volatile-1a0da3064b8a">Stocks, Significance Testing &amp; p-Hacking: How volatile is volatile?</a></p><h4>Build a Bit(Cointegration) Backtester</h4><p><a href="https://medium.com/@pdquant/build-a-bitcoin-tegration-backtester-83e2b19125fd">Build a BitCoin(tegration) Backtester</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d5275f727f60" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>