<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Patrick David on Medium]]></title>
        <description><![CDATA[Stories by Patrick David on Medium]]></description>
        <link>https://medium.com/@pdquant?source=rss-55da8ebe8a7f------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*Ir1bzCuraZZV4KJjOxUyWA@2x.jpeg</url>
            <title>Stories by Patrick David on Medium</title>
            <link>https://medium.com/@pdquant?source=rss-55da8ebe8a7f------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 11 Apr 2026 12:06:24 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@pdquant/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Stop using ChatGPT]]></title>
            <link>https://medium.com/@pdquant/stop-using-chatgpt-bdb059c806b3?source=rss-55da8ebe8a7f------2</link>
            <guid isPermaLink="false">https://medium.com/p/bdb059c806b3</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[chatgpt]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[productivity]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Patrick David]]></dc:creator>
            <pubDate>Thu, 19 Jun 2025 16:56:35 GMT</pubDate>
            <atom:updated>2025-06-19T16:56:35.603Z</atom:updated>
            <content:encoded><![CDATA[<p>… There is a better way …</p><h3>Welcome to GPT Island 🏝️</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ud7RzNyf18yp6o_QwmW1ng.jpeg" /><figcaption><a href="http://gptisland.com">gptisland.com</a></figcaption></figure><h3>Stop Using ChatGPT - Start Using GPT Island 🏝️.</h3><p>Let’s be honest, ChatGPT is an amazing AI but a not-so-amazing UI — tab switching, clunky interface, waiting around for answers, no control over context memory, unpredictable token limits, increasingly cluttered UI, breaking your flow with constant context shift …</p><h3><em>… there is a better way</em></h3><p>It’s called <strong>GPT Island </strong>🏝️ it keeps all the best of ChatGPT and solves all the aforementioned problems.</p><p><strong>You will never use ChatGPT.com again.</strong></p><h3>So… what is GPT Island?</h3><p>GPT Island 🏝️ is a beautiful Chrome extension that brings the power of ChatGPT (and DeepSeek) directly to <strong>every webpage</strong>. It places a beautiful chat UI at the bottom of every page with a set of unique features to make interacting with ChatGPT a joy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*w5cS5kFYKnwNIUPaLgO6zA.jpeg" /><figcaption>GPT Island in action</figcaption></figure><h3>Why GPT Island 🏝️ is Better Than ChatGPT</h3><h3>🖼️ 1. Persistent UI — Wherever You Are</h3><p>GPT Island 🏝️ adds a beautifully designed, intuitive chat interface right onto every webpage. Reading an article? studying for exams? Doing research? Just start typing — GPT Island🏝 is <em>always there</em>, without interrupting your flow.</p><h3>🔄 2. Tab Sync That Actually Works</h3><p>Stop switching back and forth from your work to you ChatGPT tab, GPT Island 🏝️syncs your conversations <strong>across tabs</strong>. Jump between tasks without losing your place, your chat, or your thoughts.</p><h3>🧠 3. Selective Memory — You’re in Charge</h3><p>GPT Island 🏝️ introduces a <strong>unique memory control</strong> feature. You choose what gets remembered – Have you ever got frustrated with ChatGPT not remembering that first prompt you gave? Ever wanted ChatGPT to ‘forget’ a particular prompt or response? GPT Island solves this problem – simply click the 🏝️ icon on any message and it gets added to your context memory. How many tokens are in memory? we got you! GPT Island shows your context memory token count in the memory button. want to clear context memory? Just click the memory button!</p><h3>💾 4. Local Chat Storage</h3><p>Your conversations are stored <strong>locally</strong> — not in the cloud, not on someone else’s server. That means full control over your data and peace of mind.</p><h3>🚀 5. DeepSeek Access</h3><p>GPT Island 🏝️ doesn’t limit you to ChatGPT. You also get access to <strong>DeepSeek</strong>, a powerful, high-quality reasoning model for nuanced, accurate reasoning — perfect for research, writing, and ideation.</p><h3>💬 6. Huge 5 Million Tokens Per Month</h3><p>That’s right. You get up to <strong>5 million tokens per month</strong>. More context. More memory. More capability. No more worrying about running out of tokens during complex tasks. You can get started with our free option and 5000 tokens.</p><h3>💬 7. Bring your own API key option — BYOAK</h3><p>Already have an openai chat completions <strong>API key</strong> ? 
with GPT Island you can have the best of both worlds — ChatGPT’s powerful language models and GPT Islands 🏝️beautiful UI.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*R5QEY63NAC0KCcJXUi6sjg.png" /><figcaption>GPT Island</figcaption></figure><h3>Who is GPT Island 🏝️ For?</h3><ul><li><strong>Writers</strong> looking for in-context assistance while drafting</li><li><strong>Developers</strong> needing fast AI help while coding</li><li><strong>Researchers</strong> analyzing articles and data</li><li><strong>Anyone</strong> who wants ChatGPT — but <em>beautiful</em>, more flexible, more powerful and more embedded into daily workflows.</li></ul><h3>The Bottom Line</h3><p>GPT Island 🏝️ turns ChatGPT from a tool you <em>visit</em> into a tool that <em>lives with you online</em>. It’s faster, more beautiful , more powerful — and it gives you the control you’ve always wanted. you’ll never go back 😎</p><p>So, stop using ChatGPT like it’s 2022.<br> <strong>Start using GPT Island </strong>🏝️<strong>.</strong></p><p>visit <a href="http://gptisland.com">gptisland.com</a> for more details.</p><p><em>Ready to try it?</em><br> 👉 <a href="https://chromewebstore.google.com/detail/ooohpemfdkijlakokkfliajibiiiagnm">Install GPT Island </a>🏝️ <a href="https://chromewebstore.google.com/detail/ooohpemfdkijlakokkfliajibiiiagnm">from the Chrome Web Store</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bdb059c806b3" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Build a BitCoin(tegration) Backtester]]></title>
            <link>https://medium.com/@pdquant/build-a-bitcoin-tegration-backtester-83e2b19125fd?source=rss-55da8ebe8a7f------2</link>
            <guid isPermaLink="false">https://medium.com/p/83e2b19125fd</guid>
            <category><![CDATA[cryptocurrency]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[blockchain]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[Patrick David]]></dc:creator>
            <pubDate>Tue, 19 Feb 2019 11:02:30 GMT</pubDate>
            <atom:updated>2019-02-19T11:02:30.722Z</atom:updated>
            <content:encoded><![CDATA[<h4>Learn the statistical technique of Cointegration and build your own crypto backtester to create and test a quantitative trading strategy.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/896/1*VVPMMaZUg77wb_Qnwb3rdA.gif" /></figure><h4>This tutorial is in 2 parts — (<strong>you can run the backtester as a separate standalone module</strong>) :</h4><ol><li>Learn the Statistical technique of Cointegration.</li><li>Build a Bitcoin Backtesting engine using Python to analyze the performance of a Cointegration based trading strategy.</li></ol><p><strong>Just want the code? </strong><a href="https://github.com/Patrick-David/BitcoinBacktester">click here</a>.</p><h4>What are we building</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/1*KBf6VUiI0tvfIU5l_ow98Q@2x.jpeg" /></figure><p>We are going to build a python based event-driven backtester that pulls 2 crypto securities <strong>Bitcoin </strong>(<strong>BTC</strong>)and <strong>Bitcoin Cash</strong> (<strong>BCH</strong>) from an API, passes it through a trading strategy that uses the mean reverting cointegration spread between the 2 securities and generates buy/sell signals when the spread hits ± 1 stdev. We then send these signals to the <em>Portfolio </em>class which handles the logic of the backtester. One time stamp will be pulled and processed at a time, allowing us to see what would have happened tick-by-tick. Finally we print the results to console (or jupyter notebook) and print out the PnL (profit and loss).</p><h3>Cointegration</h3><p>roadmap:</p><ul><li>what is time series</li><li>model assumptions</li><li>why does this happen</li><li>what is stationarity</li><li>orders of integration</li><li>cointegration</li></ul><h4><strong>Time Series</strong></h4><p>To understand Cointegration we first need to look at time series.</p><p>In cross sectional (non time series) regression models, if we are trying to predict some output value ‘Y’ we would have (one or more) corresponding input feature values ‘X’, we would learn some mapping between X and Y using something like least squares regression on a training set, then assess the performance on a test set. All very straight forward.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/470/1*BeS9sPwTuZFWucJpV4qhSw.png" /><figcaption>Regular (cross sectional) multiple regression model</figcaption></figure><p>In the case of time series, instead of using exogenous features ‘X’ we can use lags of the target output ‘Y’ in the form:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/371/1*JIXbWTkAmUolp9mH_pb9Rw.png" /><figcaption>Basic AR (autoregressive) time-series model</figcaption></figure><p>In simple terms, our predictor for today&#39;s observation is yesterdays observation. This is known as an Autoregressive model or AR model.</p><h4><strong>Model Assumptions</strong></h4><p>Most statistical models and techniques make the (often unrealistic) assumption of the input data being iid (independent and identically distributed); each data point being independent and all drawn from the same probability distribution.</p><p>For regression models we make the following assumptions:</p><ul><li><strong>Independence</strong> —<strong>Pr [ rank ⁡ ( X ) = p ] = 1.</strong> The input features (X) are statistically independent from each other. 
A full rank feature matrix.</li><li><strong>Linearity </strong>— <strong><em>y=b0 +b1x1…btXt.</em></strong> The relationship between dependent and independent variables is linear.</li><li><strong>Homoscedasticity </strong>— <strong>E[ <em>εi</em>² | <em>X</em> ] = <em>σ</em>². </strong>The variance of the errors is constant.</li><li><strong>Normality </strong>— <strong>ε ∣ X ∼ N ( 0 , σ²I n ). </strong>The error terms are normally distributed.</li><li><strong>No Autocorrelation</strong> — <strong>E[ <em>εiεj</em> | <em>X</em> ] = 0 for <em>i</em> ≠ <em>j</em>. </strong>The error terms are uncorrelated.</li><li><strong>Strict Exogeneity </strong>—<strong> E ⁡ [ ε ∣ …Xt-1, Xt, Xt+1… ] = 0. </strong>This assumes the errors are mean zero <strong>E[<em>ε</em>] = 0</strong>, and the errors are uncorrelated with the input features <strong>E[<em>X.ε</em>] = 0</strong>. Crucially this means each error term must be uncorrelated with <em>every value of X </em>past present and future.</li><li><strong>Weak Exogeneity </strong>(Optional — we’ll come back to this) — <strong>E ⁡ [ ε ∣ …Xt-1, Xt] = 0. </strong>Similar to ‘strict’ form except expectation only applies to <em>current </em>and <em>past</em> values (not future values of X).</li></ul><p>Failure to meet these any or all of these assumptions, can cause our models, whether inference or prediction, to be inefficient, inaccurate, incorrectly significant or harder to interpret than necessary. <a href="http://people.duke.edu/~rnau/testing.htm">See here for more information.</a></p><p>However, when we try to extend these regression assumptions from the cross-sectional domain to time-series, the two assumptions that are hardest to meet are the last two: Lack of <strong>Autocorrelation</strong> and <strong>Strict Exogeneity</strong> . In particular if these conditions do not hold then our regression coefficient estimate is biased as is the variance of the coefficient estimate.</p><h4><strong>Why does this happen?</strong></h4><p>The Law of Large Numbers (LLN) states that if our variables are iid (and have a finite expected value), then the sample means, variances and covariances will tend towards the true, population moments.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/456/1*NpJww2fI48aQJiQy1BnJsA.png" /><figcaption>Sample moments ==&gt; Population moments</figcaption></figure><p>When we have this nice asymptotic property of sample moments tending towards population moments, our regression estimators are efficient and unbiased, our standard errors and confidence intervals are trustworthy.</p><p>However, if we modify the above formula to make <em>xi </em>a function of time, x<em>t, </em>then the sample mean no longer converges to the population mean, it diverges to infinity!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/399/1*hHP7zVZINn-KCXQ1G-vHCg.png" /><figcaption>The mean diverges to infinity!</figcaption></figure><p>This example also extends to the variance and covariance diverging too. It should be obvious now that if we were to run regression or any statistical analysis on this time trend data, that our results would be unreliable. 
Furthermore, in some situations, as we increase the sample size, this can make our model even worse!</p><blockquote>This is an example of non stationary data.</blockquote><p>A more subtle example shows a time series that we call ‘cyclo-stationary’:</p><p><em>y</em>(<em>t</em>)=<em>μ </em>+ <em>A</em>cos(2<em>πt</em>) + e</p><p>in this example if we calculate the full sample mean, <em>y/n</em> does indeed converge to the population mean <em>μ</em>. However if we choose a fixed time window as our sample, say <em>t` </em>(<em>tee prime</em>), then we converge to a <em>different </em>mean: <em>μ</em>+<em>A</em>cos(2<em>πt</em>′). Note, this is still a constant mean, but the two means clearly are not equal:</p><p><em>μ </em>!= <em>μ</em>+<em>A</em>cos(2<em>πt</em>′)</p><p>This is an example of a time series that is (cyclo) stationary but not ergodic.</p><blockquote>Only if we have both a stationary <em>and </em>ergodic time series, can we loosen our assumption of <strong>Strict Exogeneity</strong> to <strong>Weak Exogeneity</strong>. This allows us to meet the model assumptions and have a reliable model.</blockquote><h4>What is stationarity?</h4><p>So we’ve learnt something about stationarity, but what <em>is </em>a stationarity?</p><p>Definition: If we take a time series {Xt} (or any sequence of random variables) and define the joint distribution of a consecutive sub sequence as Fx(Xt1…Xtn), then define a 2nd joint distribution from a 2nd sequence Fx(Xt1+ 𝜏…Xtn+ 𝜏) then for all 𝜏,t,n, if Fx== Fx, we have a stationary series in {Xt}.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*s8tt_R7ArUj7QlZVI0ARzQ.png" /><figcaption>If 2 random vectors have same distribution they are Stationary.</figcaption></figure><blockquote>In words, a stationary process is one whose joint probability distribution doesn&#39;t change with a shift in time.</blockquote><p>The above definition implies <strong>Strict Stationarity</strong>. If instead of the <em>whole</em> distribution being the same, we just have the mean and covariance consistent throughout the time series, then we have <strong>Weak Stationarity</strong>.</p><p><strong>Note</strong>: iid is a stronger assumption than stationarity because stationarity makes no assumption about the data being independent, just that they are identically distributed.</p><blockquote>All iid sequences are stationary but the reverse does not hold true.</blockquote><h4>Integration</h4><p>The penultimate step before we get to Cointegration, is the concept of Integration denoted as <strong><em>I</em></strong>(i). Lets define a simple time series where the regressors are the error terms, defined as <em>Y-Ŷ = ε, </em>the true value minus the predicted value. The <em>bj </em>terms are the ‘weights’, denoting how much each error term influences Yt.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/135/1*BSElMLqURasEXeN5gEkZVw.png" /><figcaption>Moving Average (MA) series</figcaption></figure><p>Given this moving average series, if the following condition holds, then we call the series <strong><em>I</em></strong>(0). This conditions states that the autocorrelation (<em>influence of the error terms on Yt’s</em>) decays such that the variance of <em>bk </em>doesn&#39;t blow up to infinity<em>. 
</em>The mathy term is ‘<em>square summable</em>’.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/133/1*Xlx9M7KTWcdVP7oquFriAw.png" /><figcaption>Weights decay</figcaption></figure><p><strong>Note</strong>:<strong><em> I</em></strong>(0) is necessary for stationarity but not sufficient, So all stationary series are <strong><em>I</em></strong>(0), but not all <strong><em>I</em></strong>(0) are stationary.</p><p>If we cumulatively sum an <strong><em>I</em></strong>(0) series we get an <strong><em>I</em></strong>(1) series. The following python code generates an <strong><em>I</em></strong>(0) series sampling from a standard normal, then cumulatively sums those values to get an <strong><em>I</em></strong>(1). Note how the <strong><em>I</em></strong>(1) looks remarkably like a stock price chart! We could reverse the <strong><em>I</em></strong>(1) by taking the 1st difference of the series, by taking each price minus the previous price:</p><pre>x = pd.Series(index=range(1000))</pre><pre>#generate samples from standard normal</pre><pre>for i  in range(1000):</pre><pre>x[i] = (np.random.normal(0,1))</pre><pre>x_i_zero = x</pre><pre>#cumulatively sum the I(0) series to make it I(1)</pre><pre>x_i_one = np.cumsum(x)</pre><pre>plt.plot(x_i_zero, label = &#39;I(0)&#39;)</pre><pre>plt.plot(x_i_one, label = &#39;I(1)&#39;)</pre><pre>plt.legend()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/1*9gG8YUl31HfDUWQl2_8HgA.png" /><figcaption><strong><em>I</em></strong>(0) cumulatively summed = <strong><em>I</em></strong>(1)</figcaption></figure><h4>Cointegration</h4><p>At a high level, if a linear combination of two or more non-stationary time series is stationary, then the entire set of time series is considered cointegrated.</p><blockquote>Definition:</blockquote><blockquote>Given a set of time series (or any sequence of RV’s) <em>{X</em>1, <em>X</em>2, …, <em>Xk</em>}, if all series are <strong><em>I</em></strong>(1) as is usually the case with financial data, then if some linear combination of them evaluates to an <strong><em>I</em></strong>(0) series, we call the set of time series Cointegrated.</blockquote><p>Formally, we are building a linear model (which we will see later can be done with regression) where the X’s are individually <strong><em>I</em></strong>(1) and therefore non stationary, that gives us a new singular time series Y, that is <strong><em>I</em></strong>(0) and stationary.</p><blockquote><em>Y</em>=<em>b</em>1<em>X</em>1+<em>b</em>2<em>X</em>2+⋯+<em>bkXk</em></blockquote><p>For example<em>, if X</em>1, <em>X</em>2, and <em>X</em>3 are all <strong><em>I</em></strong>(1), and the linear combination of 5<em>X</em>1+3<em>X</em>2+0<em>X</em>3=5<em>X</em>1+3<em>X</em>2 is <strong><em>I</em></strong>(0). Then in this case the time series set (X1, X2, X3) are cointegrated.</p><p>So how does this help us build a trading strategy? Well,</p><blockquote>if we can find 2 or more time series that are cointegrated, then that cointegrated time series, by definition would be <strong><em>I</em></strong>(0 ) and mean reverting. So we could generate signals whenever the series moved far away from its mean on the expectation that it will move back to the mean over time.</blockquote><h3>Building a BackTester</h3><blockquote>Lets build a BitCointegration BackTester! 
(Now it makes sense right?)</blockquote><p>roadmap:</p><ul><li>hypothesis testing</li><li>the strategy</li><li>building the backtester</li><li>backtesting pitfalls</li></ul><h4>Our hypothesis</h4><p>Before we begin our analysis and building of the backtester, we need to start with a hypothesis as to <em>why </em>two or more securities might be cointegrated.</p><p>This is an important starting point. If we simply scanned every tradable instrument over all time periods, then we would undoubtedly find a pair of instruments that showed cointegration. This is the curse of multiple comparison bias. Put simply, if you look at enough data, you will eventually find a result which matches your desired outcome, regardless of its statistical significance. Its crucial to understand this before we start and <a href="https://medium.com/@pdquant/stocks-significance-testing-p-hacking-how-volatile-is-volatile-1a0da3064b8a">I have an entire research piece on this topic here.</a></p><p><a href="https://medium.com/@pdquant/stocks-significance-testing-p-hacking-how-volatile-is-volatile-1a0da3064b8a">Stocks, Significance Testing &amp; p-Hacking: How volatile is volatile?</a></p><p>For our research, we will be using <strong>Bitcoin </strong>(<strong>BTC</strong>) and <strong>Bitcoin Cash </strong>(<strong>BCH</strong>). The base economic rationale for this is simple: BCH is a fork of BTC, therefore our hypothesis is that the 2 instruments *might* be cointegrated. We keep it as simple as that and then test this hypothesis.</p><p><strong>Note</strong>: <em>my post today is not about bitcoin or blockchain or the relative merits of one instrument over the other. There are plenty of other forums for that!. My aim is to teach the concept of cointegration and how to test for it statistically and how to build a backtester from scratch along with the many pitfalls. To get a quick overview of the difference between BTC and BCH </em><a href="https://coinsutra.com/btc-vs-bch-bitcoin-cash/"><em>read here</em></a><em>.</em></p><h4>The strategy</h4><p>Assuming that the analysis that we are about to do, finds that BTC and BCH are cointegrated, then based on what we have learned so far, we know that a linear combination of BTC and BCH, if cointegrated, will be stationary and therefore mean reverting. We will use this property to build a trading strategy, specifically as we’ll see in the next section, we will <strong>short</strong> the <strong>spread </strong>between <strong>BTC</strong> and <strong>b*BCH </strong>(b is a weight, that we will calculate) when it rises above 1 standard deviation (upper blue line) and <strong>long </strong>the<strong> spread </strong>if it moves below the lower blue line. Crucially we will take profit and <strong>close out</strong> the positition when the spread hits the mean (read line). Note, we dont have any stop loss implemented in this strategy, but it could easily be added.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/835/1*hQFdIXXSvK1izeLmh2g1Cw.png" /><figcaption>buy/sell signals generated at ±1std</figcaption></figure><p>When we go <strong>long </strong>the spread, this means we <strong>buy BTC</strong> and <strong>sell b*BCH</strong>. When we generate a <strong>short</strong> signal we <strong>sell BTC</strong> and <strong>buy b*BCH</strong>. 
Because this type of pairs trading strategy can get quite complicated we will restrict each buy/sell amount to $1000 each time we execute a trade.</p><blockquote>Example: if we generate a <strong>short </strong>signal, we would <strong>sell $1000</strong> of <strong>BTC</strong> and <strong>buy $1000</strong> of <strong>b*BCH</strong></blockquote><p>Th eagle eyed reader will notice that we are actually not buying <strong>$1000</strong> of <strong>BCH</strong> but <strong>$1000</strong> times some multiplier ‘<strong>b</strong>’ times <strong>BCH</strong>. Otherwise we would simply be trading the raw spread which is not cointegrated! We will account for this when we code the <em>Portfolio </em>class.</p><p>Here’s an overview of the Classes and methods we’ll be building:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/1*KBf6VUiI0tvfIU5l_ow98Q@2x.jpeg" /></figure><ul><li><strong>DataPuller </strong>— to pull, align and organize the data from API</li><li><strong>Portfolio </strong>— handles the logic of the cointegration pairs trade</li><li><strong>Strategy </strong>— runs event-driven backtester with ‘Portfolio’ as base class</li></ul><p>Lets get started with the usual imports:</p><pre><strong>#standard imports</strong></pre><pre>import requests<br>import numpy as np<br>import pandas as pd<br>import seaborn as sns<br>from scipy import stats<br>import matplotlib.pyplot as plt<br>%matplotlib inline</pre><pre><strong>#nice trick to make plots full width</strong></pre><pre>plt.rcParams[&#39;figure.figsize&#39;] = [15,5]</pre><p>We are going to use the <a href="https://min-api.cryptocompare.com/documentation?key=Price&amp;cat=SingleSymbolPriceEndpoint">CryptoCompare</a> API to get the price data:</p><pre><strong>#fetch daily OHLC prices for BTC, BCH</strong></pre><pre>btc = requests.get(&quot;<a href="https://min-api.cryptocompare.com/data/histoday?fsym=BTC&amp;tsym=USD&amp;limit=500">https://min-api.cryptocompare.com/data/histoday?fsym=BTC&amp;tsym=USD&amp;limit=500</a>&quot;).json()[&#39;Data&#39;]<br>bch = requests.get(&quot;<a href="https://min-api.cryptocompare.com/data/histoday?fsym=BCH&amp;tsym=USD&amp;limit=500">https://min-api.cryptocompare.com/data/histoday?fsym=BCH&amp;tsym=USD&amp;limit=500</a>&quot;).json()[&#39;Data&#39;]</pre><p>Next we put the data into a Pandas Dataframe and change the time column to a proper Pandas <em>DateTime </em>object and use the <em>rename </em>function to change the duplicate columns of ‘close’ to unique names, ‘btc’ and ‘bch’. 
We also select our starting date as 2017–12–12 (more on this later).</p><pre><strong>#put into dataframe</strong></pre><pre>btc_df = pd.DataFrame(btc)<br>bch_df = pd.DataFrame(bch)</pre><pre><strong>#use pandas datetime feature to convert timestamp into a datatime object with units = seconds</strong></pre><pre>btc_df[&#39;time&#39;] = pd.to_datetime(btc_df[&#39;time&#39;], unit=&#39;s&#39;)<br>bch_df[&#39;time&#39;] = pd.to_datetime(bch_df[&#39;time&#39;], unit=&#39;s&#39;)</pre><pre><strong>#use the newly created datetime object as index</strong></pre><pre>btc_df.set_index(&#39;time&#39;, inplace=True)<br>bch_df.set_index(&#39;time&#39;, inplace=True)</pre><pre><strong>#rename &#39;close&#39; for each instrument so they have unique names</strong></pre><pre>btc_df.rename({&#39;close&#39;:&#39;btc&#39;}, axis=1, inplace=True)<br>bch_df.rename({&#39;close&#39;:&#39;bch&#39;}, axis=1, inplace=True)</pre><pre><strong>#select our desired stating data</strong><br>btc_df = btc_df.loc[&#39;2017-12-12&#39;:]<br>bch_df = bch_df.loc[&#39;2017-12-12&#39;:]</pre><p>So here’s one of our cryptocurrency dataframes:</p><pre>btc_df.head()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/755/1*Og4Shsm6KYT7dbF5Ws-_Tw.png" /></figure><p>For our purpose we just want the closing prices of both BTC and BCH, so we will use the <em>concat</em> function in pandas to merge just the closing price columns:</p><pre><strong>#we&#39;ll work with just the closing pries for this project, so concatenate the 2 columns together.</strong></pre><pre>df = pd.concat([btc_df[&#39;btc&#39;], bch_df[&#39;bch&#39;]],axis=1)</pre><pre><strong>#we&#39;ll also add the raw spread as a column</strong></pre><pre><strong>#calculate the spread between the 2 prices, for reference only.<br>#We will be trading the &#39;cointegration spread&#39; instead.</strong></pre><pre>df[&#39;spread&#39;] = df[&#39;btc&#39;] - df[&#39;bch&#39;]<br>df.head()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/391/1*l_opb8vPVHqOkcCca1cZdg.png" /></figure><p>So that&#39;s our dataframe sorted, now we need to test the 2 time series, <strong>BTC </strong>and <strong>BCH </strong>to see if they are cointegrated. To do this we will import the <em>adfuller </em>and <em>coint </em>modules from <em>statsmodels </em>and select a training sample from both of our cryptocurrencies. We choose a 5 month window from the beginning of 2018. We also create a ‘spread’ series showing the difference between BTC and BCH just for reference.</p><p>The function <em>coint</em> basically fits a regression model, like we have already discussed and tests the <strong>null hypothesis</strong> that there is <strong>no cointegration</strong>, meaning we want to see a small p-value, so we can reject the null.</p><p><em>adfuller</em> tests for a ‘<a href="https://en.wikipedia.org/wiki/Unit_root">unit root</a>’ which would indicate the series is non stationary. 
Again we want a small p-value.</p><p><strong><em>adfuller </em></strong>implements the Augmented Dickey Fuller test for stationarity.</p><p><strong><em>coint</em></strong> implements the <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=3&amp;cad=rja&amp;uact=8&amp;ved=2ahUKEwiE8c-hjr7gAhXlSxUIHQrEC8EQFjACegQIBhAK&amp;url=https%3A%2F%2Fwarwick.ac.uk%2Ffac%2Fsoc%2Feconomics%2Fstaff%2Fgboero%2Fpersonal%2Fhand2_cointeg.pdf&amp;usg=AOvVaw1LWNNYdqFz_wZT4xiMmxh8">Engle Granger 2-Step method</a> for cointegration testing.</p><pre><strong>#test for cointegration</strong></pre><pre>from statsmodels.tsa.stattools import coint, adfuller<br>import statsmodels.api as sm</pre><pre><strong>#select a training sample</strong></pre><pre>btc_train, bch_train = df[&#39;btc&#39;].loc[&#39;2017-12-12&#39;:&#39;2018-4-30&#39;], df[&#39;bch&#39;].loc[&#39;2017-12-12&#39;:&#39;2018-4-30&#39;]<br>spread_train = btc_train - bch_train</pre><p>Lets throw our training set into the <em>coint </em>function and what we’re looking for is a p-value (<a href="https://medium.com/@pdquant/stocks-significance-testing-p-hacking-how-volatile-is-volatile-1a0da3064b8a">see here for more on p-values</a>) below a 5% significance level:</p><blockquote>This will imply <strong>BTC </strong>and <strong>BCH</strong> are cointegrated over the training period:</blockquote><pre><strong>#return p value<br>#coint returns 3 values t stat, p-value and critical value<br>#in python we can unpack all three on one line</strong></pre><pre>t,p,crit = coint(btc_train,bch_train)</pre><pre><strong>#test for significance</strong><br>print(p)<br>if p &lt;0.05:<br>    print(&#39;Cointegrated!&#39;)<br>else:<br>    print(&#39;NOT Cointegrated&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/247/1*G8Rrek9e4VPfi0Ne3iC-iA.png" /></figure><p>Great! <strong>Bitcoin </strong>and <strong>Bitcoin Cash</strong> appear to be <strong>cointegrated</strong>. Well if they’re cointegrated the spread between them must also be stationary right?</p><pre><strong>#use adf to test for stationarity</strong></pre><pre>pval_spread = adfuller(spread_train)[1]<br>if pval_spread &lt;0.05:<br>    print(pval_spread,&#39;Data is Stationary!&#39;)<br>else:<br>    print(pval_spread, &#39;Data is NOT Stationary!&#39;)</pre><pre><strong>#note the spread itself is Not stationary as it assumes a &#39;Beta&#39; value of 1<br>#so we need to construct a linear model to find the optimal Beta value...</strong></pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/510/1*XNzwd7So_Cwi0SuvDF-n8g.png" /><figcaption>Oops! The spread’s not stationary</figcaption></figure><p>So whats happened here? remember we defined cointegration as being a linear combination of the the time series that are stationary, not simply the raw 1-to-1 spread. But how do we find this linear combination? 
Well one technique we already know for finding a linear combination is linear regression!</p><p>If we have 2 time series <strong>X1</strong>,<strong>X2 </strong>then if we can define a linear model as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/205/1*EFuQMLVgJZKfZY91D2eSvA.png" /></figure><p>which is therefore cointegrated and if we rearrange the algebra we can get:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/193/1*p4TCf7rkOXx4rMWmh4zm9Q.png" /></figure><p>In other words, if there is a linear combination of <strong>X1</strong> and <strong>X2, </strong>that gives a spread which is <strong><em>I</em></strong>(0),<strong> </strong>then by definition the spread is stationary and mean reverting.</p><p>So what we need to do is build a simple linear model between <strong>BTC</strong> and <strong>BCH </strong>and use the slope coefficient ‘beta’ from that equation to build our stationary spread series defined as ‘z’:</p><pre><strong>#build linear model to find beta that gives I(0) combination of pair</strong></pre><pre>X = sm.add_constant(bch_train)<br>result = sm.OLS(btc_train,X).fit()</pre><pre><strong>#result.params returns the intercept (const) and slope of the model. #We can ignorethe intercept and use &#39;b&#39; to build our cointegrated #series!?</strong></pre><pre><br>print(result.params)</pre><pre><strong>#define new stationary spread as &#39;z&#39;<br>#&#39;b&#39; value gives the parameter of our linear model</strong></pre><pre>b = result.params[&#39;bch&#39;]</pre><pre><strong>#simply define our new cointegrated series as z = btc - b*bch</strong><br>z = btc_train - b*bch_train</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/242/1*mblfmk_I_LDcUT3oNxBaUA.png" /><figcaption>intercept and slope coefficients of linear model</figcaption></figure><p>Now if we run the augmented dickey fuller test on this new linear combination of <strong>BTC</strong> and <strong>b*BCH</strong>:</p><pre><strong>#run adf again, this time on linear combination &#39;Z&#39;</strong></pre><pre>plt.plot(z)<br>z_pval = adfuller(z)[1]<br>if z_pval&lt;0.01:<br>    print(z_pval,&quot;Huzzah!, it&#39;s Stationary&quot;)<br>else:<br>    print(z_pval,&quot;:Not stationary&quot;)<br>plt.axhline(z.mean())</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/983/1*kVWCW6887NUm5Vmi7imPjA.png" /><figcaption>Huzzah it’s stationary!</figcaption></figure><p>It’s stationary! This means we have found a linear combination of BTC and BCH which is stationary (at least over the training period).</p><p>Lets think about what the series ‘z’ actually is and how we can construct a trading strategy from it. ‘z’ represents the difference (spread) between <strong>1</strong> unit of <strong>BTC</strong> and <strong>3.99</strong> units of <strong>BCH</strong>,<strong> </strong>which we have shown to be stationary. Without going too deep into the inner workings of ADF (Augmented Dickey Fuller) it checks for a ‘unit root’ which is a fancy way of saying the moments (mean, variance etc) depend on time ‘t’ and are therefore non-stationary. We want to reject the null hypothesis that a unit root exists.</p><p>Lets produce some plots to show visually what we have done. 
The following code plots 3 charts</p><ol><li>The raw spread between BTC and BCH (not cointegrated), we use this as a reference.</li><li>THIS IS IMPORTANT — we plot the full time series of BTC — b*BCH (training + test set) ASSUMING that stationarity hold not just for the training set but ALSO the test set (we will see later the consequences of this).</li><li>Shows a plot of the daily returns, we don&#39;t actually use this series but we include it as a potential alternative predictive feature that we might want to test.</li></ol><p>This final point (3) highlights the fact that I have arbitrarily selected ± 1 standard deviation on the cointegrated spread, as our trading strategy. We could just as easily use daily returns breaching 1.2345 stdev as our signal, or anything else!</p><p>Marked in green is the end of training set|begining of test set. We will see later why our assumption of stationarity in the training set holding true for the test set, is a bad idea!</p><pre><strong>#calculate cointegrated series &#39;full_z&#39; for the whole (train + test) dataset</strong></pre><pre>spread = df[&#39;spread&#39;]</pre><pre>full_z = df[&#39;btc&#39;] - b*df[&#39;bch&#39;]</pre><pre><strong>#lets plot the raw spread, the stationary spread and for reference the &#39;spread daily percent change&#39; or &#39;returns&#39;<br>#the green vertical line shows the end of the training set period.</strong></pre><pre>fig,ax = plt.subplots(3,1,sharex=True)</pre><pre>plt.tight_layout()<br>ax[0].set_title(&#39;Spread&#39;)<br>ax[0].plot(spread)<br>ax[0].axhline(spread.mean(),color=&#39;r&#39;)</pre><pre><strong>#stationary series &#39;z&#39; plotted with 1 standard deviation horizontal bars shown<br>#note standard dev bars are arbitrary and could be anything</strong></pre><pre>ax[1].set_title(&quot;Linear model &#39;z&#39;&quot;)<br>#plot inverse so its same as &#39;Spread&#39;<br>full_z_mu = full_z.mean()<br>ax[1].plot(full_z)<br>ax[1].axhline(full_z_mu+full_z.std(),ls =&#39;--&#39;)<br>ax[1].axhline(full_z.mean(),color=&#39;r&#39;)<br>ax[1].axhline(full_z_mu-full_z.std(),ls =&#39;--&#39;)</pre><pre><strong>#spread pct change  / returns with 1 standard deviation horizontal bars shown<br></strong><br>spread_pct = spread.pct_change(1)<br>#print(new_diff.head())<br>#print(new_df.head())<br>ax[2].set_title(&#39;Spread daily % change&#39;)<br>ax[2].plot(spread_pct)<br>ax[2].axhline(spread_pct.std(),ls=&#39;--&#39;)<br>ax[2].axhline(spread_pct.mean(),color=&#39;r&#39;)<br>ax[2].axhline(-spread_pct.std(),ls=&#39;--&#39;)</pre><pre><strong>#mark end of training sample in green</strong><br>for i in range(3):<br>    ax[i].axvline(&#39;2018-4-30&#39;,color=&#39;g&#39;)<br>#new_diff.rolling(20).mean().plot(style=&#39;r+&#39;)<br>#plt.axhline(color=&#39;r&#39;)<br>#plt.text(390,0,&#39;ZERO&#39;)<br>#new_diff.rolling(10).mean().plot(style=&#39;--&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sGvKWRBpUQpN9lFVtrhSfw.png" /><figcaption>spread v ‘stationary spread’ v daily returns</figcaption></figure><p>By looking at the raw spread between BTC and BCH compared to the stationary spread, we can get a visual confirmation that the linear modeled spread ‘z’ appears to be reasonably bounded between ±1 std (blue lines) and centered around a constant mean (red line). This property seem to hold beyond the training set period (green line) too, but more on this later. 
The daily returns shown in the lower of the 3 plots, is just for reference and to show how volatility seems to correlate with major changes in the spread.</p><h3>The Backtester</h3><p>And finally we get to the actually backtester!</p><p>The first component we will build is the <em>Data_Puller</em> class.</p><p>we begin with the __init__() function by defining some variables we will use throughout the backtester; ticker1, ticker2 for holding the crypto pairs and a pandas dataframe named df3 to store the final results.</p><p>we define 2 functions (<em>actually they’re methods because they are within a class</em>):</p><p><strong>get_data()</strong>- to pull and merge the data from the API</p><p><strong>fetch_data()</strong>- to return the final dataframe so we can pass it to the next component.</p><pre><strong>#Data_Puller fetches crypto data, cleans then passes to container df3</strong></pre><pre><strong>#Class to store data for any pairs, crypto or otherwise</strong><br>class Data_Puller:<br>    def __init__(self,ticker1,ticker2,freq,periods):<br>        self.ticker1 = ticker1<br>        self.ticker2 = ticker2<br>        self.freq = freq<br>        self.periods = periods<br>        self.df3 = pd.DataFrame()<br>        <br>        <br>        <br><strong>    #method to pull, munge, store crypto pairs data</strong><br>    def get_data(self):<br>        #replace this in final merge<br>        b = 3.995977<br>        _data1 = requests.get(f&quot;<a href="https://min-api.cryptocompare.com/data/histo{self.freq}?fsym={self.ticker1}&amp;tsym=USD&amp;limit={self.periods">https://min-api.cryptocompare.com/data/histo{self.freq}?fsym={self.ticker1}&amp;tsym=USD&amp;limit={self.periods</a>}&quot;).json()[&#39;Data&#39;]<br>        _data2 = requests.get(f&quot;<a href="https://min-api.cryptocompare.com/data/histo{self.freq}?fsym={self.ticker2}&amp;tsym=USD&amp;limit={self.periods">https://min-api.cryptocompare.com/data/histo{self.freq}?fsym={self.ticker2}&amp;tsym=USD&amp;limit={self.periods</a>}&quot;).json()[&#39;Data&#39;]<br>        df1 = pd.DataFrame(_data1)<br>        df1_close = df1[&#39;close&#39;]<br>        df2 = pd.DataFrame(_data2)<br>        df2_close = df2[&#39;close&#39;]<br>        <br>        df1[&#39;time&#39;] = pd.to_datetime(df1[&#39;time&#39;],unit=&#39;s&#39;)<br>        df1.set_index(df1[&#39;time&#39;], inplace = True)<br>        <br>        df2[&#39;time&#39;] = pd.to_datetime(df2[&#39;time&#39;],unit=&#39;s&#39;)<br>        df2.set_index(df2[&#39;time&#39;], inplace = True)<br>        df1 = df1.drop([&#39;high&#39;,&#39;low&#39;,&#39;open&#39;,&#39;volumefrom&#39;,&#39;volumeto&#39;,&#39;time&#39;] ,axis=1)<br>        df2 = df2.drop([&#39;high&#39;,&#39;low&#39;,&#39;open&#39;,&#39;volumefrom&#39;,&#39;volumeto&#39;] ,axis=1)<br>        df1.rename(columns={&#39;close&#39;: &#39;BTC&#39;}, inplace=True)<br>        df2.rename(columns={&#39;close&#39;: &#39;BCH&#39;}, inplace=True)<br>        #print(df1.head())<br>        #print(df2.head())<br>        self.df3 = pd.concat([df1,df2],axis=1)<br>        #self.df3[&#39;spread&#39;] = self.df3[self.ticker1] - self.df3[self.ticker2]<br>        #self.df3[&#39;spread_pct_change&#39;] = self.df3[&#39;spread&#39;].pct_change()<br>        #add cointegration model X1 - X2 = should be stationary<br>        self.df3[&#39;full_z_coint&#39;] = self.df3[&#39;BTC&#39;] - b*self.df3[&#39;BCH&#39;]<br>        self.df3[&#39;b_x_bch&#39;] = b*self.df3[&#39;BCH&#39;]<br>        <br>        #prints df to check data<br>        print(self.df3)<br>        
<br><strong>    #returns the final dataframe, with 1st element dropped as its nan for spread_pct_change    </strong><br>    def fetch_df(self):<br>        return self.df3.loc[&#39;2017-12-12&#39;:]</pre><p>To show the output of this class, lets instantiate the class and pass in the arguments <strong>(‘BTC’,’BCH’,’day’,500) , </strong>day is the frequency and 500 is the number of days.</p><p>to display all the results we can use pd.set_option(‘display.max_rows’, 400) to show the first 400 entries.</p><pre>x = Data_Puller(&#39;BTC&#39;,&#39;BCH&#39;,&#39;day&#39;,500)<br><strong>#pd.set_option(&#39;display.max_rows&#39;, 400)</strong><br>x.get_data()</pre><pre><strong>#instantiate Data_Puller class then fetch_data</strong><br>q = x.fetch_df()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/657/1*wZIcsag6PnboIWE9B0DU8g.png" /></figure><p>At this stage its a bit ugly, but is has all the data we want. The <strong>full_z_coint </strong>column shows the spread between <strong>BTC</strong> and <strong>b*BCH </strong>(its the same as the ‘z’ variable we defined earlier in the post), this is effectively the ‘instrument’ we are going to trade as its stationary and mean reverting. Of course to trade this ‘spread’ we actually need to take a long position and a short position in BTC or b*BCH.</p><p>The variable b*BCH shows the value of BCH multiplied by the learned parameter ‘b’ which is approx 3.99, as we derived earlier.</p><p>So that’s <em>Data_Puller</em>. Next up is the<em> Portfolio </em>class. This is the most complex component as it does all of the heavy lifting in terms of trading logic and execution.</p><p>Lets walk through it.</p><p>The __init__() function defines a <em>dataframe </em>called _port() and we pass in the column names;</p><p><strong>ts</strong>=time stamp, <strong>signal</strong>= buy sell or hold (this logic will be built in the <em>Strategy </em>class next), <strong>action</strong>=indicates what action we took given the signal(bought,already bought, closed out etc),<strong> sold/bought</strong> value=dollar value of trade, <strong>U_pnl</strong>=the unrealized profit/loss showing the running pnl, <strong>R_pnl</strong>=realized profit/loss once we have actually closed out a position.</p><p>Next we initialize a few variables that will track what position we currently hold, running pnl etc.</p><p>Next is the <strong>close_out()</strong> function. This is important as it will be used throughout the backtester logic to close any open position when the trigger is met. We define the close out trigger event as being when the price hits or crosses the mean value.</p><p>What follows is the logic to handle each of the possible trade signals, which will be generated in the <em>Strategy </em>class; <strong>Hold, Long, Short</strong>. 
In each of these situations we do the following:</p><ul><li>check <strong>current_pos</strong> to see if we already have a position</li><li>calculate new s<strong>ell/buy units</strong> by taking our $1000 initial value and adjusting to today&#39;s price and quantity</li><li>if we have a position and the new price has crossed the mean value ‘close out’ threshold, then we run the <strong>close_out()</strong> function</li><li>update current position to reflect any changes</li><li>print out any actions to console</li><li>pass the new current values down to the end of the class to be added to our <strong>_port</strong> portfolio <em>dataframe</em></li></ul><pre><strong>#Portfolio class handles trade logic</strong><br>class Portfolio:<br>    def __init__(self):<br><strong>        </strong>#self.orders = pd.DataFrame(columns=[&#39;TS&#39;,&#39;Order&#39;,&#39;tick1&#39;,&#39;tick2&#39;])<br>        self._port = pd.DataFrame(columns=[&#39;ts&#39;,&#39;signal&#39;,&#39;action&#39;,&#39;sold_value&#39;,&#39;bought_value&#39;,&#39;U_pnl&#39;,&#39;R_pnl&#39;])<br>#        self.current_budget = 1000000<br>        self.signal = None<br>        self.prev = None<br>        #bought / sold<br>        self.current_pos= &quot;empty&quot;<br>        #self.pnl = pd.DataFrame(columns = [&#39;pnl&#39;])<br>        self.bought_sold_price = 0<br>        self.stamp = 0<br>        #self.sold_value = 0<br>        #self.bot_value = 0<br>        self.sell_units = 0<br>        self.buy_units = 0<br>        self.value_2 = 0<br>        self.value_1 = 0<br>        self.rpl = 0<br>    <br>    def close_out(self):<br>        self.rpl += (1000 - self.value_2) + (self.value_1 - 1000)<br>        self.current_pos =&#39;empty&#39;</pre><pre>print(&quot;close out position&quot;)<br>        <br>          <br>    def position(self,ts,tick1,tick2,price,tot_trade_amount=2000):<br>        print()<br>        print(self.stamp)<br>        print(&#39;current pos:&#39;,self.current_pos)<br>        print(&quot;bought / sold price: &quot;,self.bought_sold_price)<br>        print(&#39;this is prev:&#39;, self.prev)<br>        print(&#39;this is the signal:&#39;,self.signal)<br>        single_trade_amount = tot_trade_amount/2<br>        action = None<br>        <br><strong>#logic for Hold signal</strong>        <br>        <br>        if self.signal ==&quot;Hold&quot;:<br>            <br>            if self.current_pos ==&#39;sold&#39;:<br>                print(&quot;sold tick&quot;)<br>                self.value_2 = self.sell_units * tick2<br>                self.value_1 = self.buy_units * tick1<br>            elif self.current_pos == &#39;bought&#39;:<br>                self.value_2 = self.sell_units * tick1<br>                self.value_1 = self.buy_units * tick2<br>            else:<br>                print(&quot;Hold neither bought nor sold&quot;)<br>                self.value_2 = 0<br>                self.value_1 = 0<br>                <br>                <br>                <br>          <br>            print(&quot;hold 1&quot;)<br>            print(&quot;caputured by Hold&quot;)<br>            <br>            if self.current_pos == &#39;bought&#39; and price &gt; self.mu:<br>                print(&quot;hold 2&quot;)<br>                self.close_out()<br>                action = &quot;Closed out Long&quot;<br>                #self.current_pos =&#39;empty&#39;<br>                <br>            elif self.current_pos ==&#39;sold&#39; and price &lt; self.mu:<br>                print(&quot;hold 3&quot;)<br>                
self.close_out()<br>                action = &quot;Closed out Short&quot;<br>                #self.current_pos =&#39;empty&#39;<br>            else:<br>                print(&quot;hold 4&quot;)<br>                print(&quot;&quot;&quot;take no action -&gt; Hold&quot;&quot;&quot;)<br>                action = &quot;Held&quot;<br>        <br><strong>#logic for Short signal</strong>        <br>        <br>        elif self.signal ==&#39;Short&#39;:<br>            <br>            print(&quot;caputrd by Short&quot;)<br>            sell_units = single_trade_amount/tick2<br>            buy_units = single_trade_amount/tick1<br>            <br>            if self.signal == &#39;Short&#39; and  self.signal != self.prev:<br>                print(&quot;short 1&quot;)<br>                if self.current_pos == &#39;bought&#39;:<br>                    self.value_2 = self.sell_units * tick2<br>                    self.value_1 = self.buy_units * tick1<br>                    self.close_out()<br>                elif self.current_pos == &#39;empty&#39;:<br>                    <br>                    print(&quot;short 2&quot;)</pre><pre><br>                    print(&quot;Went short: sold&quot;,sell_units,&quot;units of BTC&quot;,&quot;at a price of&quot;,tick2, &quot;and bought&quot;,buy_units,&quot;of b*BCH at a price of&quot;,tick1)<br>                    #self.sold_value = sell_units*tick2<br>                    #self.bot_value = buy_units*tick1<br>                    self.sell_units = sell_units<br>                    self.buy_units = buy_units<br>                    self.value_2 = self.sell_units * tick2<br>                    self.value_1 = self.buy_units * tick1</pre><pre>self.bought_sold_price = tick2 - tick1<br>                    self.current_pos = &#39;sold&#39;<br>                    action = &quot;Went Short!&quot;</pre><pre>else:<br>                    print(&quot;short 5&quot;)<br>                    print(&quot;current pos must be already sold - check!&quot;)<br>                    action = &quot;Already Short!&quot;<br>                    self.value_2 = self.sell_units * tick2<br>                    self.value_1 = self.buy_units * tick1</pre><pre>else:<br>                print(&quot;short 6&quot;)<br>                print(&quot;prev signal must be Short - check!&quot;)<br>                action = &quot;Already Short!&quot;<br>                self.value_2 = self.sell_units * tick2<br>                self.value_1 = self.buy_units * tick1</pre><pre><br>            <br><strong>#logic for Long signal</strong>            <br>        <br>        elif self.signal ==&#39;Long&#39;:<br>            <br>            print(&quot;captured by Long&quot;)<br>            sell_units = single_trade_amount/tick1<br>            buy_units = single_trade_amount/tick2<br>            <br>            if self.signal == &#39;Long&#39; and self.signal != self.prev:<br>                print(&quot;long 1&quot;)<br>                if self.current_pos == &#39;sold&#39;:<br>                    self.value_2 = self.sell_units * tick1<br>                    self.value_1 = self.buy_units * tick2<br>                    self.close_out() <br>                    action = &quot;short =&gt; close out&quot;<br>                elif self.current_pos == &quot;empty&quot;:<br>                    <br>                    print(&quot;long 2&quot;)</pre><pre><br>                    print(&quot;Went Long: sold&quot;,sell_units,&quot;units of b*BCH&quot;,&quot;at a price of&quot;,tick1, &quot;and bought&quot;,buy_units,&quot;of BTC at a price 
of&quot;,tick2)<br>                    #self.sold_value = sell_units*tick1<br>                    #self.bot_value = buy_units*tick2<br>                    self.sell_units = sell_units<br>                    self.buy_units = buy_units<br>                    self.value_2 = self.sell_units * tick1<br>                    self.value_1 = self.buy_units * tick2</pre><pre>self.bought_sold_price = tick2 - tick1<br>                    self.current_pos = &#39;bought&#39;<br>                    action = &quot;Went Long!&quot;<br>                    print(&quot;should be 1000&quot;, single_trade_amount)<br>                    print(&quot;tot trade amount&quot;, tot_trade_amount)</pre><pre>else:<br>                    print(&quot;long 5&quot;)<br>                    print(&quot;current pos must be already long - check&quot;)<br>                    action = &quot;Already Long!&quot;<br>                    self.value_2 = self.sell_units * tick1<br>                    self.value_1 = self.buy_units * tick2<br>            else:<br>                print(&quot;long 6&quot;)<br>                print(&quot;prev signal must be long - check!&quot;)<br>                action = &quot;Already Long!&quot;<br>                self.value_2 = self.sell_units * tick1<br>                self.value_1 = self.buy_units * tick2</pre><pre>else:<br>            print(&quot;not captured 1&quot;)<br>            print(&quot;not captured by buy sell or hold need to fix!&quot;)<br>            <br>            <br>        print(self.sell_units)<br>        print(self.buy_units)<br>        print(ts)<br>        #print(&quot;tick1: &quot;, tick1, &quot;tick2: &quot;, tick2)</pre><pre><strong>#calculate unrealized pnl and finally update _port()</strong></pre><pre>        urpl = (1000 - self.value_2) + (self.value_1 - 1000)<br>        self._port.loc[len(self._port)] = [ts,self.signal,action,self.value_2,self.value_1,urpl,self.rpl]</pre><pre>self.prev = self.signal<br>        self.stamp+=1</pre><p>Lets remind ourselves of the ‘spread’ that we are trading. Any time the price moves above the top blue line, we short the spread by selling <strong>BTC </strong>and buying <strong>b*BCH</strong>. If the price goes below the lower blue line we do the opposite; buy <strong>BCT </strong>and sell<strong> b*BCH. </strong>When the spread value crosses the mean (red line) we close out any position we may have.</p><p>Remember there is one thing missing from this trading strategy-stop loss!</p><p>This code generates a plot to show the ‘spread’ that we are trading. I’ve added the index number of the trades for reference:</p><pre><strong>#shows the spread we are trading with mean (red line) and +- 1std (blue lines)</strong></pre><pre>_mu = np.mean(q.full_z_coint)<br>plt.plot(q.full_z_coint)<br>plt.axhline(np.mean(q.full_z_coint),color=&#39;r&#39;)<br>plt.axhline(_mu+np.std(q.full_z_coint),color=&#39;b&#39;)<br>plt.axhline(_mu-np.std(q.full_z_coint),color=&#39;b&#39;)</pre><pre><strong>#plot every 5th index for debugging and reference</strong><br>for i ,txt in enumerate([x for x in range(len(q))]):<br>    if i%5==0:<br>        plt.annotate(txt,(q.index[i],q.full_z_coint[i]))<br>    <br>print(&#39;mu&#39;,np.mean(q.full_z_coint))<br>print(&#39;std&#39;,np.std(q.full_z_coint))</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/980/1*XKlUfAom9q7Z4aFP8fS0iA.png" /></figure><p>The final component is the <em>Strategy </em>class. 
This executes the event driven backtester by pulling one row of data at a time from the <em>Data_Puller</em> class, passing it through the backtester logic handled by the <em>Portfolio </em>class via the position() method. This is the core of an event driven backtester. Rather that simply vectorizing the whole time series and applying the trading rules to all points at the same time, we simulate what would happen with our strategy tick-by-tick. This gives us a much more realistic simulation and an expandable framework whereby we could add in functionality to account for transaction costs, slippage, liquidity, microstructure events etc, into our backtest.</p><pre><strong>#create strategy to perform on any pair.</strong><br>class Strategy(Portfolio):<br>    <br>    def __init__(self):<br>        #use Super to get Portfolio attrs<br>        Portfolio.__init__(self)<br>        #price_feed = Data_Puller().fetch_df()<br>        self.sdev = np.std(q.full_z_coint)<br>        self.mu = np.mean(q.full_z_coint)<br>        <br>    <br>    <strong>#go long / short if +- 1 std, sell when hit mean</strong><br>    def strat(self):<br>        <br>        while q.empty==False:<br>        <br>            <br>            <strong>#print(&#39;running...&#39;)<br>            #pop .loc and drop it...</strong><br>            btc,bch,ts,z_coint,b_x_bch = q.iloc[0]<br>            q.drop(q.head(1).index,inplace=True)<br>                        <br>            #compare to plus / minus 1 stdev -&gt; generate signal<br>            if z_coint &gt; self.mu + self.sdev:<br>                #self.orders.loc[len(self.orders)] = [ts,&#39;Short&#39;,btc,bch]<br>                self.signal = &#39;Short&#39;<br>                self.position(ts,b_x_bch,btc,z_coint)<br>                <br>            elif z_coint &lt; self.mu - self.sdev:<br>                #self.orders.loc[len(self.orders)] = [ts,&#39;Long&#39;,btc,bch]<br>                self.signal = &#39;Long&#39;<br>                self.position(ts,b_x_bch,btc,z_coint)<br>                <br>                            <br>            <br>            else:<br>                #self.orders.loc[len(self.orders)] = [ts,&#39;Hold&#39;,btc,bch]<br>                self.signal = &#39;Hold&#39;<br>                self.position(ts,b_x_bch,btc,z_coint)<br>                <br>            <br>            <br>            <br>            <br>            #print(self.current_position)<br>        <br>        print(&#39;Finished!&#39;)<br>        <br>            <br><strong>#function to return tick by tick printout and R_pnl chart</strong>               <br>    def get_portfolio(self):<br>        self._port.set_index(&#39;ts&#39;,inplace=True)<br>        plt.plot(self._port.R_pnl)<br>        plt.show()<br>        pd.set_option(&#39;display.max_rows&#39;, 400)<br>        return self._port.head(360)<br>        #return self._port</pre><p>Lets run the backtester and see what happens. First we instantiate the <em>Strategy </em>class and run the strat() method. 
<p>Let’s run the backtester and see what happens. First we instantiate the <em>Strategy </em>class and run the strat() method. This will start printing out a real-time, tick-by-tick display of various bits of information to show what the backtesting engine is doing:</p><pre>p = Strategy()<br><br>p.strat()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oy8oYqBNX8Tk-i6rp8YnqA.gif" /><figcaption>backtester prints out tick-by-tick events</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/977/1*RmNDAIjW5YEdJOMRIwMsJQ.png" /><figcaption>sample of print out as backtester runs</figcaption></figure><p>Here’s a selection of what’s included in the print out:</p><ul><li><em>index</em></li><li><em>previous signal</em> (for debugging)</li><li><em>captured by</em>: shows which logic is triggered (short/long/hold), again for debugging</li><li><em>short1, short2</em>: references which part of the ‘short’ logic is activated</li><li>a string printout of what we’ve bought and sold</li><li>the two floating point numbers are the ‘sell and buy units’, i.e. the amounts of BTC and BCH that we buy, given that each new position is always $1000 long and $1000 short.</li></ul><p>Finally we print out a plot showing the realized pnl and a pandas dataframe showing all the tick-by-tick events that have happened:</p><pre>p.get_portfolio()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/960/1*roxO04mJYQ6Jds8xZMNGNQ.png" /><figcaption>plot of Realized pnl in $</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/699/1*wWh4VD-P1o3EBZz9uIk61Q.png" /><figcaption>final printout of all events in backtester</figcaption></figure><p>We made a profit! But remember what we said earlier: the way this algo is structured means that there are no stop losses, only take-profit signals (assuming that stationarity holds).</p><h4>Backtesting pitfalls</h4><blockquote>The purpose of a backtester is to provide a historical simulation of how a strategy would have performed.</blockquote><p>Backtesting is one of the most misunderstood concepts in finance. In the financial literature it is often done badly, with many authors committing structural and statistical errors in their backtesters. Below is a list of the “7 sins of quantitative investing” <a href="http://newyork.qwafafew.org/wp-content/uploads/sites/4/2015/10/Luo_20150128.pdf">by Luo et al [2014]</a>.</p><ul><li><strong>Survivorship bias — </strong>using an investment universe that doesn’t include companies that went bankrupt / delisted. The S&amp;P500 today is different from the S&amp;P500 of 10 years ago.</li><li><strong>Look-ahead bias — </strong>using data that was not available at the time of the simulation.</li><li><strong>Storytelling — </strong>justifying the results after the event or simply selecting the data that fits your predetermined ‘story’.</li><li><strong>Data snooping — </strong>incorporating test data in training data.</li><li><strong>Transaction costs — </strong>simple backtesters don’t account for slippage, costs, fees etc.</li><li><strong>Outliers — </strong>using extreme results with a low probability of ever occurring again.</li><li><strong>Shorting — </strong>related to transaction costs, the cost of selling short is unknown unless you actually made the trade.</li></ul><blockquote>Glancing through this list you may notice something… We have committed most of these sins!</blockquote><p>Not only did we fall prey to some of these pitfalls, we also fell for many more! As an example, remember at the beginning of our analysis we arbitrarily split the data into training | test sets? 
Well, look what happens when we shift our <strong>training </strong>set window forward by just 2 weeks… <strong>it’s no longer cointegrated!</strong></p><pre><strong>#shift training set window forward by 2 weeks</strong><br>shifted_train_btc, shifted_train_bch = df[&#39;btc&#39;].loc[&#39;2017-12-26&#39;:&#39;2018-5-13&#39;], df[&#39;bch&#39;].loc[&#39;2017-12-26&#39;:&#39;2018-5-13&#39;]</pre><pre>t,p,crit = coint(shifted_train_btc,shifted_train_bch)</pre><pre><strong>#test for significance</strong><br>print(p)<br>if p &lt;0.05:<br>    print(&#39;Cointegrated!&#39;)<br>else:<br>    print(&#39;NOT Cointegrated&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/227/1*TPVTozqHY0P11nLrYmedvw.png" /></figure><p>As another example, we know our original linear model, fitted on the training set, is stationary, but what about the test set? After all, this is what counts when running a backtester:</p><p><strong>The test set is not stationary!</strong></p><pre><strong>#run adf again, this time on Test set</strong></pre><pre>plt.plot(z)<br>z_pval = adfuller(z)[1]<br>if z_pval&lt;0.01:<br>    print(z_pval,&quot;Huzzah!, it&#39;s Stationary&quot;)<br>else:<br>    print(z_pval,&quot;:Not stationary&quot;)<br>plt.axhline(z.mean(),color=&#39;r&#39;)<br>plt.axhline(z.mean() + z.std(),color=&#39;b&#39;)<br>plt.axhline(z.mean() - z.std(),color=&#39;b&#39;)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/407/1*FrAhnX5kxv4jQ-sV8eF6zw.png" /><figcaption>test set model is NOT stationary!</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/900/1*kaLapW2mI0Ndrp7k3TbfIA.png" /><figcaption>test set model is NOT cointegrated!</figcaption></figure><p>Clearly we have committed a number of mistakes when it comes to building a statistically robust backtester. The main issues in our case are based around the statistical properties of time-series data. In particular, one window of data can have a particular distribution, while another (close by) window can be completely different!</p><p>There is a labyrinthine rabbit hole we could go down here with backtesting pitfalls, but this post is long enough already.</p>
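<p>One way to see this window-dependence directly is to slide a fixed-length window along the data and re-run the cointegration test at each step. The following is only a rough sketch (it assumes the same <em>df</em> with ‘btc’ and ‘bch’ columns used earlier, and the 120-day window length is an arbitrary choice), but it makes the point that the p-value drifts in and out of significance as the window moves:</p><pre><strong>#rough sketch: rolling cointegration p-values (window length is an arbitrary choice)</strong><br>from statsmodels.tsa.stattools import coint<br><br>window = 120<br>p_vals = []<br>for start in range(len(df) - window):<br>    win = df.iloc[start:start + window]<br>    _, p_val, _ = coint(win[&#39;btc&#39;], win[&#39;bch&#39;])<br>    p_vals.append(p_val)<br><br>plt.plot(df.index[window:], p_vals)<br>plt.axhline(0.05, color=&#39;r&#39;)<br>plt.title(&#39;Rolling cointegration p-value (red line = 0.05)&#39;)<br>plt.show()</pre>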
<h4>Next steps</h4><p>We have learnt the statistical technique of cointegration, along with stationarity and time series analysis. We then learned how to test for it using the Python statsmodels functions <em>coint</em> and <em>adfuller.</em></p><p>We constructed a hypothesis for why BTC and BCH might be cointegrated, tested for it, and built a non-trivial, event-driven backtester using Python.</p><p>Finally we looked at what we did wrong and the potential pitfalls when conducting backtesting.</p><ul><li>Now you should run the backtester using either the Jupyter notebook or directly in the terminal using the .py file, <a href="https://github.com/Patrick-David/BitcoinBacktester">both of which can be found here</a>.</li></ul><p><a href="https://github.com/Patrick-David/BitcoinBacktester">Patrick-David/BitcoinBacktester</a></p><ul><li>For further exploration, try using some of the other crypto pairs available on the <a href="https://min-api.cryptocompare.com/documentation?key=Price&amp;cat=SingleSymbolPriceEndpoint">cryptocompare API</a> to see if they are cointegrated.</li><li>You could extend the logic of the <em>Portfolio </em>class to account for transaction costs.</li><li>You could try a rolling window when testing for stationarity, to ensure the whole series is stationary, not just a select window.</li><li>Currently the logic executes a new buy/sell trade at the same time as the signal is generated; you may want to try generating a buy/sell signal at close of play on one day, then executing the trade the next day (this isn’t necessary for crypto, which trades 24/7, but equities have a fixed trading day).</li><li>I’d love to hear your thoughts on any of the topics discussed!</li><li>Read my other blog posts on statistical testing…</li></ul><p><a href="https://medium.com/@pdquant/stocks-significance-testing-p-hacking-how-volatile-is-volatile-1a0da3064b8a">Stocks, Significance Testing &amp; p-Hacking: How volatile is volatile?</a></p><ul><li>If you are interested in Deep Learning and want to learn how backpropagation works, check out this tutorial…</li></ul><p><a href="https://medium.com/@pdquant/all-the-backpropagation-derivatives-d5275f727f60">All the Backpropagation derivatives</a></p><ul><li>Follow me for more on quant finance, deep learning and more!</li><li>Say hi on twitter at <a href="https://twitter.com/pdquant">twitter.com/pdquant</a></li></ul><p><a href="https://twitter.com/pdquant">Patrick David (@pdquant) | Twitter</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=83e2b19125fd" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Stocks, Significance Testing & p-Hacking: How volatile is volatile?]]></title>
            <link>https://medium.com/@pdquant/stocks-significance-testing-p-hacking-how-volatile-is-volatile-1a0da3064b8a?source=rss-55da8ebe8a7f------2</link>
            <guid isPermaLink="false">https://medium.com/p/1a0da3064b8a</guid>
            <category><![CDATA[finance]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[data-visualization]]></category>
            <dc:creator><![CDATA[Patrick David]]></dc:creator>
            <pubDate>Fri, 19 Oct 2018 21:39:53 GMT</pubDate>
            <atom:updated>2018-12-11T15:35:58.962Z</atom:updated>
            <content:encoded><![CDATA[<h4>October is historically the most volatile month for stocks, but is this a persistent signal or just noise in the data?</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i0X1f5ZwDis4212KzH361g.png" /><figcaption>Ave Monthly Vol Ranking</figcaption></figure><blockquote>Stocks, Significance Testing &amp; p-Hacking.</blockquote><blockquote>Over the past 32 years, October has been the most volatile month on average for the S&amp;P500 and December the least, in this article we will use simulation to assess the statistical significance of this observation and to what extent this observation could occur by chance. All code included!</blockquote><p>This post is now live on DataCamp — <a href="https://www.datacamp.com/community/tutorials/stocks-significance-testing-p-hacking">click here!</a></p><h4>Our goal:</h4><ul><li>Demonstrate how to use Pandas to analyze Time Series</li><li>Understand how to construct a hypothesis test</li><li>Use simulation to perform hypothesis testing</li><li>Show the importance of accounting for multiple comparison bias</li></ul><h4>Our data:</h4><p>We will be using <a href="https://github.com/Patrick-David/Stocks_Significance_PHacking">Daily S&amp;P500 data</a> for this analysis, in particular we will use the raw daily closing prices from 1986 to 2018 (which is surprisingly hard to find so I’ve made it <a href="https://github.com/Patrick-David/Stocks_Significance_PHacking">publicly available</a>).</p><p>The inspiration for this post came from <a href="https://www.winton.com/research/seasonal-volatility-and-the-multiplicity-effect">Winton</a>, which we will be reproducing here, albeit with 32 years of data vs their 87 years.</p><p>For those on Kaggle I’ve created an <a href="https://www.kaggle.com/pdquant/stocks-significance-testing-p-hacking/notebook">interactive Kernel</a> — give it an upvote!</p><h4>Wrangle with Pandas:</h4><p>To answer the question of whether the extreme volatility seen in certain months really is significant, we need to transform our 32yrs of price data into a format that shows the phenomena we are investigating.</p><p>Our format of choice will be <strong>average monthly volatility rankings (AMVR).</strong></p><p>The following code shows how we get our raw price data into this format:</p><pre><strong>#standard imports</strong><br>import numpy as np<br>import pandas as pd<br>import matplotlib.pyplot as plt<br>import matplotlib.patches as mpatches<br>%matplotlib inline<br>import seaborn as sns</pre><p>A nice trick to make charts display full width in Jupyter notebooks:</p><pre><strong>#resize charts to fit screen</strong><br>plt.rcParams[‘figure.figsize’]=[15,5]</pre><p>import our data and convert into daily returns using pct_change</p><pre><strong>#Daily S&amp;P500 data from 1986==&gt;2018</strong><br>url = &quot;<a href="https://raw.githubusercontent.com/Patrick-David/AMVR/master/spx.csv">https://raw.githubusercontent.com/Patrick-David/AMVR/master/spx.csv</a>&quot;<br>df = pd.read_csv(url,index_col=&#39;date&#39;, parse_dates=True)</pre><pre><strong>#To model returns we will use daily % change</strong><br>daily_ret = df[&#39;close&#39;].pct_change()</pre><pre><strong>#drop the 1st value which is a NaN</strong><br>daily_ret.dropna(inplace=True)</pre><pre><strong>#daily %change</strong><br>daily_ret.head()<br></pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/229/1*W0Tr-QeIzFWfy5gXrvcFlQ.png" /></figure><p>Now we use one of the more powerful tools in Pandas: Resample. 
This allows us to change the frequency of our data from daily to monthly and to use standard deviation as a measure of volatility. These are the design choices we have when constructing an analysis.</p><pre><strong>#use pandas to resample returns per month and take standard deviation as measure of Volatility</strong><br><strong>#then annualize by multiplying by sqrt of number of periods (12)</strong><br>mnthly_annu = daily_ret.resample(&#39;M&#39;).std() * np.sqrt(12)</pre><pre>print(mnthly_annu.head())</pre><pre><strong>#we can see major market events show up in the volatility</strong><br>plt.plot(mnthly_annu)<br>plt.axvspan(&#39;1987&#39;,&#39;1989&#39;,color=&#39;r&#39;,alpha=.5)<br>plt.axvspan(&#39;2008&#39;,&#39;2010&#39;,color=&#39;r&#39;,alpha=.5)<br>plt.title(&#39;Monthly Annualized vol - Black Monday and 2008 Financial Crisis highlighted&#39;)<br>labs = mpatches.Patch(color=&#39;red&#39;,alpha=.5, label=&quot;Black Monday &amp; &#39;08 Crash&quot;)<br>plt.legend(handles=[labs])</pre><p>A quick look at the annualized monthly vol shows major market events clearly, such as Black Monday and the 2008 Financial Crisis.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yFj116_LblV96ijsBxR5-w.png" /></figure><p>Here’s where we can use the power of pandas to group our volatility by year and create a ranking for each of the 12 months over all 32 years of data.</p><pre><strong>#for each year, rank each month based on volatility lowest=1 Highest=12</strong><br>ranked = mnthly_annu.groupby(mnthly_annu.index.year).rank()</pre><pre><strong>#average the ranks over all years for each month</strong><br>final = ranked.groupby(ranked.index.month).mean()</pre><pre>final.describe()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/222/1*Tx5LU0ojM7gS_9yZTS0oFQ.png" /></figure><p>This gives our final <strong>Average Monthly Volatility Rankings. </strong>Numerically we can see that month 10 (October) is the highest and 12 (December) is the lowest.</p><pre><strong>#the final average results over 32 years </strong><br>final</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/223/1*bnCcn2uSrBbAXoEIQYVu2A.png" /></figure><p>By plotting our AMVR we can clearly see the most volatile month has been October and the least volatile, December.</p><pre><strong>#plot results for S&amp;P AMVR: clearly October has the highest ave vol rank and December has the lowest. Mean of 6.45 is plotted</strong></pre><pre>b_plot = plt.bar(x=final.index,height=final)<br>b_plot[9].set_color(&#39;g&#39;)<br>b_plot[11].set_color(&#39;r&#39;)<br>for i,v in enumerate(round(final,2)):<br>    plt.text(i+.8,1,str(v), color=&#39;black&#39;, fontweight=&#39;bold&#39;)<br>plt.axhline(final.mean(),ls=&#39;--&#39;,color=&#39;k&#39;,label=round(final.mean(),2))<br>plt.title(&#39;Average Monthly Volatility Ranking S&amp;P500 since 1986&#39;)</pre><pre>plt.legend()<br>plt.show()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*i0X1f5ZwDis4212KzH361g.png" /><figcaption>Ave Monthly Vol Ranking</figcaption></figure><p>So that’s our data, now onto Hypothesis testing…</p><h4>Hypothesis Testing: What’s the question?</h4><p>Hypothesis testing is one of the most fundamental techniques of data science, yet it is one of the most intimidating and misunderstood. 
The basis for this fear is the way it is taught in Stats 101, where we are told to:</p><p><em>perform a t-test, is it a one-sided or two-sided test?, choose a suitable test-statistic such as Welch’s t-test, calculate degrees of freedom, calculate the t score, look up the critical value in a table, compare critical value to t statistic ……</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_yxe_kGGtT9m4qd6ahGohQ.png" /></figure><p>Understandably, this leads to confusion over what test to conduct and how to conduct it. However, all of these classical statistical techniques for performing a hypothesis test were developed at a time when we had very little computing power, and they are simply closed-form analytical solutions for calculating a p-value, that&#39;s it! The added complication is that you need to pick the right formula for the given situation, due to their restrictive and sometimes opaque assumptions.</p><p>But rejoice!</p><p>There is a better way. Simulation.</p><p>To understand how simulation can help us, let’s remind ourselves what a hypothesis test is:</p><p>We wish to test<strong> “whether the observed effect in our data is real or whether it could happen simply by chance”.</strong></p><p>And to perform this test we do the following:</p><ol><li>Choose an appropriate ‘test statistic’: this is simply a number that measures the observed effect. In our case we will choose the<strong> absolute deviation in AMVR from the mean.</strong></li><li>Construct a Null Hypothesis: this is simply a version of the data where the observed effect is not present. In our case we will shuffle the labels of the data repeatedly (<a href="https://speakerdeck.com/jakevdp/statistics-for-hackers"><strong>permutation</strong></a>). The justification for this is detailed below.</li><li>Compute a p-value: this is the probability of seeing the observed effect amongst the null data, in other words, by chance. <strong>We do this through repeated simulation</strong> of the null data. In our case, we shuffle the ‘date’ labels of the data many times and simply count the occurrence of our test statistic as it appears through multiple simulations. A generic sketch of this recipe is shown below.</li></ol><blockquote>That’s hypothesis testing in 3 steps! No matter what phenomena we are testing, the question is always the same: “<strong>is the observed effect real, or is it due to chance?”</strong></blockquote><p><a href="http://allendowney.blogspot.com/2011/05/there-is-only-one-test.html">There is only one test! This great blog by Allen Downey has more details on hypothesis testing</a></p><p><strong>The real power of simulation is that we have to make explicit what our model assumptions are through code</strong>, whereas classical techniques can be a ‘black-box’ when it comes to their assumptions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UW9OJHyVcl0qWWikSv0kpQ.png" /><figcaption>Example: The left plot shows the true data and the observed effect with a certain probability (green). The right plot is our simulated null data with a recording of when the observed effect was seen by chance (red). This is the basis of hypothesis testing: what is the probability of seeing the observed effect in our null data?</figcaption></figure>
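<p>Before we apply this to the AMVR data, here is a minimal, generic sketch of the 3-step recipe above. The numbers and group labels are made up purely for illustration; the real test for our data follows below:</p><pre><strong>#toy permutation test: is the difference in group means real or just chance?</strong><br>import numpy as np<br><br>group_a = np.array([2.1, 2.5, 2.8, 3.0, 2.6])<br>group_b = np.array([2.0, 2.2, 2.4, 2.3, 2.1])<br><br><strong>#1. test statistic: the observed difference in means</strong><br>observed = group_a.mean() - group_b.mean()<br><br><strong>#2. null model: pool the data and shuffle the labels</strong><br>pooled = np.concatenate([group_a, group_b])<br>count = 0<br>n = 10000<br>for _ in range(n):<br>    np.random.shuffle(pooled)<br>    diff = pooled[:5].mean() - pooled[5:].mean()<br>    <strong>#3. count how often the shuffled (null) data looks at least as extreme as the observed effect</strong><br>    if abs(diff) &gt;= abs(observed):<br>        count += 1<br><br>print(&#39;p-value:&#39;, count / n)</pre>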
<p>The most important part of hypothesis testing is being clear about what question we are trying to answer. In our case we are asking:</p><p><strong>“Could the most extreme value happen by chance?”</strong></p><p>We define the most extreme value as the <strong>greatest absolute AMVR deviation from the mean</strong>. This question forms our null hypothesis.</p><p>In our data the most extreme value is the December value (1.23), not the October value (1.08), because we are looking at the biggest absolute deviation from the mean, not simply the highest volatility:</p><pre><strong>#take absolute AMVR from the mean, we see Dec and Oct are the biggest absolute moves (Oct to the upside, Dec to the downside), with Dec being the greatest.</strong></pre><pre>fin = abs(final - final.mean())<br>print(fin.sort_values())<br>Oct_value = fin[10]<br>Dec_value = fin[12]<br>print(&#39;Extreme Dec value:&#39;, Dec_value)<br>print(&#39;Extreme Oct value:&#39;, Oct_value)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/296/1*Rxiz4OXTV9adhrneGyw6tA.png" /></figure><h4>Simulation</h4><p>Now that we know what question we are asking, we need to construct our ‘Null Model’.</p><p>There are a number of options here:</p><ol><li><strong>Parametric models</strong>. If we had a good idea of the data distribution, or simply made assumptions about it, we could use ‘classical’ hypothesis testing techniques, <em>t-test, X², one-way ANOVA etc. </em>These models can be restrictive and something of a blackbox if their assumptions aren’t fully understood by the researcher.</li><li><strong>Direct Simulation. </strong>We could make assumptions about the <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=4&amp;cad=rja&amp;uact=8&amp;ved=2ahUKEwj0qMn97IDeAhViBsAKHUEJDsQQFjADegQIBRAC&amp;url=http%3A%2F%2Fwww.rimini.unibo.it%2Ffanelli%2Feconometric_models2_2012.pdf&amp;usg=AOvVaw1IL6i7LNyZ16Yp_mz7-Ek_">data generating process</a> and simulate directly. For example we could specify an ARMA time series model for the financial data we are modeling and deliberately engineer it to have no seasonality. This could be a reasonable choice for our problem; however, if we knew the data generating process for the S&amp;P500 we would be rich already!</li><li><strong>Simulation through Resampling. </strong>This is the approach we will take. By repeatedly sampling at random from the existing dataset and shuffling the labels, we can make the observed effect equally likely amongst all labels in our data (<em>in our case the labels are the dates</em>), thus giving the desired null dataset.</li></ol><p>Sampling is a big topic but we will focus on one particular technique<strong>, permutation </strong>or shuffling<strong>.</strong></p><p>To get the desired null model, we need to construct a dataset that has <strong>no seasonality</strong> present. If the null is true, <em>that there is no seasonality in the data and the observed effect was simply by chance</em>, then the labels for each month (Jan, Feb etc) are meaningless and therefore we can shuffle the data repeatedly to build up what classical statistics would call the <a href="https://www.youtube.com/watch?time_continue=441&amp;v=5Dnw46eC-0o">‘<em>sampling distribution of the test statistic under the null hypothesis</em>’</a>. 
This has the desired effect of making the observed phenomena (<em>the extreme December value</em>) <strong>equally likely</strong> for all months, which is exactly what our null model requires.</p><p>To prove how powerful simulation techniques are with modern computing power, the code in this example will actually permute the <em>daily </em>price data, which requires lots more processing power, yet still completes in seconds on a modern CPU. Note:<em> shuffling either the daily or the monthly labels will give us the desired null dataset in our case.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-rzHR7JrlTjcXAYp4zKCUA.gif" /><figcaption>Shuffle ‘date’ labels to create null dataset</figcaption></figure><p>A great resource for learning about sampling is by <a href="http://www.resample.com/intro-text-online/">Julian Simon.</a></p><p>Note: <em>The way our test is constructed is the equivalent of a two sided test using ‘classical’ methods, such as Welch’s t-test or ANOVA etc, because we are interested in the most extreme value, either above or below the mean.</em></p><p>These decisions are design choices and we have this freedom because <strong>the Null Model is just that, it’s a Model!</strong> This means we can specify its parameters as we choose, the key is to really be clear what question we are trying to answer.</p><p>Below is the code to simulate 1000 sets of 12 AMVR, permuting the date labels each time to build up the sampling distribution. The output from this code is included below in the p-hacking section…</p><pre><strong>#as our Null is that no seasonality exists or alternatively that the month / day does not matter in terms of AMVR, we can shuffle &#39;date&#39; labels.<br>#for simplicity, we will shuffle the &#39;daily&#39; return data, which has the same effect as shuffling &#39;month&#39; labels</strong></pre><pre>#generate null data</pre><pre>new_df_sim = pd.DataFrame()<br>highest_only = []</pre><pre>count=0<br>n=1000<br>for i in range(n):<br><strong>    #sample same size as dataset, drop timestamp</strong><br>    daily_ret_shuffle = daily_ret.sample(8191).reset_index(drop=True)<br><strong>    #add new timestamp to shuffled data</strong><br>    daily_ret_shuffle.index = (pd.bdate_range(start=&#39;1986-1-3&#39;,periods=8191))<br>    <br>    <strong>#then follow same data wrangling as before...</strong><br>    mnthly_annu = daily_ret_shuffle.resample(&#39;M&#39;).std()* np.sqrt(12)<br>    <br>    ranked = mnthly_annu.groupby(mnthly_annu.index.year).rank()<br>    sim_final = ranked.groupby(ranked.index.month).mean()<br>    <strong>#add each of 1000 sims into df</strong><br>    new_df_sim = pd.concat([new_df_sim,sim_final],axis=1)<br>    <br>    <strong>#also record just highest AMVR for each year (we will use this later for p-hacking explanation)</strong><br>    maxi_month = max(sim_final)<br>    highest_only.append(maxi_month)</pre><pre><strong>#calculate absolute deviation in AMVR from the mean</strong><br>all_months = new_df_sim.values.flatten()<br>mu_all_months = all_months.mean()<br>abs_all_months = abs(all_months-mu_all_months)</pre><pre><strong>#calculate absolute deviation in highest only AMVR from the mean</strong><br>mu_highest = np.mean(highest_only)<br>abs_highest = [abs(x - mu_all_months) for x in highest_only]</pre><h4>p-Hacking</h4><p>Here’s the interesting bit. 
We’ve constructed a hypothesis to test and we’ve generated simulated data by shuffling the ‘date’ labels of the data; now we need to perform our hypothesis test to find the probability of observing a result as significant as the December result, given that the null hypothesis (<em>no seasonality</em>) is true.</p><p>Before we perform the test, let’s set our expectations.</p><p><strong>What’s the probability of seeing <em>at least one</em> significant result given a 5% significance level?</strong></p><p>= 1-p(not significant)</p><p>= 1-(1-0.05)¹²</p><p>= 0.46</p><p>So there’s a <strong>46%</strong> chance of seeing <strong>at least one </strong>month with a significant result, given our null is true.</p><p>Now let’s ask, <strong>for each individual test</strong> (<em>comparing each of the 12 months’ absolute AMVR to the mean</em>)<strong> how many significant values should we expect to see amongst our random, non-seasonal data?</strong></p><p>12 x 0.05 = 0.6</p><p>So with a 0.05 significance level we should expect a false positive rate of<strong> 0.6</strong>. In other words, for each test (with the null data) comparing all 12 months’ AMVR to the mean, 0.6 months will show a significant result (<em>obviously we can’t have less than 1 month showing a result, but under repeated testing the math should tend towards this number</em>).</p><p>All the way through this work, we have stressed the importance of being really clear about the question we are trying to answer. The problem with the expectations we’ve just calculated is that <strong>we have assumed we are testing for a significant result against all 12 months! </strong>That’s why the probability of seeing at least one false positive is so high, at 46%.</p><p>This is an example of <a href="https://www.quantopian.com/lectures/p-hacking-and-multiple-comparisons-bias"><strong>multiple comparison bias</strong></a><strong>, </strong>where we have expanded our search space and increased the likelihood of finding a significant result.<strong> </strong>This is a problem because we could abuse this phenomenon to cherry-pick the parameters of our model which give us the ‘desired’ p-value.</p><blockquote>This is the essence of p-hacking.</blockquote><p>This xkcd nicely highlights the issue with multiple comparison bias and its subsequent abuse, p-hacking.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/540/0*zZgRnBSbwUN3b6D5.png" /><figcaption>Whoa! Green Jelly beans are significant</figcaption></figure><p>To illustrate the effect of p-hacking and how to reduce multiplicity, we need to understand the subtle but significant difference between the following 2 questions:</p><ol><li>“What’s the probability that <strong>December </strong>would appear this extreme by chance?”</li><li>“What’s the probability <strong>any month</strong> would appear this extreme by chance?”</li></ol><p>The beauty of simulation lies in its simplicity. The following code is all we need to compute the p-value to answer the 1st question. We simply count how many values in our dataset of all 12000 AMVR deviations (12 months x 1000 trials) are greater than the observed December value. 
We get a <strong>p-value of 4.4%, </strong>close to our somewhat arbitrary 5% cut off, but <strong>significant </strong>none the less.</p><pre><strong>#count number of months in sim data where abs AMVR is &gt; Dec<br>#Here we are comparing against ALL months</strong><br>count=0<br>for i in abs_all_months:<br>    if i&gt; Dec_value:<br>        count+=1<br>ans = count/len(abs_all_months)        <br>print(&#39;p-value:&#39;, ans )</pre><p>p-value: 0.044</p><p>To answer the 2nd question and to avoid multiplicity, instead of comparing our result to the distribution made with all 12000 AMVR deviations, we only consider the <strong>highest </strong>value from each of the absolute AMVR 1000 trials. This gives a <strong>p-value of 23%</strong>, very much <strong>not significant!</strong></p><pre><strong>#same again but just considering highest abs AMVR for each of 1000 trials</strong><br>count=0<br>for i in abs_highest:<br>    if i&gt; Dec_value:<br>        count+=1<br>ans = count/len(abs_highest)        <br>print(&#39;p-value:&#39;, ans )</pre><p>p-value: 0.234</p><p>Now lets plot these distributions.</p><pre><strong>#calculate 5% significance </strong><br>abs_all_months_95 = np.quantile(abs_all_months,.95)<br>abs_highest_95 = np.quantile(abs_highest,.95)</pre><pre><strong>#plot the answer to Q1 in left column and Q2 in right column</strong></pre><pre>fig, ((ax1,ax2),(ax3,ax4)) = plt.subplots(2,2,sharex=&#39;col&#39;,figsize=(12,12))</pre><pre><strong>#plot 1</strong><br>ax1.hist(abs_all_months,histtype=&#39;bar&#39;)<br>ax1.set_title(&#39;AMVR all months&#39;)<br>ax1.set_ylabel(&#39;Frequency&#39;)<br>n,bins,patches = ax3.hist(abs_all_months,density=1,histtype=&#39;bar&#39;,cumulative=True,bins=30)<br>ax3.set_ylabel(&#39;Cumulative probability&#39;)<br>ax1.axvline(Dec_value,color=&#39;b&#39;,label=&#39;Dec Result&#39;)<br>ax3.axvline(Dec_value,color=&#39;b&#39;)<br>ax3.axvline(abs_all_months_95,color=&#39;r&#39;,ls=&#39;--&#39;,label=&#39;5% Sig level&#39;)</pre><pre><strong>#plot2</strong><br>ax2.hist(abs_highest,histtype=&#39;bar&#39;)<br>ax2.set_title(&#39;AMVR highest only&#39;)<br>ax2.axvline(Dec_value,color=&#39;b&#39;)<br>n,bins,patches = ax4.hist(abs_highest,density=1,histtype=&#39;bar&#39;,cumulative=True,bins=30)<br>ax4.axvline(Dec_value,color=&#39;b&#39;)<br>ax4.axvline(abs_highest_95,color=&#39;r&#39;,ls=&#39;--&#39;)</pre><pre>ax1.legend()<br>ax3.legend()</pre><p>The left column is the data that answers question 1 and the right column, question 2. The top row are the probability distributions and the bottom row are the CDF. The red dashed line is the 5% significance level that we arbitrarily decided upon. The blue line is the original extreme December AMVR value of 1.23.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QkvI0N643QJETMrYSk8frg.png" /><figcaption>probability distributions for Q1 and Q2</figcaption></figure><p>The left side plot shows that the original December value is significant at a 5% level, but only just! 
However, when we account for multiple comparison bias, in the right hand plot, the threshold for significance moves up from around 1.2 (abs AMVR) to around 1.6 (see the red line).</p><blockquote><strong>By accounting for multiple comparison bias, our December value of 1.23 is no longer significant!</strong></blockquote><p>By taking into consideration the specific question we are trying to answer and avoiding multiple comparison bias, we have avoided p-hacking our model and avoided showing a significant result when there isn’t one.</p><p>To further explore p-hacking and how it can be abused to tell a particular story about our data, see this great interactive app from FiveThirtyEight…</p><p><a href="https://fivethirtyeight.com/features/science-isnt-broken/#part2">Science Isn&#39;t Broken</a></p><h4>Conclusions</h4><p>We have learnt that hypothesis testing is not the big scary beast we thought it was. Simply follow the 3 steps above to construct your model for any kind of data or test statistic.</p><p>We’ve shown that asking the right question is vital for scientific analysis. A slight change in the wording can lead to a very different model with very different results.</p><p>We discussed the importance of recognizing and correcting for multiple comparison bias and avoiding the pitfalls of p-hacking, and showed how a seemingly significant result can become non-significant.</p><p>With more and more ‘big data’, along with academic pressure to produce a research paper with ‘novel’ findings or political pressure to show a result as being ‘significant’, the temptation for p-hacking is ever increasing. By learning to recognise when we are guilty of it and correcting for it accordingly, we can become better researchers and ultimately produce more accurate and therefore actionable scientific results!</p><p>Author’s Notes: <em>Our results differ slightly from the original </em><a href="https://www.winton.com/research/seasonal-volatility-and-the-multiplicity-effect"><em>Winton research;</em></a><em> this is due in part to having a slightly different data set (32yrs vs 87yrs), and they have October as the month of interest whereas we have December. Also, they used an undisclosed method for their ‘simulated data’, whereas we have made our methodology for creating that data explicit through code. We have made certain modeling assumptions throughout this work; again, these have been made explicit and can be seen in the code. These design and modeling choices are part of the scientific process; so long as they are made explicit, the analysis has merit.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*D0_bUZNtw_3aq_CyMqj0dw.gif" /></figure><p>Hope you found this useful and interesting. Follow to be notified of my latest posts!</p><p>Follow <a href="http://twitter.com/pdquant">twitter.com/pdquant</a> for more!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1a0da3064b8a" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[All the Backpropagation derivatives]]></title>
            <link>https://medium.com/@pdquant/all-the-backpropagation-derivatives-d5275f727f60?source=rss-55da8ebe8a7f------2</link>
            <guid isPermaLink="false">https://medium.com/p/d5275f727f60</guid>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[calculus]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Patrick David]]></dc:creator>
            <pubDate>Thu, 07 Jun 2018 19:06:13 GMT</pubDate>
            <atom:updated>2019-03-14T15:57:46.891Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/781/1*BbnX1QQ5Qisdyg8ufSVfsg.gif" /></figure><p>So you’ve completed Andrew Ng’s <a href="http://www.coursera.org/specializations/deep-learning">Deep Learning course</a> on Coursera,</p><p>You know that <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=1&amp;cad=rja&amp;uact=8&amp;ved=0ahUKEwi4ycLk-b_bAhWCDcAKHfdfCN0QtwIIJzAA&amp;url=https%3A%2F%2Fwww.coursera.org%2Flearn%2Fneural-networks-deep-learning%2Flecture%2FMijzH%2Fforward-propagation-in-a-deep-network&amp;usg=AOvVaw0u4bgTyJGhdxdZlr4EVT8x">ForwardProp </a>looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/205/1*PjrDe-iJAGOJ3Lo0WnDZcQ.png" /><figcaption>Forwardpropagation Equations</figcaption></figure><p>And you know that <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=3&amp;cad=rja&amp;uact=8&amp;ved=0ahUKEwjE5oD4-b_bAhWHI8AKHbeVCxkQFggvMAI&amp;url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FBackpropagation&amp;usg=AOvVaw3VzcK_f9TmfUd9PbbMS5tb">Backprop </a>looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/476/1*ohfvtAN6AOV8mhBwr4PApQ.png" /><figcaption>Backprop Equations</figcaption></figure><p>But do you know how to derive these formulas?</p><h3>TL;DR</h3><p><strong><em>Full derivations of all Backpropagation derivatives used in Coursera Deep Learning, using both chain rule and direct computation.</em></strong></p><p>If you’ve been through backpropagation and not understood how results such as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/86/1*UMDs_vXLrbQiM78yZGav9w.png" /><figcaption>The derivative of our linear function - dz</figcaption></figure><p>and</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/320/0*O5wMJE1HH8orsug4" /><figcaption>derivative of Cost w.r.t activation ‘a’</figcaption></figure><p>are derived, if you want to understand the direct computation as well as simply using chain rule, then read on…</p><h3>Our Neural Network</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/601/1*1AgQfh5ueWaV0-htUFQrOg.png" /><figcaption>Neural Net taken from Coursera Deep Learning.</figcaption></figure><p>This is the simple <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=2&amp;cad=rja&amp;uact=8&amp;ved=0ahUKEwjK2pKigcDbAhUpIMAKHRUQDywQFgg1MAE&amp;url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FArtificial_neural_network&amp;usg=AOvVaw0CMek9gs4Tdr7H_YcH9z1H">Neural Net</a> we will be working with, where x,W and b are our inputs, the “z’s” are the linear function of our inputs, the “a’s” are the (<a href="https://en.wikipedia.org/wiki/Sigmoid_function">sigmoid</a>) activation functions and the final</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/102/1*4-WXVWPxMbkS5kVMlnZ2rg.png" /><figcaption>Cross Entropy cost function</figcaption></figure><p>is our <a href="https://math.stackexchange.com/questions/1074276/how-is-logistic-loss-and-cross-entropy-related">Cross Entropy or Negative Log Likelihood </a>cost function.</p><p>So here’s the plan, we will work backwards from our cost function</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/602/1*tIuzCy1NspdXVJkyl7FsYA.png" /><figcaption>Our cost function</figcaption></figure><p>and compute directly, the derivative of</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/102/1*4-WXVWPxMbkS5kVMlnZ2rg.png" /><figcaption>Cross Entropy 
cost function</figcaption></figure><p>with respect to (<em>w.r.t</em>) each of the preceding elements in our Neural Network:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/212/1*9nTTDAgPgYsuXO4DGAKyFA.png" /><figcaption>The derivatives of L(a,y) w.r.t each element in our NN</figcaption></figure><p>As well as computing these values <em>directly</em>, we will also show the <em>chain rule </em>derivation as well.</p><p><strong># Note: we don’t differentiate our input ‘X’ because these are fixed values that we are given and therefore don’t optimize over.</strong></p><h3>[1] Derivative w.r.t activation function</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/44/1*DltYsgL0XYevOzZssqi5jw.png" /><figcaption>[1] derivative of our activation function</figcaption></figure><p>So to start we will take the derivative of our cost function</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/102/1*4-WXVWPxMbkS5kVMlnZ2rg.png" /></figure><p>w.r.t the activation function</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/53/1*PX_yILBXYhgkMn21D0-Gew.png" /><figcaption>Activation function 2</figcaption></figure><p>So we are taking the derivative of the Negative log likelihood function (Cross Entropy) , which when expanded looks like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/503/1*b7PplEufrAWKj33dYZSQug.png" /><figcaption>Taking derivative of our cost function</figcaption></figure><p>First lets move the minus sign on the left of the brackets and distribute it inside the brackets, so we get:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/447/0*SB1t4U8rMugokzYH" /><figcaption>distribute minus sign</figcaption></figure><p>Next we differentiate the left hand side:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/77/0*aoHxqwUu420XKTPc" /><figcaption>l.h.s</figcaption></figure><p>The right hand side is more complex as the derivative of ln(1-a) is not simply 1/(1-a), we must use chain rule to multiply the derivative of the inner function by the outer.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/352/1*v8AiMXAwJTfjKbkuYyQvug.png" /><figcaption>the derivative of a log</figcaption></figure><p>The derivative of (1-a) = -1, this gives the final result:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/382/0*Wu1IzeBhz9qHxOTC" /><figcaption>derivative of L w.r.t activation ‘a’</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/320/0*QjgVAf3bbE3Jr5t6" /><figcaption>final result</figcaption></figure><p>And the proof of the derivative of a log being the inverse is as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/479/1*MLpNREFGLiLZO-jNsW1BrA.png" /><figcaption>proof for the derivative of a log</figcaption></figure><h3>[2] Derivative of sigmoid</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/44/1*H5S49BM3FLX6bM-kdRE5xw.png" /><figcaption>[2] derivative of sigmoid</figcaption></figure><p>It is useful at this stage to compute the derivative of the sigmoid activation function, as we will need it later on.</p><p>our logistic function (sigmoid) is given as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/64/1*jgS1pQwkbqhzQO_pIa8nCw.png" /><figcaption>Sigmoid (Logistic) function</figcaption></figure><p>First is is convenient to rearrange this function to the following form, as it allows us to use the chain rule to differentiate:</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/143/1*aYP4xhbyl5_9xexleubXFw.png" /><figcaption>Rearranged sigmoid function</figcaption></figure><p>Now using chain rule: multiplying the outer derivative by the inner, gives</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/330/1*kWMWFPIcQ_jysiojAZQdLA.png" /><figcaption>outer derivative x inner derivative</figcaption></figure><p>which rearranged gives</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/89/1*Qsz4EhQsd_2-oVJUqIXpiQ.png" /><figcaption>put RHS over LHS</figcaption></figure><p>Here’s the clever part. We can then separate this into the product of two fractions and with a bit of algebraic magic, we add a ‘1’ to the second numerator and immediately take it away again:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/196/1*EBgQhSoMbHgddm3toweXNw.png" /><figcaption>add a ‘1’ and subtract a ‘1’ on RHS</figcaption></figure><p>The RHS then simplifies to</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/137/1*HwKjqROO67Kd9alEYJBsAQ.png" /></figure><p>Which is nothing more than</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/155/1*uAjvhR9RqGvxzWadbboiZg.png" /><figcaption>1 minus our sigmoid</figcaption></figure><p>Which gives a final result of</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/260/1*ZDqKPbX4r3CDOg5sQHIF8A.png" /></figure><p>Or alternatively:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/109/1*vMnOX25lIxz0HhBmtQKaxw.png" /><figcaption>This notation will be easier</figcaption></figure><h3>[3] Derivative w.r.t linear function</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/44/1*aflWeoRQggpkanuV8Ll_vA.png" /><figcaption>[3] derivative of our linear function (z = wX + b)</figcaption></figure><p>To get this result we can use chain rule by multiplying the two results we’ve already calculated [1] and [2]</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/200/1*3SVjbCwFogIv2r8tVQOlEw.png" /><figcaption>multiply derivative [1] by derivative [2]</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/390/1*-s_gqPJoTbIBS8HVkrgBgw.png" /><figcaption>der[1] x der[2]</figcaption></figure><p>So if we can get a common denominator in the left hand of the equation, then we can simplify the equation, so lets add ‘(1-a)’ to the first fraction and ‘a’ to the second fraction</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/565/1*fgDN5qv4O-Aah_e7S9oCBg.png" /><figcaption>add ‘(1–a)’ and ‘a’ to get common denominator</figcaption></figure><p>with a common denominator we can simplify to</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/550/1*Fxnz9dbvaQIgtoxKEG4DQQ.png" /><figcaption>common denominator</figcaption></figure><p>now we multiply LHS by RHS, the a(1-a) terms cancel out and we are left with just the numerator from the LHS!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/361/1*F_zz8QMZKnPklTb4M35jCw.png" /><figcaption>the remaining numerator</figcaption></figure><p>which if we expand out gives:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/277/1*sikT7LfnDaY6ugSRoJItYA.png" /><figcaption>expanded out</figcaption></figure><p>note that ‘ya’ is the same as ‘ay’, so they cancel to give</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/110/1*Ka9uzbmSgBCY9SYata0uMw.png" /></figure><p>which rearranges to give our final result of the derivative</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/44/1*aflWeoRQggpkanuV8Ll_vA.png" /><figcaption>[3]</figcaption></figure><p>our final result is</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/89/1*x1CdzcXePWPbe_UDFesVJQ.png" /><figcaption>derivative of our linear function (z = wX +b)</figcaption></figure><h3>[4] Derivative w.r.t weights</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/53/1*QRtUbub_wbV50A6MIUr5zQ.png" /><figcaption>[4] derivative of linear func ‘z’ w.r.t weights ‘w’</figcaption></figure><p>This derivative is trivial to compute, as z is simply</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/151/1*SzgaoykS7unrnn4_DcEGXg.png" /><figcaption>linear function ‘z’</figcaption></figure><p>and the derivative simply evaluates to</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/22/1*nLzvPtvLt3ousFBXhOBo-w.png" /><figcaption>derivative of ‘z’ w.r.t ‘w’</figcaption></figure><h3>[5] Derivative w.r.t weights (2)</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/53/1*ZldlktnFepaqaKngd3mbXg.png" /><figcaption>[5] derivative of cost func w.r.t weights ‘w’</figcaption></figure><p>This derivative can be computed <strong>two different ways!</strong> We can use <strong>chain rule </strong>or <strong>compute directly</strong>. We will do both as it provides a great intuition behind backprop calculation.</p><p>To use chain rule to get derivative [5] we note that we have already computed the following</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/181/1*_PD-ma_eeMW9F7Myfw1WwA.png" /><figcaption>previously computed</figcaption></figure><p>Noting that the product of the first two equations gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/44/1*aflWeoRQggpkanuV8Ll_vA.png" /></figure><p>if we then continue using the chain rule and multiply this result by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/53/1*QRtUbub_wbV50A6MIUr5zQ.png" /></figure><p>then we get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/98/1*a1HWkDKLTPvplVFKruszyg.png" /></figure><p>which is nothing more than</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/135/1*8kFYcl9B92Q8sSzLl3rt7A.png" /><figcaption>The final result for ‘dw’</figcaption></figure><p>or written out long hand</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/288/1*RjvusiVI271tcVOG3r2iVQ.png" /><figcaption>chain rule result for ‘dw’</figcaption></figure><p>So that’s the ‘<em>chain rule way</em>’. 
Now lets compute ‘dw’ <em>directly</em>:</p><p>To compute <a href="https://math.stackexchange.com/questions/2200339/why-are-terms-flipped-in-partial-derivative-of-logistic-regression-cost-function?noredirect=1&amp;lq=1"><strong>directly</strong></a>, we first take our cost function</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/500/1*yastBIEP77h5PxFhuLyyjA.png" /><figcaption>Cross Entropy cost function</figcaption></figure><p>We can notice that the first log term ‘ln(a)’ can be expanded to</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/215/1*rnCRIgDtgF9ff0qbfaS4zQ.png" /><figcaption>expanding ‘ln(a)’</figcaption></figure><p>Which simplifies to:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/205/1*23svBASU1mfPyRo5FohY6g.png" /></figure><p>And if we take the second log function ‘ln(1-a)’ which can be shown as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/181/1*VaNR87Ss6UKTYwzyTVtXBg.png" /><figcaption>ln(1-a)</figcaption></figure><p>taking the log of the numerator ( we will leave the denominator) we get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/299/1*u-0aAseYLJaB3zDeGqcQ-A.png" /><figcaption>log of the numerator</figcaption></figure><p>This result comes from the <a href="https://people.richland.edu/james/lecture/m116/logs/properties.html">rule of logs</a>, which states: log(p/q) = log(p) — log(q).</p><p>Plugging these formula back into our original cost function we get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/698/1*od0TgKCRaIB_rbL0rmqkUA.png" /><figcaption>plugged back into cost function</figcaption></figure><p>Expanding the term in the square brackets we get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/698/1*Ni3nw777S10JOOcuRz-FQA.png" /><figcaption>terms inside bracket expanded</figcaption></figure><p>The first and last terms ‘yln(1+e^-z)’ cancel out leaving:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/290/1*d2XoNY3_WQaqkabOGjrp3w.png" /></figure><p>Which we can rearrange by pulling the ‘yz’ term to the outside to give</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/288/1*eErHdOn075ketIbWaDfDYw.png" /></figure><p><a href="https://math.stackexchange.com/questions/477207/derivative-of-cost-function-for-logistic-regression">Here’s where it gets interesting</a>, by adding an exp term to the ‘z’ inside the square brackets and then immediately taking its log</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/348/1*INdtRwDXab6DGWAaxcUHmw.png" /><figcaption>we exponentiate ‘e^z’ then take its log</figcaption></figure><p>next we can take advantage of the rule of sum of logs: ln(a) + ln(b) = ln(a.b) combined with <a href="https://www.rapidtables.com/math/number/exponent.html">rule of exp</a> products:e^a * e^b = e^(a+b) to get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/312/0*vibld17a0decPWpl" /><figcaption>ln(a) + ln(b) = ln(a.b)</figcaption></figure><p>followed by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/216/1*b5e7hQQhaUmyJ5nhge5ycQ.png" /><figcaption>add e^(z +-z)</figcaption></figure><p>Pulling the ‘yz’ term inside the brackets we get :</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/214/1*4Ssc6ZC5jIJ6frK_paHjSA.png" /></figure><p>Finally we note that z = Wx+b therefore taking the derivative w.r.t W:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/254/1*QY9qWlmyHomXLFHM5JjuDA.png" /><figcaption>take derivative w.r.t 
W</figcaption></figure><p>The first term ‘yz ’becomes ‘yx ’and the second term becomes :</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/138/1*Esa8tUE6m8OopzwqZnAoMQ.png" /><figcaption>taking derivative of logs again</figcaption></figure><p>Note that the 2nd term is nothing but</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/158/0*jSx2tE0QrSiegCZt" /></figure><p>Which gives a final result of</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/240/1*HLj5aJvLStxxQ6hKmdmuhA.png" /></figure><p>We can rearrange by pulling ‘x’ out to give</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/202/1*F0VbwuBH2BnbEWen2c-cng.png" /></figure><p>which gives</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/202/1*idsOS879a975Ph6lmRJ0Hw.png" /><figcaption>final result</figcaption></figure><h3>[6] derivative w.r.t bias</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/45/1*EzAUoYXl4s3WBzweRoPH5w.png" /><figcaption>[6] derivative w.r.t bias b</figcaption></figure><p>Again we could use <strong>chain rule</strong> which would be</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/124/1*3PzykxuCGOJtSKXxdaPc6A.png" /><figcaption>chain rule for ‘db’</figcaption></figure><p>This is easy to solve as we already computed ‘dz’ and the second term is simply the derivative of ‘z’ which is ‘wX +b’ w.r.t ‘b’ which is simply 1!</p><p>so the derivative w.r.t b is simply</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/44/1*aflWeoRQggpkanuV8Ll_vA.png" /></figure><p>which we already calculated earlier as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/89/1*x1CdzcXePWPbe_UDFesVJQ.png" /><figcaption>derivative of our linear function (z = wX +b)</figcaption></figure><p>For completeness we will also show how to calculate ‘db’ <strong>directly</strong>. To calculate this we will take a step from the above calculation for ‘dw’, (from just before we did the differentiation)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/214/1*4Ssc6ZC5jIJ6frK_paHjSA.png" /><figcaption>note: z = wX + b</figcaption></figure><p>remembering that z = wX +b and we are trying to find derivative of the function w.r.t b, if we take the derivative w.r.t b from both terms ‘yz’ and ‘ln(1+e^z)’ we get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/573/0*VgWLH92SBmRefdWY" /><figcaption>note the parenthesis</figcaption></figure><p>its important to note the parenthesis here, as it clarifies how we get our derivative.</p><p>Taking the LHS first, the derivative of ‘wX’ w.r.t ‘b’ is zero as it doesn’t contain b! 
The derivative of ‘b’ is simply 1, so we are just left with the ‘y’ outside the parenthesis.</p><p>For the RHS, we do the same as we did when calculating ‘dw’, except this time when taking the derivative of the inner function ‘e^wX+b’ we take it w.r.t ‘b’ (instead of ‘w’), which gives the following result (this is because the derivative of the exponent w.r.t ‘b’ evaluates to 1):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/183/0*cb4NJfkuwH4sTYHT" /><figcaption>derivative of ln(1+e^wX+b)</figcaption></figure><p>This term is simply our original</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/22/0*26FBFOlqMNtPj859" /></figure><p>so putting the whole thing together we get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/117/0*gKj2h5-ZDBFlawt1" /><figcaption>final result</figcaption></figure><p>which we have already shown is simply ‘dz’!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/44/1*aflWeoRQggpkanuV8Ll_vA.png" /><figcaption>‘db’ = ‘dz’</figcaption></figure><p>So that concludes all the derivatives of our Neural Network. <strong>We have calculated all of the following:</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/212/1*9nTTDAgPgYsuXO4DGAKyFA.png" /><figcaption>The derivatives of L(a,y) w.r.t each element in our NN</figcaption></figure><h3>Wrapping up</h3><p>And what about the result:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/467/1*qGRGLI_8rT8gz9gIGWLyaQ.png" /></figure><p>Well, we can unpack the chain rule to explain:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/358/1*XfxKWSUDWtEtpRMkkUfaFQ.png" /><figcaption>‘dz’ using chain rule</figcaption></figure><p>Note that the term</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/125/1*Kwvases1XeQ7LbO4LrQGig.png" /></figure><p>is simply ‘dz’, the term we calculated earlier:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/44/1*aflWeoRQggpkanuV8Ll_vA.png" /></figure><p>and the term</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/73/1*86v4ry31gCoIC1KAIoX4LQ.png" /></figure><p>evaluates to W[l], or in other words, the derivative of our linear function Z = ‘Wa + b’ w.r.t ‘a’ equals ‘W’.</p><p>and finally the term in blue</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/75/1*8XKwzs1j6A6ySKPuKWcS-g.png" /></figure><p>is simply</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/44/1*H5S49BM3FLX6bM-kdRE5xw.png" /><figcaption>[2] derivative of sigmoid</figcaption></figure><p>‘da/dz’, the derivative of the sigmoid function that we calculated earlier!</p><p>As a final note on the notation used in the Coursera Deep Learning course: in the result</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/458/1*3FhcaGe98ZyPCGHMbUXF8Q.png" /></figure><p>we perform element-wise multiplication between DZ and g’(Z); this is to ensure that all the dimensions of our matrix multiplications match up as expected.</p><h3>So there we have it…</h3><p>… all the derivatives required for backprop as shown in Andrew Ng’s Deep Learning course.</p>
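<p>As a quick sanity check on the results above, here is a small numpy sketch. It is a single logistic unit with made-up numbers (not the full Coursera network): it computes da, dz, dw and db from the formulas we derived, and compares dw against a finite-difference estimate of the cost:</p><pre>import numpy as np<br><br>def sigmoid(z):<br>    return 1 / (1 + np.exp(-z))<br><br>def loss(a, y):<br>    return -(y * np.log(a) + (1 - y) * np.log(1 - a))<br><br><strong>#toy values for a single training example (illustration only)</strong><br>x, y = 1.5, 1.0<br>w, b = 0.3, -0.1<br><br>z = w * x + b<br>a = sigmoid(z)<br><br><strong>#derivatives from the derivations above</strong><br>da = -y / a + (1 - y) / (1 - a)   #result [1]<br>dz = a - y                        #result [3]<br>dw = dz * x                       #result [5]<br>db = dz                           #result [6]<br><br><strong>#finite-difference check on dw</strong><br>eps = 1e-6<br>a_plus = sigmoid((w + eps) * x + b)<br>a_minus = sigmoid((w - eps) * x + b)<br>dw_numeric = (loss(a_plus, y) - loss(a_minus, y)) / (2 * eps)<br><br>print(&#39;analytic dw:&#39;, dw, &#39;numeric dw:&#39;, dw_numeric)</pre>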
<p>Simply reading through these calculus calculations (<em>or any others for that matter</em>) won’t be enough to make it stick in your mind. The best way to learn is to lock yourself in a room and <strong>practice, practice, practice!</strong></p><h3>What next?</h3><blockquote>If you got something out of this post, please share it with others who may benefit, follow me <a href="https://medium.com/u/55da8ebe8a7f">Patrick David</a> for more ML posts or on twitter @pdquant, and give it a cynical/pity/genuine round of <strong>applause</strong>!</blockquote><h4>Stocks Significance Testing &amp; p-Hacking</h4><p><a href="https://medium.com/@pdquant/stocks-significance-testing-p-hacking-how-volatile-is-volatile-1a0da3064b8a">Stocks, Significance Testing &amp; p-Hacking: How volatile is volatile?</a></p><h4>Build a Bit(Cointegration) Backtester</h4><p><a href="https://medium.com/@pdquant/build-a-bitcoin-tegration-backtester-83e2b19125fd">Build a BitCoin(tegration) Backtester</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d5275f727f60" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>