<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Ryan Burn on Medium]]></title>
        <description><![CDATA[Stories by Ryan Burn on Medium]]></description>
        <link>https://medium.com/@ryan.burn?source=rss-f55ad0a8217------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*v2jvthQJ4-zUQTzn5JlLfg.jpeg</url>
            <title>Stories by Ryan Burn on Medium</title>
            <link>https://medium.com/@ryan.burn?source=rss-f55ad0a8217------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 08 Apr 2026 00:32:16 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@ryan.burn/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Constructing Default Priors for Bayesian Hypothesis Testing: A Reference Analysis Approach]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science-collective/constructing-default-priors-for-bayesian-hypothesis-testing-a-reference-analysis-approach-843166ff09b6?source=rss-f55ad0a8217------2"><img src="https://cdn-images-1.medium.com/max/1920/1*h6tZ6ynIZ6by1zICjcwzjw.jpeg" width="1920"></a></p><p class="medium-feed-snippet">Consider these birth records from Paris collected between 1745 and 1770:</p><p class="medium-feed-link"><a href="https://medium.com/data-science-collective/constructing-default-priors-for-bayesian-hypothesis-testing-a-reference-analysis-approach-843166ff09b6?source=rss-f55ad0a8217------2">Continue reading on Data Science Collective »</a></p></div>]]></description>
            <link>https://medium.com/data-science-collective/constructing-default-priors-for-bayesian-hypothesis-testing-a-reference-analysis-approach-843166ff09b6?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/843166ff09b6</guid>
            <category><![CDATA[probability]]></category>
            <category><![CDATA[mathematics]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[bayesian-statistics]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Mon, 31 Mar 2025 02:25:41 GMT</pubDate>
            <atom:updated>2025-03-31T02:25:41.915Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[How to Apply the Central Limit Theorem to Constrained Data]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/how-to-apply-the-central-limit-theorem-to-constrained-data-3dbc20bceeaa?source=rss-f55ad0a8217------2"><img src="https://cdn-images-1.medium.com/max/1500/1*tv870pIfZMm0fv7c9AB-9A.png" width="1500"></a></p><p class="medium-feed-snippet">What can we say about the mean of data distributed in an interval [a, b]?</p><p class="medium-feed-link"><a href="https://medium.com/data-science/how-to-apply-the-central-limit-theorem-to-constrained-data-3dbc20bceeaa?source=rss-f55ad0a8217------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/how-to-apply-the-central-limit-theorem-to-constrained-data-3dbc20bceeaa?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/3dbc20bceeaa</guid>
            <category><![CDATA[thoughts-and-theory]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[probability]]></category>
            <category><![CDATA[bayesian-statistics]]></category>
            <category><![CDATA[mathematics]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Tue, 10 Dec 2024 19:58:16 GMT</pubDate>
            <atom:updated>2024-12-10T20:04:04.679Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Using Objective Bayesian Inference to Interpret Election Polls]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/using-objective-bayesian-inference-to-interpret-election-polls-3de2d4354989?source=rss-f55ad0a8217------2"><img src="https://cdn-images-1.medium.com/max/2048/1*nl0FEf6CXz-SjjjWsDeE2w.png" width="2048"></a></p><p class="medium-feed-snippet">How to build a polls-only objective Bayesian model that goes from a state polling lead to probability of winning the state</p><p class="medium-feed-link"><a href="https://medium.com/data-science/using-objective-bayesian-inference-to-interpret-election-polls-3de2d4354989?source=rss-f55ad0a8217------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/using-objective-bayesian-inference-to-interpret-election-polls-3de2d4354989?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/3de2d4354989</guid>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[elections]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[bayesian-statistics]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Wed, 30 Oct 2024 00:16:17 GMT</pubDate>
            <atom:updated>2024-10-30T00:16:17.448Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Comparing Sex Ratios: Revisiting a Famous Statistical Problem from the 1700s]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/comparing-sex-ratios-revisiting-a-famous-statistical-problem-from-the-1700s-720cd57872c6?source=rss-f55ad0a8217------2"><img src="https://cdn-images-1.medium.com/max/1600/1*NzY7HeG_u8CGfX8nbeDOCQ.png" width="1600"></a></p><p class="medium-feed-snippet">What can we say about the difference of two binomial distribution probabilities</p><p class="medium-feed-link"><a href="https://medium.com/data-science/comparing-sex-ratios-revisiting-a-famous-statistical-problem-from-the-1700s-720cd57872c6?source=rss-f55ad0a8217------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/comparing-sex-ratios-revisiting-a-famous-statistical-problem-from-the-1700s-720cd57872c6?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/720cd57872c6</guid>
            <category><![CDATA[mathematics]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[probability]]></category>
            <category><![CDATA[bayesian-statistics]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Fri, 09 Aug 2024 20:04:03 GMT</pubDate>
            <atom:updated>2024-08-09T20:04:03.476Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[How to Efficiently Approximate a Function of One or More Variables]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/how-to-efficiently-approximate-a-function-of-one-or-more-variables-fc702c9c9431?source=rss-f55ad0a8217------2"><img src="https://cdn-images-1.medium.com/max/1400/1*gOwWFJqYxETyYdLWgrzkaw.png" width="1400"></a></p><p class="medium-feed-snippet">Use sparse grids and Chebyshev interpolants to build accurate approximations to multivariable functions.</p><p class="medium-feed-link"><a href="https://medium.com/data-science/how-to-efficiently-approximate-a-function-of-one-or-more-variables-fc702c9c9431?source=rss-f55ad0a8217------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/how-to-efficiently-approximate-a-function-of-one-or-more-variables-fc702c9c9431?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/fc702c9c9431</guid>
            <category><![CDATA[math]]></category>
            <category><![CDATA[algorithms]]></category>
            <category><![CDATA[numerical-analysis]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[deep-dives]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Fri, 28 Jun 2024 23:05:11 GMT</pubDate>
            <atom:updated>2024-06-29T03:24:17.837Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Introduction to Objective Bayesian Hypothesis Testing]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/introduction-to-objective-bayesian-hypothesis-testing-06c9e98eb90b?source=rss-f55ad0a8217------2"><img src="https://cdn-images-1.medium.com/max/1600/1*_tevXbtYsWCPtFSLuXw2Iw.jpeg" width="1600"></a></p><p class="medium-feed-snippet">How to Derive Posterior Probabilities for Hypotheses using Default Bayes Factors</p><p class="medium-feed-link"><a href="https://medium.com/data-science/introduction-to-objective-bayesian-hypothesis-testing-06c9e98eb90b?source=rss-f55ad0a8217------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/introduction-to-objective-bayesian-hypothesis-testing-06c9e98eb90b?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/06c9e98eb90b</guid>
            <category><![CDATA[bayesian-statistics]]></category>
            <category><![CDATA[probability]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[statistics]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Tue, 11 Jun 2024 23:50:13 GMT</pubDate>
            <atom:updated>2024-06-11T23:50:13.700Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[An Introduction to Objective Bayesian Inference]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/an-introduction-to-objective-bayesian-inference-cc20c1a0836e?source=rss-f55ad0a8217------2"><img src="https://cdn-images-1.medium.com/max/1500/1*g_IX7wE6-KFSWKWQ9_Sxwg.png" width="1500"></a></p><p class="medium-feed-snippet">How to calculate probability when &#x201C;we absolutely know nothing antecedently to any trials made&#x201D; (Bayes, 1763)</p><p class="medium-feed-link"><a href="https://medium.com/data-science/an-introduction-to-objective-bayesian-inference-cc20c1a0836e?source=rss-f55ad0a8217------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/an-introduction-to-objective-bayesian-inference-cc20c1a0836e?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/cc20c1a0836e</guid>
            <category><![CDATA[mathematics]]></category>
            <category><![CDATA[bayesian-statistics]]></category>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[probability]]></category>
            <category><![CDATA[statistics]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Tue, 23 Apr 2024 04:30:06 GMT</pubDate>
            <atom:updated>2024-05-24T05:06:27.655Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Logistic Regression and the Missing Prior]]></title>
            <link>https://medium.com/data-science/logistic-regression-and-the-missing-prior-7c36039af851?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/7c36039af851</guid>
            <category><![CDATA[logistic-regression]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[statistics]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Thu, 10 Mar 2022 14:22:57 GMT</pubDate>
            <atom:updated>2022-03-12T22:50:57.270Z</atom:updated>
            <content:encoded><![CDATA[<h4>How to reduce the bias of logistic regression using Jeffreys Prior</h4><p>Suppose <strong>X</strong> denotes a matrix of regressors and <strong>y</strong>,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*wkRADfHKAZrkA5Es4kGXVw@2x.png" /></figure><p>denotes a vector of target values.</p><p>If we hypothesize that the data is generated from a logistic regression model, then our belief in weights <strong>w</strong> after seeing the data is given by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*j1YYoaI099Gv-MOLIqgoWQ@2x.png" /></figure><p>Here, P(<strong>y</strong>|<strong>X</strong>, <strong>w</strong>) is called the likelihood and is given by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*O6QjgRL5Ux2BmjuMD_b3uQ@2x.png" /></figure><p>And P(<strong>w</strong>) is called the prior and quantifies our belief in <strong>w</strong> before seeing data.</p><p>Frequently, logistic regression is fit to maximize P(<strong>y</strong>|<strong>X</strong>, <strong>w</strong>) without regard to the prior</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ZQ6VQTWIg9POTDjQlyZlAA@2x.png" /></figure><p>and <strong>w</strong>_ML is what we call a <em>maximum likelihood</em> estimate for the weights.</p><p>But disregarding the prior is generally a bad idea. Let’s see why.</p><h4>What’s Wrong with Maximum Likelihood Models</h4><p>Consider a concrete example.</p><p>Suppose you’re a biologist researching wolves. As part of your work, you’ve been tracking wolf packs and observing their hunting behavior.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3rJPM84L8qsYaC3GMdVCqw.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@bonopeppers?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Thomas Bonometti</a> on <a href="https://unsplash.com/s/photos/wolf-pack?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>You’d like to understand what makes a hunt successful. To that end, you fit a logistic regression model to predict whether a hunt will succeed or fail. It would be interesting to consider factors such as pack size and prey type, but to keep it simple you start with only a bias term.</p><p>Let <em>n</em> denote the total number of hunts. If <em>b</em> denotes the value of our<br>bias term, then the likelihood of observing <em>k</em> successful hunts is given by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ktYaxZbGaDyl7WCJrcypaw@2x.png" /></figure><p>Now, let’s suppose that we observe 8 hunts and record no successful hunts. (The hunting success rate for wolves is actually quite low, so such a result would not be surprising.) What would be a good working estimate for <em>b</em>? The maximum likelihood estimate would be -∞, yet we know that wouldn’t make sense: why bother hunting at all if there were no chance of success?</p><p>We can obtain a better working estimate if instead of selecting <em>b</em> to maximize likelihood, we select <em>b</em> to maximize the probability of the posterior distribution</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*bKMqOZYVyLtZn9_2y6U32A@2x.png" /></figure><p>with a suitably chosen prior P(b).</p>
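<p>To make this concrete, here is a small numerical sketch comparing the two estimates for the k = 0, n = 8 example. It jumps ahead a little: assuming P(b) is the Jeffreys-type prior proportional to √(p(1 − p)) discussed below, the MAP estimate works out to the closed form p = (k + ½)/(n + 1), and the sketch simply evaluates that alongside a numerical maximum likelihood fit. Treat it as an illustration under that assumption, not as code from the original post.</p><pre>
import numpy as np
from scipy.optimize import minimize_scalar

n, k = 8, 0  # observed hunts and successful hunts

def negative_log_likelihood(b):
    p = 1.0 / (1.0 + np.exp(-b))
    return -(k * np.log(p) + (n - k) * np.log(1.0 - p))

# Maximum likelihood: the optimizer runs off toward the boundary,
# reflecting the fact that the ML estimate for b diverges to -infinity.
ml = minimize_scalar(negative_log_likelihood, bounds=(-50.0, 50.0), method="bounded")
print("b_ML  =", ml.x)   # pinned near the lower bound

# MAP with a prior proportional to sqrt(p (1 - p)): p_MAP = (k + 1/2) / (n + 1)
p_map = (k + 0.5) / (n + 1)
b_map = np.log(p_map / (1.0 - p_map))
print("b_MAP =", b_map)  # roughly -2.83, a finite and usable estimate
</pre>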
<p>We’ll discuss the process for selecting priors later, but for now consider the prior</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*RcPZossYhsTZ7aNDN3F0lw@2x.png" /></figure><p>The posterior for this prior is</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*SoenVLOYFngc-7SvOdslRg@2x.png" /></figure><p>Put p = (1 + exp(-b))⁻¹. We can find the MAP estimate by taking the logarithm and differentiating</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*h2lVhaAHCNTxfXaunL5sfg@2x.png" /></figure><p>Setting the derivative to zero, we have</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*2fXq9vVpccYeg3fTnWd_2g@2x.png" /></figure><p>In our example, this then gives us the much more reasonable estimate</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*vHYDrAK2fc-O5cAz8NRPZA@2x.png" /></figure><p>The prior we used is an example of a noninformative prior.</p><h4>What Are Noninformative Priors and How Do We Find Them</h4><p>Note: The presentation here of noninformative priors closely follows chapter 2 of Box &amp; Tiao’s book <a href="https://onlinelibrary.wiley.com/doi/book/10.1002/9781118033197">Bayesian Inference in Statistical Analysis</a>.</p><p>Suppose we’re given a likelihood function L(θ|<strong>y</strong>) for a model with parameters θ and data <strong>y</strong>. What’s a good prior to use when we have no particular knowledge about θ?</p><p>Let’s look at a concrete example. Suppose data is generated from a normal distribution with known variance σ² but unknown mean θ. The likelihood function for this model is</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*usqPd4tHm3EfvIA_dOMUgg@2x.png" /></figure><p>And here’s a plot of the likelihood for n=5 and different values of y̅:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*NeHKPdcJRZsXuwmBfoly-g.png" /><figcaption>Likelihood function for the mean of normally distributed data with variance one, <br> n=5, and different values of y̅. Image by author.</figcaption></figure><p>Note how the likelihood curves have the same shape for all values of y̅ and vary only by translation. When the likelihood curve has this property we say that it is <em>data translated</em>.</p><p>Now consider two ranges of the same size [a, b] and [a + t, b + t]. Both would be equivalently placed relative to the likelihood curve if the observed mean were y̅+t instead of y̅. Thus, we should have</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*eMO-OxArFVHuMlbiiT-MaQ@2x.png" /></figure><p>or the uniform prior</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*HXrXQwNYO3lyVKx0zREKzg@2x.png" /></figure><p>Suppose we use an alternative parameterization for a likelihood function L(θ|<strong>y</strong>): we parameterize by <em>u</em> = ϕ(θ) and apply the uniform prior to <em>u</em>, where ϕ is some monotonic injective transformation. The cumulative distribution function is then given by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*KCgKiUTNIaHxIb81rkAdsA@2x.png" /></figure><p>where <em>Z</em> is some normalizing constant. Making the substitution <em>θ = ϕ⁻¹(u)</em> to go back to the original parameterization then gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*CIInDK3my_7u6xMRpXln7A@2x.png" /></figure><p>So we see that a uniform prior over <em>u</em> is equivalent to a prior <em>P(θ) ∝ ϕ’(θ)</em> over <em>θ</em>. 
And finding a noninformative prior for θ is equivalent to finding a transformation <em>ϕ</em> that makes the likelihood function data translated.</p><p>Let’s now return to the likelihood function for a logistic regression model with only a bias term</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*TBWg6XJrOSeUaUDbwYAPpg@2x.png" /></figure><p>Plotting the likelihood for <em>n=8</em> and different values of <em>k</em> gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*MRjEmKb_JmkZrpXOujsIbw.png" /><figcaption>Likelihood function for a logistic regression model with only a bias term, <br> n=8, and different values of k. Image by author.</figcaption></figure><p>It’s pretty clear from looking at the graph that the likelihood function is not data translated. The curve for <em>k=4</em> is more peaked than the others, and the curves for <em>k=1</em> and <em>k=6</em> are skewed.</p><p>Now, for suitably behaved likelihood functions, we can approximate L(θ|<strong>y</strong>) by<br>a Gaussian using a second-order Taylor series expansion of log L(θ|<strong>y</strong>) about<br>θ_ML</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*NaLUGMwaXeuTj7cVuqgGrw@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*HHRHf27B-k19bqu3zefMjQ@2x.png" /></figure><p>As n →∞, the approximation becomes more and more accurate.</p><p>Let’s see how this works for our simplified logistic regression model.</p><p>Differentiating log L(b|k) gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ExQiYn0nfCjpHtiUGARQSw@2x.png" /></figure><p>Differentiating again gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*cNIcV45c1Rvoy179GDYlfg@2x.png" /></figure><p>And here’s what the Gaussian approximation looks like</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*v1K4wKtL5QFVXFc9hEpWaw.png" /><figcaption>Likelihood function for a logistic regression model with only a bias term together<br> with its Gaussian approximation for n=8 and k=1. 
Image by author.</figcaption></figure><p>Now, consider what happens to our Gaussian approximation of L(θ|<strong>y</strong>) when we reparameterize with θ = ϕ⁻¹(u).</p><p>Differentiating gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ADo8qIPG_jc-l741ji-mfw@2x.png" /></figure><p>Differentiating again gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*xVr3Iww6Uiv_JuZm2F7_IA@2x.png" /></figure><p>Evaluating at a maximum u_ML = ϕ(θ_ML), we have</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*-vjdBKcPZQ-v4B2UqWpZrw@2x.png" /></figure><p>where we’ve applied</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*2kcsvxh82p0f8t2SBPDtLA@2x.png" /></figure><p>Returning to logistic regression, put</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*Z-VyKuh1sECeypXVnb34Wg@2x.png" /></figure><p>Then we have</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*u8o-w5mahEcFhEEATUuakQ@2x.png" /></figure><p>Thus, the Gaussian approximation about u_ML becomes</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*HLFFSv9Tdcu7qdRG3b6prA@2x.png" /></figure><p>so that the likelihood function is now approximately data translated.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*vEDnGKDzSR01R7C68uMoDQ.png" /><figcaption>Reparameterized likelihood function for a logistic regression model with only a bias term, <br> n=8, and different values of k. Image by author.</figcaption></figure><h4>Jeffreys Prior</h4><p>Let’s now give a brief sketch of Jeffreys prior for the multiparameter case.<br>Let <strong>θ</strong> = (θ₁, …, θ_k)^T denote the vector of parameters.</p><p>Similar to the single parameter case, the likelihood function can be approximated by a Gaussian about the optimum</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*KD-cFLr2lHI8YG_9vd9rBg@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*jrlDPvAx5HL4S8onXMxywg@2x.png" /></figure><p>Now, suppose we reparameterize with <strong>u</strong> = ϕ(<strong>θ</strong>). Then the updated hessian for the Gaussian approximation becomes</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*gAF64IQy91pxeDCAVVn5pw@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*O70PBuJKI9XLJvZOQqeMMA@2x.png" /></figure><p>In the general multivariable case, we may not be able to reparameterize in such a way that the Gaussian approximation becomes data translated, but suppose we select ϕ so that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*8HcD565cQBZ918PV2cbbVQ@2x.png" /></figure><p>Then</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*FlLl8vXc4MgE7AgYJdjNdw@2x.png" /></figure><p>Having a constant determinant means that the volume of regions will be preserved.</p><p>Let S ∈ ℝᵏ be a region about the origin and <strong>K</strong> be the covariance matrix<br>of the reparameterized Gaussian approximation about <strong>θ</strong>_ML. 
If L is the Cholesky factorization of <strong>K</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*xoYjznAFNIlV96WhRVZG1g@2x.png" /></figure><p>then S corresponds to the region L⁻¹S+<strong>θ</strong>_ML about <strong>θ</strong>_ML that has fixed probability mass and volume</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*oTFJJW7Ot0TLxcqYQM9DYw@2x.png" /></figure><p>The prior on <strong>θ</strong> that corresponds to a uniform prior on <strong>u</strong> is</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*GBJN1ZNj3ojpCMRg2UwiZQ@2x.png" /></figure><p>And this is called <em>Jeffreys prior</em>.</p><p>Let’s apply this to the full binomial logistic regression model with weights. The negative log likelihood function is given by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*DaNNHFGE1Zhqqws2iV662w@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*5Xq_qtqp4BBpjaYc1dTU7Q@2x.png" /></figure><p>Differentiating gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*-eAnAg3B7z5jxqwBlWz-4g@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*Oh3FlFu4EddA4YWDPv8hqg@2x.png" /></figure><p>Differentiating again gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*XqV3Ex1W5Dajj3piIvQ4yw@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*TrNN1BiqW7wpJB_ubF9KRA@2x.png" /></figure><p>We can write the hessian as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*vEPdQcTggsGhcTogm7V6Fw@2x.png" /></figure><p>where <strong>A</strong> is the diagonal matrix with</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*tVpOs2CqOfrsMM6PoUZemA@2x.png" /></figure><p>The Jeffreys prior is then</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*zlk_0S3TuFVMHlW5WUxJIA@2x.png" /></figure><h4>Fitting Logistic Regression with Jeffreys Prior</h4><p>Finding the MAP model for logistic regression with the Jeffreys prior amounts to solving the optimization problem</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*Pf0-EAxODHGuueEKwm9gIw@2x.png" /></figure><p>Let π denote the log of the Jeffreys prior</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*mYVZpKgj1vH1WxsvvLXa9Q@2x.png" /></figure><p>If we can find equations for the gradient and hessian of π, then we can apply standard second-order optimization algorithms to find <strong>w</strong>_MAP.</p><p>Starting with the gradient, we can apply <a href="https://en.wikipedia.org/wiki/Jacobi%27s_formula">Jacobi’s Formula</a> for differentiating a determinant</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*8xP9teC4NGL_MHky-A_2vQ@2x.png" /></figure><p>to get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*fP4jIuglsUKTCzsL-EKRsA@2x.png" /></figure><p>where the derivative of <strong>A_w</strong> is the diagonal matrix</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*W3tgymtPaaxSnZYzvDq_SQ@2x.png" /></figure><p>For the hessian, we apply the formula for differentiating an inverse matrix</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*2g1jy28fSoXJarN2TskcAw@2x.png" /></figure><p>to get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*OlTuYJvbH3BUeoJSp4GOZA@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*2ewsrqy_otYaDPMYQjPxEw@2x.png" /></figure>
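<p>Written out in code, the prior and its gradient are straightforward to evaluate. The following is a minimal NumPy transcription of the formulas above, written for clarity rather than efficiency; it is an illustration of the math, not the implementation used in the bbai package.</p><pre>
import numpy as np

def jeffreys_log_prior_and_grad(X, w):
    """Return 0.5 * log det(X^T A_w X) and its gradient with respect to w,
    where A_w = diag(p_i (1 - p_i)) and p_i = sigmoid(x_i . w)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    a = p * (1.0 - p)                          # diagonal entries of A_w
    M = X.T @ (a[:, None] * X)                 # X^T A_w X
    _, log_det = np.linalg.slogdet(M)
    M_inv = np.linalg.inv(M)
    grad = np.empty(len(w))
    for j in range(len(w)):
        # dA_ii/dw_j = p_i (1 - p_i) (1 - 2 p_i) x_ij
        da = a * (1.0 - 2.0 * p) * X[:, j]
        # Jacobi's formula: d(log det M)/dw_j = trace(M^{-1} dM/dw_j)
        grad[j] = 0.5 * np.trace(M_inv @ (X.T @ (da[:, None] * X)))
    return 0.5 * log_det, grad
</pre>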
src="https://cdn-images-1.medium.com/max/600/1*2ewsrqy_otYaDPMYQjPxEw@2x.png" /></figure><p>By evaluation the gradient and hessian computations in an efficient manner and applying a second-order trust region optimizer, we can then quickly iterate to find <strong>w</strong>_MAP. A Python package using this approach is available at <a href="https://github.com/rnburn/bbai">https://github.com/rnburn/bbai</a>.</p><p>Let’s look at how this works on some examples.</p><h4>Example 1: Simulated Data with Single Variable</h4><p>We’ll start by looking at the prior and posterior for a simulation dataset with<br>a single regressor and no bias.</p><p>We begin by generating the dataset.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ab41baf98cdbb6e1e8fb9cdb42977970/href">https://medium.com/media/ab41baf98cdbb6e1e8fb9cdb42977970/href</a></iframe><p>We next add a function to compute the prior</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c312db6a2e9fdd693362ef93fe87d2e8/href">https://medium.com/media/c312db6a2e9fdd693362ef93fe87d2e8/href</a></iframe><p>And plot it out across a range of values</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/076336cc5d41b3353a2a0c22e4aecf2b/href">https://medium.com/media/076336cc5d41b3353a2a0c22e4aecf2b/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QNtACehncdAG1Q96VwnJtg.png" /><figcaption>Jeffreys prior for logistic regression on a simulation dataset with a single regressor. Image by author.</figcaption></figure><p>Next, we fit a logistic regression MAP model and compare w_MAP to<br>w_true.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0d11a1b2f761f2497d893a8b8c3bb849/href">https://medium.com/media/0d11a1b2f761f2497d893a8b8c3bb849/href</a></iframe><p>Prints</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*9fEOzyWxILEMbJaEArFD7g@2x.png" /></figure><p>And finally, we plot out the posterior distribution.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b4731ea0202878d473da9cef413a2401/href">https://medium.com/media/b4731ea0202878d473da9cef413a2401/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*k7ZwoJpseJ4h3UtO2ucVTQ.png" /><figcaption>Posterior for logistic regression on a simulation dataset with a single regressor together with w_MAP. 
<p>The full example is available at <a href="https://github.com/rnburn/bbai/blob/master/example/05-jeffreys1.ipynb">github.com/rnburn/bbai/example/05-jeffreys1.ipynb</a>.</p><h4>Example 2: Simulated Data with Two Variables</h4><p>Let’s look next at an example with two variables.</p><p>We generate a data set with two variables.</p><p><a href="https://medium.com/media/50b2db79d99e13fe0fd58eb9251af99f/href">[Code listing]</a></p><p>We compute the prior and plot it across a range of values.</p><p><a href="https://medium.com/media/23faf03e404949cb3bbecb8bdd07b3be/href">[Code listing]</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cL0a2UhlFco7KINdDsYb_g.png" /><figcaption>Jeffreys prior for logistic regression on a simulation dataset with two regressors. Image by author.</figcaption></figure><p>Now, we’ll fit a model to find <strong>w</strong>_MAP.</p><p><a href="https://medium.com/media/d57cd25440d990a20ea11af0707d73ff/href">[Code listing]</a></p><p>This prints</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ow_WWP-a0A9ub8u6gnl-LA@2x.png" /></figure><p>And finally, we plot out the posterior distribution centered at <strong>w</strong>_MAP.</p><p><a href="https://medium.com/media/deb66ddc35bf085339667ed8df7b8ff9/href">[Code listing]</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*loTDDwnkOTZ5sxYO8BUXTw.png" /><figcaption>Posterior for logistic regression with Jeffreys prior on a simulation dataset with two regressors. Image by author.</figcaption></figure>
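<p>Again as a stand-in for the embedded cells (the full notebook is linked just below), here is a short sketch of how the two-regressor prior can be evaluated on a grid for a contour plot like the one above; the simulated design matrix is an arbitrary choice for illustration.</p><pre>
import numpy as np

# Arbitrary simulated design matrix with two regressors, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))

def jeffreys_prior(w):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    a = p * (1.0 - p)
    return np.sqrt(np.linalg.det(X.T @ (a[:, None] * X)))

# Evaluate the (unnormalized) prior on a grid; the result can be fed to
# a contour-plotting routine such as matplotlib's contourf.
grid = np.linspace(-5.0, 5.0, 101)
prior = np.array([[jeffreys_prior(np.array([w1, w2])) for w1 in grid] for w2 in grid])
</pre>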
<p>The full example is available at <a href="https://github.com/rnburn/bbai/blob/master/example/06-jeffreys2.ipynb">github.com/rnburn/bbai/example/06-jeffreys2.ipynb</a>.</p><h4>Example 3: Breast Cancer Data Set</h4><p>Now we’ll fit a model to the real-world breast cancer data set.</p><p>We begin by loading and preprocessing the data set.</p><p><a href="https://medium.com/media/67cadff5492ca9b11ef9fe303cde3105/href">[Code listing]</a></p><p>We then fit a logistic regression MAP model and use the Fisher information matrix to estimate the standard error.</p><p><a href="https://medium.com/media/be593840c91be31e653b112d3c68884e/href">[Code listing]</a></p><p>And we print out the weights along with their standard errors.</p><p><a href="https://medium.com/media/4b60f82c069f46053265658866772a82/href">[Code listing]</a></p><p><a href="https://medium.com/media/4a87420ef5f85011a29736712727674d/href">[Output listing]</a></p><p>The full example is available at <a href="https://github.com/rnburn/bbai/blob/master/example/07-jeffreys-breast-cancer.py">github.com/rnburn/bbai/example/07-jeffreys-breast-cancer.py</a>.</p><h4>Conclusion and Further Reading</h4><p>Unlike Ordinary Least Squares, the likelihood function for logistic regression is not data translated. Searching for a data translated transformation leads us to Jeffreys prior, a natural shrinkage prior.</p><p>Firth 1993 [1] and Kosmidis &amp; Firth 2021 [2] analyzed the statistical properties of the estimator that maximizes likelihood with the Jeffreys prior penalization and found it to have smaller asymptotic bias order than the standard maximum likelihood estimator. The reduced bias property of Jeffreys prior combined with its ability to handle the problem of separation [3] can make it an excellent drop-in replacement for the standard approach to fitting logistic regression models.</p><p><em>Blog post originally published at </em><a href="https://buildingblock.ai/logistic-regression-jeffreys"><em>https://buildingblock.ai/logistic-regression-jeffreys</em></a><em>.</em></p><h4>References</h4><p>[1] Firth, D. (1993). <a href="https://www2.stat.duke.edu/~scs/Courses/Stat376/Papers/GibbsFieldEst/BiasReductionMLE.pdf">Bias reduction of maximum likelihood estimates</a>. <em>Biometrika</em> 80, 27–38.</p><p>[2] Kosmidis, I. &amp; Firth, D. (2021). <a href="https://academic.oup.com/biomet/article/108/1/71/5880219">Jeffreys-prior penalty, finiteness and shrinkage in binomial-response generalized linear models</a>. <em>Biometrika</em> 108(1), 71–82.</p><p>[3] Heinze, G. &amp; Schemper, M. (2002). A solution to the problem of separation in logistic regression. <em>Statist. Med. 
</em>21, 2409–19.</p><hr><p><a href="https://medium.com/data-science/logistic-regression-and-the-missing-prior-7c36039af851">Logistic Regression and the Missing Prior</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Build a Bayesian Ridge Regression Model with Full Hyperparameter Integration]]></title>
            <link>https://medium.com/data-science/how-to-build-a-bayesian-ridge-regression-model-with-full-hyperparameter-integration-f4ac2bdaf329?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/f4ac2bdaf329</guid>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[bayesian-statistics]]></category>
            <category><![CDATA[optimization-and-ml]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Wed, 23 Feb 2022 04:53:23 GMT</pubDate>
            <atom:updated>2022-02-23T04:53:23.830Z</atom:updated>
            <content:encoded><![CDATA[<h4>How do we handle the hyperparameter that controls regularization strength?</h4><p><em>In this blog post, we’ll describe an algorithm for Bayesian ridge regression where the hyperparameter representing regularization strength is fully integrated over. An implementation is available at </em><a href="https://github.com/rnburn/bbai"><em>github.com/rnburn/bbai</em></a><em>.</em></p><p>Let <strong>θ</strong> = (σ², <strong>w</strong>) denote the parameters for a linear regression model with weights <strong>w </strong>and normally distributed errors of variance σ².</p><p>If <strong>X</strong> represents an n×p matrix of full rank with p regressors<br>and n rows, then <strong>θ </strong>specifies a probability distribution over possible<br>target values <strong>y</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*PjgbAebQOcEBqSJuWois9A@2x.png" /></figure><p>Suppose we observe <strong>y</strong> and assume <strong>y</strong> is generated from a linear model of unknown parameters. A Bayesian approach to inference seeks to quantify our belief in the unknown parameters <strong>θ </strong>given the observation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*dx1iOuM9hAHcyDERfC4F0g@2x.png" /></figure><p>Applying Bayes’ theorem, we can rewrite the probability as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*micV_ZRGAv6uAiCwBDIEag@2x.png" /></figure><p>where we refer to</p><ul><li>P(<strong>θ</strong>|<strong>y</strong>) as the posterior distribution</li><li>P(<strong>y</strong>|<strong>θ</strong>) as the likelihood function</li><li>P(<strong>θ</strong>) the prior distribution</li></ul><p>The prior distribution describes our belief of <strong>θ </strong>before observing <strong>y</strong> and the posterior distribution describes our updated belief after observing <strong>y</strong>.</p><p>Suppose the prior distribution can be expressed as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*eXApF2cfAZlRGJ4DcEqwCg@2x.png" /></figure><p>where h(⋅, η)<strong> </strong>denotes a family of probability distributions parameterized<br>by what we call a hyperparameter η.</p><p>Traditional approaches to Bayesian linear regression have used what are called <em>conjugate priors</em>. 
A family of priors h(⋅, η) is conjugate if the posterior also belongs to the family</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*b_bgQ1rvU5kOzbpqHw3o_w@2x.png" /></figure><p>Conjugate priors are mathematically convenient as successive observations can be viewed as making successive updates to the parameters of a family of distributions, but requiring h(⋅, η) to be conjugate is a strong assumption.</p><p>Note: See Appendix A for a more detailed comparison to other Bayesian algorithms.</p><p>We’ll instead describe an algorithm where</p><ol><li>Priors are selected to shrink <strong>w</strong>, reflecting the prior hypothesis that <strong>w</strong> is not<br>predictive, and to be approximately <em>noninformative</em> for other parameters.</li><li>We fully integrate over hyperparameters so that no arbitrary choice of η is required.</li></ol><p>Let’s first consider what it means for a prior to be noninformative.</p><h3>How to Pick Non-informative Priors?</h3><p>Note: The presentation here of non-informative priors closely follows chapter 2 of Box &amp; Tiao’s book <a href="https://onlinelibrary.wiley.com/doi/book/10.1002/9781118033197">Bayesian Inference in Statistical Analysis</a>.</p><p>Suppose <strong>y</strong> is data generated from a normal distribution of mean 0 but unknown variance. Let σ denote the standard deviation and let ℓ denote the likelihood function</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*EAj3oTHDTC4jiLqfgLRy_Q@2x.png" /></figure><p>Suppose we impose a uniform prior over σ so that the posterior is</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*aC-T9TwnhoiHW18U-OyYOg@2x.png" /></figure><p>and the cumulative distribution function is</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ay_BWDAndRtJEQUYcD6k9g@2x.png" /></figure><p>where N is some normalizing constant.</p><p>But now let’s suppose that instead of standard deviation we parameterize over variance. Making the substitution u=σ² into the cumulative distribution function gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*mgZW1HC5kaDNJQWwzZz2UA@2x.png" /></figure><p>Thus, we see that choosing a uniform prior over σ is equivalent to choosing the improper prior</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*JpLixHuIIJZIXqhkk9xNng@2x.png" /></figure><p>over variance. In general, suppose u=φ(θ) is an alternative way of parameterizing the likelihood function, where φ is some monotonic, one-to-one, onto function. 
Then a uniform prior over u is equivalent to choosing a prior</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*zVopJQbQnnJyc1vrudNnGg@2x.png" /></figure><p>over θ.</p><p>So, when does it make sense to use a uniform prior if the choice is sensitive to parameterization?</p><p>Let’s consider how the likelihood is affected by changes in the observed value <strong>y</strong>^⊤<strong>y</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*0LQzX6IhcBqLW_xUTCgASQ.png" /><figcaption>Likelihood function for the standard deviation of normally distributed data with zero mean, <br> n=10, and different values of <strong>y</strong>^⊤<strong>y</strong>.</figcaption></figure><p>As we can see from the figure, as <strong>y</strong>^⊤<strong>y</strong> is increased the shape of the likelihood function changes: it becomes less peaked and more spread out.</p><p>Observe that we can rewrite the likelihood function as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*KWHIvZLfgeB3mMwUApzxzA@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*BbehrckkKk-qgGbUdEiS8Q@2x.png" /></figure><p>Thus, in the parameter log σ, likelihood has the form</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*2jHfORtUhd2ir3LcHWPeZg@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ysAOjksaCW7uLEdLHfLLkA@2x.png" /></figure><p>And we say that the likelihood function is <em>data translated</em> in log σ because everything is known about the likelihood curve except the location and the value of the observation serves only to shift the curve.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*2vKl5BBFY_Jf5L2-9gWm7Q.png" /><figcaption>Likelihood function for the log standard deviation of normally distributed data with zero mean, <br> n=10, and different values of <strong>y</strong>^⊤<strong>y</strong>.</figcaption></figure><p>When the likelihood function is data translated in a parameter, then it makes sense to use a uniform function for a noninformative prior. Before observing data, we have no reason to prefer one range of parameters [a, b] over another range of the same width [t + a, t + b] because they will be equivalently placed relative to the likelihood curve if the observed data translates by t.</p><p>Now, let’s return to picking our priors.</p><h3>Picking Regularized Bayesian Linear Regression Priors</h3><p>For the parameter σ, we use the noninformative prior</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*2mbIBANh4PwSgRkpQngzmg@2x.png" /></figure><p>which is equivalent to using a uniform prior over the parameter log σ. For <strong>w</strong>, we want an informative prior that shrinks the weights, reflecting a prior belief that weights are non-predictive. Let η denote a hyperparameter that controls<br>the degree of shrinkage. Then we use the spherical normal distribution with covariance matrix (σ/λ_η)² <strong>I</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*25guNxURjVHRFLzVS19_9g@2x.png" /></figure><p>Note that we haven’t described yet how η parameterizes λ and we’ll also be integrating over η so we additionally have a prior for η (called a hyperprior) so that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*t8hEWoxM8udPNTz6s69ISQ@2x.png" /></figure><p>Our goal is for the prior P(η) to be noninformative. 
So we want to know: In what parameterization, would P(<strong>y</strong>|η) be data translated?</p><h3>How to Parameterize the shrinkage prior?</h3><p>Before thinking about how to parameterize λ, let’s characterize the likelihood function for λ.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*Hng7CuOIMNGq6f4PCUllFg@2x.png" /></figure><p>Expanding the integrand, we have</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*JKkkc2wsmg6fo_vIJBk17w@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*M0CmOLauoQSzqYTg1ytPGg@2x.png" /></figure><p>and</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*OguPCRyzW-p6XlVN0s9jHg@2x.png" /></figure><p>Observe that #1 is independent of <strong>w</strong>, so the integral with respect to <strong>w</strong> is equivalent to integrating over an unnormalized Gaussian.</p><p>Note: Recall the formula for a normalized Gaussian integral</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*I68Lxpbtzr6iBb3YsWvlsQ@2x.png" /></figure><p>Thus,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*TErm5T17gCZMI6iWBbgK1w@2x.png" /></figure><p>Next let’s consider the integral over σ.</p><p>Put</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ZTaeGYsSeuesWhKpSBAaMQ@2x.png" /></figure><p>Then we can rewrite</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*817N2k5MyAMdRSDTwDv3Sw@2x.png" /></figure><p>After making a change of variables, we can express the integral with respect to σ as a form of the Gamma function.</p><p>Consider an integral</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*Pve3NBj0RvSeiqoTtHLjMg@2x.png" /></figure><p>Put</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*3lf8-jpXLVerHxYcIOpA8A@2x.png" /></figure><p>Then</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*9Xu6rJKFNDyN1kmIXZz8zQ@2x.png" /></figure><p>And</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*7kXGERcRPoAM0WQIUv3Nuw@2x.png" /></figure><p>where Γ denotes the Gamma function</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*b3EcpMAY9gFceJskk-_n5w@2x.png" /></figure><p>Thus, we can write</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*zmgvO_sqcjulEQxdj0-sdw@2x.png" /></figure><p>Let <strong>U</strong>, <strong>Σ</strong>, and <strong>V</strong> denote the singular value decomposition of <strong>X</strong><br>so that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*o9WKwIFiBk1fpoVfDe1WVQ@2x.png" /></figure><p>Let ξ₁, ξ₂, … denote the non-zero diagonal entries of <strong>Σ</strong>. Put <strong>Λ</strong>=<strong>Σ</strong>^⊤<strong>Σ</strong>. Then <strong>Λ </strong>is a diagonal matrix with entries ξ₁², ξ₂², … and</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*8PC5Gv-F5WZge95DWsjnBQ@2x.png" /></figure><p>Note: To implement our algorithm, we will only need the matrix <strong>V</strong> and the nonzero diagonal entries of <strong>Σ</strong>, which can be efficiently computed with a partial singular value decomposition. 
See the LAPACK function <a href="http://www.netlib.org/lapack/explore-html/d1/d7e/group__double_g_esing_gad8e0f1c83a78d3d4858eaaa88a1c5ab1.html">gesdd</a>.</p><p>Put</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*M2xreODZTRCl-BZOAV-j3w@2x.png" /></figure><p>Then we can rewrite h and g as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*RlTTv6Z67YhPrYBSDgIz1Q@2x.png" /></figure><p>And</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*M_V1PUpveNo5v_PwFFYcSg@2x.png" /></figure><p>Here we adopt the terminology</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*6tkzplN3d5e6guozrP1Uyg@2x.png" /></figure><p>Put</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*mhpJuYyVWgVb0OuR5rPsow@2x.png" /></figure><p>Then r is a monotonically increasing function that ranges from 0 to 1,<br>and we can think of r as the average shrinkage factor across the eigenvectors.</p><p>Now, let’s make an approximation by replacing individual eigenvector shrinkages with the average:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*PHssg06BEyfOr49eXkaHFg@2x.png" /></figure><p>Substituting the approximation into the likelihood then gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*5zGmIu1WTZO4772DmoidPA@2x.png" /></figure><p>We see that the approximated likelihood can be expressed as a function of</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*vFctgOMCcdVqdJhOmNhgcQ@2x.png" /></figure><p>and it follows that the likelihood is approximately data translated in log r(λ).</p><p>Thus, we can achieve an approximately noninformative prior if we<br>let η represent the average shrinkage, put</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*o6XWF37ShYFKN58X2-xQKA@2x.png" /></figure><p>and use the hyperprior</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*bSWCjmtqXNb8cQYsPeovJQ@2x.png" /></figure><p>Note: To invert r, we can use a standard root-finding algorithm.</p><p>Differentiating r(λ) gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*hXlYnpYl5otUpoNf2uETvA@2x.png" /></figure><p>Using the derivative and a variant of Newton’s algorithm, we can then quickly iterate to a solution of r⁻¹(η).</p><h3>Making Predictions</h3><p>The result of fitting a Bayesian model is the posterior distribution P(<strong>θ</strong>|<strong>y</strong>). Let’s consider how we can use the distribution to make predictions given a new data point <strong>x</strong>’.</p><p>We’ll start by computing the expected target value</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*TlEmkk3YZK6LihjWywoBAA@2x.png" /></figure><p>And</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*5_Pm13o6VRkAyIU7fgqhPw@2x.png" /></figure><p>Note: To compute expected values of expressions involving η, we need to integrate over the posterior distribution P(η|<strong>y</strong>). 
We won’t have an analytical form for the integrals, but we can efficiently and accurately integrate with an <a href="https://en.wikipedia.org/wiki/Adaptive_quadrature">adaptive quadrature</a>.</p><p>To compute variance, we have</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*s0m8TGG3xm2QLVMQZmaOlw@2x.png" /></figure><p>and</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*7NiKAWkhwmF5tXCD3Sb-rQ@2x.png" /></figure><p>Starting with E[σ²|<strong>y</strong>], recall that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*x8JODEtnAo8IcC44whC2oA@2x.png" /></figure><p>And</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*vCbSXfAFJnkyVTePlhlUSA@2x.png" /></figure><p>Thus,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*KvnfRIBNAyfnuIKU6byxIg@2x.png" /></figure><p>Note: The Gamma function has the property</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*B_q3-1by7MA9rvPBBhHstw@2x.png" /></figure><p>So,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*yhLLEca48slD1LQcCuIeUg@2x.png" /></figure><p>And</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*rAm647OvbxYAp98Or-yygg@2x.png" /></figure><p>It follows that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*-Yq3arlaMQ3BwbR4wTF5ww@2x.png" /></figure><p>For E[<strong>w</strong> <strong>w</strong>^T|<strong>y</strong>], we have</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*TR9vL-QKB6p6ggDq2ydUBA@2x.png" /></figure><h3>Outline of Algorithm</h3><p>We’ve seen that computing statistics about predictions involves integrating over the posterior distribution P(η|<strong>y</strong>). We’ll briefly sketch out an algorithm for computing such integrals. We describe it only for computing the expected value of <strong>w</strong>. Other expected values can be computed with straightforward modifications.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*G1Y4nx92kYdehWWDtN8_jA@2x.png" /></figure><p>The procedure SHRINK-INVERSE applies Newton’s algorithm for root-finding with r and r’ to compute r⁻¹.</p><p>To compute the integral over the hyperparameter posterior, we make use of an adaptive quadrature algorithm. Quadratures approximate an integral by a weighted sum at different points of evaluation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ouuKRns4DobEb3eGj-s2Mg@2x.png" /></figure><p>In general, the more points of evaluation used, the more accurate the approximation will be. Adaptive quadrature algorithms compare the integral approximation at different levels of refinement to estimate the error and increase the number of points of evaluation until a desired tolerance is reached.</p><p>Note: We omit the details of the quadrature algorithm used and describe it only at a high level.</p>
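<p>To illustrate the structure of this step, here is a small sketch that computes a posterior expectation over η on (0, 1) with SciPy's general-purpose adaptive quadrature. The density used in the quick check at the bottom is a stand-in, not the posterior derived above, and the actual implementation relies on the more specialized rules referenced next.</p><pre>
import numpy as np
from scipy.integrate import quad

def posterior_expectation(f, unnormalized_posterior):
    """E[f(eta) | y] for eta in (0, 1), via adaptive quadrature.

    unnormalized_posterior can be any function proportional to P(eta | y)."""
    numerator, _ = quad(lambda eta: f(eta) * unnormalized_posterior(eta), 0.0, 1.0)
    normalizer, _ = quad(unnormalized_posterior, 0.0, 1.0)
    return numerator / normalizer

# Quick check with a Beta(2, 5)-shaped stand-in posterior: E[eta] = 2/7.
print(posterior_expectation(lambda eta: eta,
                            lambda eta: eta * (1.0 - eta) ** 4))
</pre>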
<p>For more information on specific quadrature algorithms, refer to <a href="https://en.wikipedia.org/wiki/Gauss%E2%80%93Hermite_quadrature">Gauss-Hermite Quadratures</a> and <a href="https://en.wikipedia.org/wiki/Tanh-sinh_quadrature">Tanh-sinh Quadratures</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*uPbfMtO6fVBwTl5uKMW8GQ@2x.png" /></figure><h3>Experimental Results</h3><p>To better understand how the algorithm works in practice, we’ll set up a small<br>experimental data set, and we’ll compare a model fit with the Bayesian algorithm to Ordinary Least Squares (OLS) and to a ridge regression model fit so as to minimize error on a Leave-one-out Cross-validation (LOOCV) of the data set.</p><p>Full source code for the experiment is available at <a href="https://github.com/rnburn/bbai/tree/master/example/03-bayesian.py">github.com/rnburn/bbai/example/03-bayesian.py</a>.</p><h4>Generating the Data Set</h4><p>We’ll start by setting up the data set. For the design matrix, we’ll randomly generate a 20-by-10 matrix <strong>X</strong> using a Gaussian with zero mean and covariance matrix <strong>K</strong> where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*HxNHvGQYnK2KeCqzhjMUdg@2x.png" /></figure><p>We’ll generate a weight vector with 10 elements from a spherical Gaussian with unit variance, and we’ll rescale the weights so that the signal variance is equal to 1.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*3k6KTPLaHN83u3x3ox1LcQ@2x.png" /></figure><p>Then we’ll set</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*nmezUB4sF8rtn2iyAJJzIQ@2x.png" /></figure><p>where <strong>ε</strong> is a vector of 20 elements taken from a Gaussian with unit variance.</p><p>Here’s the Python code that sets up the data set.</p><p><a href="https://medium.com/media/f6b0e90a68aef8629029df4e3c5f844e/href">[Code listing]</a></p><h4>Fitting Models</h4><p>Now, we’ll fit a Bayesian ridge regression model, an OLS model, and a ridge regression model with the regularization strength set so that mean squared error is minimized on a LOOCV.</p><p><a href="https://medium.com/media/c0cf705c061dcd57fe537e9af5a0fdcd/href">[Code listing]</a></p><p>We can measure the out-of-sample error variance for each model using the equation</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*xq2wxGURUVuKt4agW8IZrw@2x.png" /></figure><p><a href="https://medium.com/media/3e5f5930b4256374e6bc6c795e5498a9/href">[Code listing]</a></p><p>Doing this gives us</p><p><a href="https://medium.com/media/3129c728a67f00f0ae955eb3f6ef96c0/href">[Output listing]</a></p><p>We see that both the Bayesian and ridge regression models are able to prevent overfitting and achieve better out-of-sample results.</p><p>Finally, we’ll compare the estimated noise variance from the Bayesian model to that from the OLS model.</p><p><a href="https://medium.com/media/22191e4c3854ec41d4542f90fb4785a1/href">[Code listing]</a></p>
href="https://medium.com/media/22191e4c3854ec41d4542f90fb4785a1/href">https://medium.com/media/22191e4c3854ec41d4542f90fb4785a1/href</a></iframe><p>This gives us</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5062fcb761254ffad917d749827a010a/href">https://medium.com/media/5062fcb761254ffad917d749827a010a/href</a></iframe><h3>Conclusion</h3><p>When applying Bayesian methods to ridge regression, we need to address: how do we handle the hyperparameter that controls regularization strength?</p><p>One option is to use a point estimate, where a value of the hyperparameter is chosen to optimize some metric (e.g. likelihood or a cross-validation). But such an approach goes against the Bayesian methodology of using probability distributions expressing belief for parameters, particularly if the likelihood is not strongly peaked about a particular value of the hyperparameter.</p><p>Another option commonly used is to apply a known distribution for the hyperparameter prior (for example a half-Cauchy distribution). Such an approach gives us a posterior distribution for the hyperparameter and can work fairly well in practice, but the choice of prior distribution is somewhat arbitrary.</p><p>In this blog post, we showed a different approach where we selected a hyperparameter prior distribution so as to be approximately noninformative. We developed algorithms to compute standard prediction statistics given the noninformative prior, and we demonstrated on a small example that it compared favorably to using a point estimate of the hyperparameter selected to optimize a leave-one-out cross-validation.</p><h3>Appendix A: Comparison with Other Bayesian Algorithms</h3><p>Here, we’ll give a brief comparison of the algorithm presented to scikit-learn’s algorithm for <a href="https://scikit-learn.org/stable/modules/linear_model.html#bayesian-ridge-regression">Bayesian ridge regression</a>.</p><p>Scikit-learn’s algorithm makes use of conjugate priors and because of that is restricted to use the Gamma prior which requires four hyperparameters chosen arbitrarily to be small values. 
Additionally, it requires initial values for the parameters α and λ that are then updated from the data.</p><p>The algorithm’s performance can be sensitive to the choice of values for these parameters, and scikit-learn’s documentation provides a curve-fitting <a href="https://scikit-learn.org/stable/auto_examples/linear_model/plot_bayesian_ridge_curvefit.html">example</a> where the defaults perform poorly.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d78d5193b1d2ce83bc16c922fd3c44e6/href">https://medium.com/media/d78d5193b1d2ce83bc16c922fd3c44e6/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*ApxrpYULBMkGW9Sbd-e9gg.png" /></figure><p>In comparison, the algorithm we presented requires no initial parameter values, and because the hyperparameter is integrated over, poorly performing values contribute little to the posterior probability mass.</p><p>We can see that our approach handles the curve-fitting example without requiring any tweaking.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d086c11e3eb27ddc16d4b8b63e1c6559/href">https://medium.com/media/d086c11e3eb27ddc16d4b8b63e1c6559/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/360/1*hh7FMoSy5wrLnpfcW3da9g.png" /></figure><p>The full curve-fitting comparison example is available at <a href="https://github.com/rnburn/bbai/blob/master/example/04-curve-fitting.ipynb">github.com/rnburn/bbai/blob/master/example/04-curve-fitting</a>.</p><p><em>This blog post was originally published at </em><a href="https://buildingblock.ai/bayesian-ridge-regression"><em>buildingblock.ai/bayesian-ridge-regression</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f4ac2bdaf329" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/how-to-build-a-bayesian-ridge-regression-model-with-full-hyperparameter-integration-f4ac2bdaf329">How to Build a Bayesian Ridge Regression Model with Full Hyperparameter Integration</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Why Standard C++ Math Functions Are Slow]]></title>
            <link>https://itnext.io/why-standard-c-math-functions-are-slow-d10d02554e33?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/d10d02554e33</guid>
            <category><![CDATA[cplusplus]]></category>
            <category><![CDATA[performance]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Sun, 27 Dec 2020 03:37:05 GMT</pubDate>
            <atom:updated>2021-12-09T17:59:23.028Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9URxUNoctIyySKJV6dt4CA.png" /><figcaption>Assembly for a C++ function computing square roots</figcaption></figure><p>Performance has always been a high priority for C++, yet there are many examples both in the language and the standard library where compilers produce code that is significantly slower than what a machine is capable of. In this blog post, I’m going to explore one such example from the standard math library.</p><p>Suppose we’re tasked with computing the square roots of an array of floating point numbers. We might write a function like this to perform the operation:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c46f885539cec764826566e2ef42edf5/href">https://medium.com/media/c46f885539cec764826566e2ef42edf5/href</a></iframe><p>If we’re using gcc, we can compile the code with</p><pre>g++ -c -O3 -march=native sqrt1.cpp</pre><p>With -O3, gcc will optimize the code heavily but will still produce code that is standard compliant. The -march=native option tells gcc to produce code targeting the native architecture’s instruction set. The resulting binaries may not be portable even between different x86-64 CPUs.</p><p>Now, let’s benchmark the function. We’ll use <a href="https://github.com/google/benchmark">google benchmark</a> to measure how long it takes to compute the square roots of 1,000,000 numbers:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/477ef4b70f73d2bd4773371e3bc2c734/href">https://medium.com/media/477ef4b70f73d2bd4773371e3bc2c734/href</a></iframe><p>Compiling our benchmark and running it, we get</p><pre>g++ -O3 -march=native -o benchmark benchmark.cpp sqrt1.o<br>./benchmark<br>Running ./benchmark<br>Run on (6 X 2600 MHz CPU s)<br>CPU Caches:<br> L1 Data 32 KiB (x6)<br> L1 Instruction 32 KiB (x6)<br> L2 Unified 256 KiB (x6)<br> L3 Unified 9216 KiB (x6)<br>Load Average: 0.17, 0.07, 0.05<br>-----------------------------------------------------------<br>Benchmark Time CPU Iterations<br>-----------------------------------------------------------<br>BM_Sqrt1/1000000 4984457 ns 4946631 ns 115</pre><p>Can we do better? Let’s try this version:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ee9de43635c977c3177bfa21b1d200cd/href">https://medium.com/media/ee9de43635c977c3177bfa21b1d200cd/href</a></iframe><p>and compile with</p><pre>g++ -c -O3 -march=native -fno-math-errno sqrt2.cpp</pre><p>The only difference between compute_sqrt1 and compute_sqrt2 is that we added the extra option -fno-math-errno when compiling. 
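Concretely, both versions look roughly like this (an illustrative sketch; the exact code in the original post may differ slightly): sqrt1.cpp defines compute_sqrt1, and sqrt2.cpp defines compute_sqrt2 with the same body.</p><pre>// Illustrative sketch of sqrt1.cpp / sqrt2.cpp: loop over an array and
// take square roots. Only the function name and compiler flags differ.
#include &lt;cmath&gt;
#include &lt;cstddef&gt;

void compute_sqrt1(const double* x, std::size_t n, double* result) {
  for (std::size_t i = 0; i &lt; n; ++i) {
    result[i] = std::sqrt(x[i]);
  }
}</pre><p>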
I’ll explain later what -fno-math-errno does; but for now, I’ll only point out that the produced code is no longer standard compliant.</p><p>Let’s benchmark compute_sqrt2.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f74ac7944b974f1a8e1da8a72526298b/href">https://medium.com/media/f74ac7944b974f1a8e1da8a72526298b/href</a></iframe><p>Running</p><pre>g++ -O3 -march=native -o benchmark benchmark.cpp sqrt2.o<br>./benchmark</pre><p>we get</p><pre>Running ./benchmark<br>Run on (6 X 2600 MHz CPU s)<br>CPU Caches:<br> L1 Data 32 KiB (x6)<br> L1 Instruction 32 KiB (x6)<br> L2 Unified 256 KiB (x6)<br> L3 Unified 9216 KiB (x6)<br>Load Average: 0.17, 0.07, 0.05<br>-----------------------------------------------------------<br>Benchmark Time CPU Iterations<br>-----------------------------------------------------------<br>BM_Sqrt2/1000000 1195070 ns 1192078 ns 553</pre><p>Yikes! compute_sqrt2 is <em>more than 4 times faster</em> than compute_sqrt1.</p><p>What’s different? Let’s drill down into the assembly to find out. We can produce the assembly for the code by running</p><pre>g++ -S -c -O3 -march=native sqrt1.cpp<br>g++ -S -c -O3 -march=native -fno-math-errno sqrt2.cpp</pre><p>The result will depend on what architecture you’re using, but looking at <a href="https://github.com/rnburn/cmath-benchmark/blob/main/asm/sqrt1.s">sqrt1.s</a> on my architecture, we see this section</p><pre>.L3:<br> vmovsd (%rdi), %xmm0<br> vucomisd %xmm0, %xmm2<br> vsqrtsd %xmm0, %xmm1, %xmm1<br> ja .L12<br> addq $8, %rdi<br> vmovsd %xmm1, (%rdx)<br> addq $8, %rdx<br> cmpq %r12, %rdi<br> jne .L3</pre><p>Let’s break down the first few instructions:</p><pre>1: vmovsd (%rdi), %xmm0 <br> # Load a value from memory into the register %xmm0<br>2: vucomisd %xmm0, %xmm2<br> # Compare the value of %xmm0 with %xmm2 and set the register<br> # EFLAGS with the result<br>3: vsqrtsd %xmm0, %xmm1, %xmm1 <br> # Compute the square root of %xmm0 and store in %xmm1<br>4: ja .L12 <br> # Inspects EFLAGS and jumps if %xmm2 is above %xmm0</pre><p>What are instructions 2 and 4 for? Recall that for real numbers, sqrt is undefined on negative values. When std::sqrt is passed a negative number, the C++ standard requires that it return the special floating point value NaN and that it set the global variable errno to EDOM. But that error handling ends up being really expensive.</p><p>If we look at <a href="https://github.com/rnburn/cmath-benchmark/blob/main/asm/sqrt2.s">sqrt2.s</a>, we see these instructions for the main loop:</p><pre>.L6:<br> addl $1, %r8d<br> vsqrtpd (%r10,%rax), %ymm0<br> vextractf128 $0x1, %ymm0, 16(%rcx,%rax)<br> vmovups %xmm0, (%rcx,%rax)<br> addq $32, %rax<br> cmpl %r8d, %r11d<br> ja .L6</pre><p>Without the burden of having to do error handling, gcc can produce much faster code. vsqrtpd is what’s known as a Single Instruction Multiple Data (SIMD) instruction. It computes the square root of four double precision floating point numbers at a time. For computationally expensive functions like sqrt, vectorization helps a lot.</p><p>It’s unfortunate that the standard requires such error handling. It’s so much slower to do the error checking that many compilers like Intel’s icc and Apple’s default clang-based compiler opt out of the error handling by default. 
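To see the mandated behavior directly, a small test along the following lines (a sketch, assuming gcc with glibc and the default -fmath-errno behavior) should report a NaN result with errno set to EDOM:</p><pre>// Illustrative check of the standard-mandated error handling.
#include &lt;cerrno&gt;
#include &lt;cmath&gt;
#include &lt;cstdio&gt;

int main() {
  // volatile keeps the compiler from folding the sqrt call away.
  volatile double x = -1.0;
  errno = 0;
  double result = std::sqrt(x);
  std::printf("isnan: %d, errno == EDOM: %d\n",
              std::isnan(result) ? 1 : 0, errno == EDOM ? 1 : 0);
}</pre><p>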
Even if we want std::sqrt to do error handling, we can’t portably rely on major compilers to do so.</p><p><em>The complete benchmark can be found at </em><a href="https://github.com/rnburn/cmath-benchmark"><em>rnburn/cmath-benchmark</em></a><em>. This story was originally published at </em><a href="https://ryanburn.com/2020/12/26/why-c-standard-math-functions-are-slow/">https://ryanburn.com/2020/12/26/why-c-standard-math-functions-are-slow/</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d10d02554e33" width="1" height="1" alt=""><hr><p><a href="https://itnext.io/why-standard-c-math-functions-are-slow-d10d02554e33">Why Standard C++ Math Functions Are Slow</a> was originally published in <a href="https://itnext.io">ITNEXT</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>