<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Ryan Burn on Medium]]></title>
        <description><![CDATA[Stories by Ryan Burn on Medium]]></description>
        <link>https://medium.com/@ryan.burn?source=rss-f55ad0a8217------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*v2jvthQJ4-zUQTzn5JlLfg.jpeg</url>
            <title>Stories by Ryan Burn on Medium</title>
            <link>https://medium.com/@ryan.burn?source=rss-f55ad0a8217------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 08 Apr 2026 00:32:16 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@ryan.burn/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Constructing Default Priors for Bayesian Hypothesis Testing: A Reference Analysis Approach]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science-collective/constructing-default-priors-for-bayesian-hypothesis-testing-a-reference-analysis-approach-843166ff09b6?source=rss-f55ad0a8217------2"><img src="https://cdn-images-1.medium.com/max/1920/1*h6tZ6ynIZ6by1zICjcwzjw.jpeg" width="1920"></a></p><p class="medium-feed-snippet">Consider these birth records from Paris collected between 1745 and 1770:</p><p class="medium-feed-link"><a href="https://medium.com/data-science-collective/constructing-default-priors-for-bayesian-hypothesis-testing-a-reference-analysis-approach-843166ff09b6?source=rss-f55ad0a8217------2">Continue reading on Data Science Collective »</a></p></div>]]></description>
            <link>https://medium.com/data-science-collective/constructing-default-priors-for-bayesian-hypothesis-testing-a-reference-analysis-approach-843166ff09b6?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/843166ff09b6</guid>
            <category><![CDATA[probability]]></category>
            <category><![CDATA[mathematics]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[bayesian-statistics]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Mon, 31 Mar 2025 02:25:41 GMT</pubDate>
            <atom:updated>2025-03-31T02:25:41.915Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[How to Apply the Central Limit Theorem to Constrained Data]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/how-to-apply-the-central-limit-theorem-to-constrained-data-3dbc20bceeaa?source=rss-f55ad0a8217------2"><img src="https://cdn-images-1.medium.com/max/1500/1*tv870pIfZMm0fv7c9AB-9A.png" width="1500"></a></p><p class="medium-feed-snippet">What can we say about the mean of data distributed in an interval [a, b]?</p><p class="medium-feed-link"><a href="https://medium.com/data-science/how-to-apply-the-central-limit-theorem-to-constrained-data-3dbc20bceeaa?source=rss-f55ad0a8217------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/how-to-apply-the-central-limit-theorem-to-constrained-data-3dbc20bceeaa?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/3dbc20bceeaa</guid>
            <category><![CDATA[thoughts-and-theory]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[probability]]></category>
            <category><![CDATA[bayesian-statistics]]></category>
            <category><![CDATA[mathematics]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Tue, 10 Dec 2024 19:58:16 GMT</pubDate>
            <atom:updated>2024-12-10T20:04:04.679Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Using Objective Bayesian Inference to Interpret Election Polls]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/using-objective-bayesian-inference-to-interpret-election-polls-3de2d4354989?source=rss-f55ad0a8217------2"><img src="https://cdn-images-1.medium.com/max/2048/1*nl0FEf6CXz-SjjjWsDeE2w.png" width="2048"></a></p><p class="medium-feed-snippet">How to build a polls-only objective Bayesian model that goes from a state polling lead to probability of winning the state</p><p class="medium-feed-link"><a href="https://medium.com/data-science/using-objective-bayesian-inference-to-interpret-election-polls-3de2d4354989?source=rss-f55ad0a8217------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/using-objective-bayesian-inference-to-interpret-election-polls-3de2d4354989?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/3de2d4354989</guid>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[elections]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[bayesian-statistics]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Wed, 30 Oct 2024 00:16:17 GMT</pubDate>
            <atom:updated>2024-10-30T00:16:17.448Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Comparing Sex Ratios: Revisiting a Famous Statistical Problem from the 1700s]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/comparing-sex-ratios-revisiting-a-famous-statistical-problem-from-the-1700s-720cd57872c6?source=rss-f55ad0a8217------2"><img src="https://cdn-images-1.medium.com/max/1600/1*NzY7HeG_u8CGfX8nbeDOCQ.png" width="1600"></a></p><p class="medium-feed-snippet">What can we say about the difference of two binomial distribution probabilities</p><p class="medium-feed-link"><a href="https://medium.com/data-science/comparing-sex-ratios-revisiting-a-famous-statistical-problem-from-the-1700s-720cd57872c6?source=rss-f55ad0a8217------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/comparing-sex-ratios-revisiting-a-famous-statistical-problem-from-the-1700s-720cd57872c6?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/720cd57872c6</guid>
            <category><![CDATA[mathematics]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[probability]]></category>
            <category><![CDATA[bayesian-statistics]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Fri, 09 Aug 2024 20:04:03 GMT</pubDate>
            <atom:updated>2024-08-09T20:04:03.476Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[How to Efficiently Approximate a Function of One or More Variables]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/how-to-efficiently-approximate-a-function-of-one-or-more-variables-fc702c9c9431?source=rss-f55ad0a8217------2"><img src="https://cdn-images-1.medium.com/max/1400/1*gOwWFJqYxETyYdLWgrzkaw.png" width="1400"></a></p><p class="medium-feed-snippet">Use sparse grids and Chebyshev interpolants to build accurate approximations to multivariable functions.</p><p class="medium-feed-link"><a href="https://medium.com/data-science/how-to-efficiently-approximate-a-function-of-one-or-more-variables-fc702c9c9431?source=rss-f55ad0a8217------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/how-to-efficiently-approximate-a-function-of-one-or-more-variables-fc702c9c9431?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/fc702c9c9431</guid>
            <category><![CDATA[math]]></category>
            <category><![CDATA[algorithms]]></category>
            <category><![CDATA[numerical-analysis]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[deep-dives]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Fri, 28 Jun 2024 23:05:11 GMT</pubDate>
            <atom:updated>2024-06-29T03:24:17.837Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Introduction to Objective Bayesian Hypothesis Testing]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/introduction-to-objective-bayesian-hypothesis-testing-06c9e98eb90b?source=rss-f55ad0a8217------2"><img src="https://cdn-images-1.medium.com/max/1600/1*_tevXbtYsWCPtFSLuXw2Iw.jpeg" width="1600"></a></p><p class="medium-feed-snippet">How to Derive Posterior Probabilities for Hypotheses using Default Bayes Factors</p><p class="medium-feed-link"><a href="https://medium.com/data-science/introduction-to-objective-bayesian-hypothesis-testing-06c9e98eb90b?source=rss-f55ad0a8217------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/introduction-to-objective-bayesian-hypothesis-testing-06c9e98eb90b?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/06c9e98eb90b</guid>
            <category><![CDATA[bayesian-statistics]]></category>
            <category><![CDATA[probability]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[statistics]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Tue, 11 Jun 2024 23:50:13 GMT</pubDate>
            <atom:updated>2024-06-11T23:50:13.700Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[An Introduction to Objective Bayesian Inference]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://medium.com/data-science/an-introduction-to-objective-bayesian-inference-cc20c1a0836e?source=rss-f55ad0a8217------2"><img src="https://cdn-images-1.medium.com/max/1500/1*g_IX7wE6-KFSWKWQ9_Sxwg.png" width="1500"></a></p><p class="medium-feed-snippet">How to calculate probability when &#x201C;we absolutely know nothing antecedently to any trials made&#x201D; (Bayes, 1763)</p><p class="medium-feed-link"><a href="https://medium.com/data-science/an-introduction-to-objective-bayesian-inference-cc20c1a0836e?source=rss-f55ad0a8217------2">Continue reading on TDS Archive »</a></p></div>]]></description>
            <link>https://medium.com/data-science/an-introduction-to-objective-bayesian-inference-cc20c1a0836e?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/cc20c1a0836e</guid>
            <category><![CDATA[mathematics]]></category>
            <category><![CDATA[bayesian-statistics]]></category>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[probability]]></category>
            <category><![CDATA[statistics]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Tue, 23 Apr 2024 04:30:06 GMT</pubDate>
            <atom:updated>2024-05-24T05:06:27.655Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Logistic Regression and the Missing Prior]]></title>
            <link>https://medium.com/data-science/logistic-regression-and-the-missing-prior-7c36039af851?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/7c36039af851</guid>
            <category><![CDATA[logistic-regression]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[statistics]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Thu, 10 Mar 2022 14:22:57 GMT</pubDate>
            <atom:updated>2022-03-12T22:50:57.270Z</atom:updated>
            <content:encoded><![CDATA[<h4>How to reduce the bias of logistic regression using Jeffreys Prior</h4><p>Suppose <strong>X</strong> denotes a matrix of regressors and <strong>y</strong>,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*wkRADfHKAZrkA5Es4kGXVw@2x.png" /></figure><p>denotes a vector of target values.</p><p>If we hypothesize that the data is generated from a logistic regression model, then our belief in weights <strong>w</strong> after seeing the data is given by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*j1YYoaI099Gv-MOLIqgoWQ@2x.png" /></figure><p>Here, P(<strong>y</strong>|<strong>X</strong>, <strong>w</strong>) is called the likelihood and is given by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*O6QjgRL5Ux2BmjuMD_b3uQ@2x.png" /></figure><p>And P(<strong>w</strong>) is called the prior and quantifies our belief in <strong>w</strong> before seeing data.</p><p>Frequently, logistic regression is fit to maximize P(<strong>y</strong>|<strong>X</strong>, <strong>w</strong>) without regard to the prior</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ZQ6VQTWIg9POTDjQlyZlAA@2x.png" /></figure><p>and <strong>w</strong>_ML is what we call a <em>maximum likelihood</em> estimate for the weights.</p><p>But disregarding the prior is generally a bad idea. Let’s see why.</p><h4>What’s Wrong with Maximum Likelihood Models</h4><p>Consider a concrete example.</p><p>Suppose you’re a biologist researching wolves. As part of your work, you’ve been tracking wolf packs and observing their hunting behavior.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3rJPM84L8qsYaC3GMdVCqw.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@bonopeppers?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Thomas Bonometti</a> on <a href="https://unsplash.com/s/photos/wolf-pack?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></figcaption></figure><p>You’d like to understand what makes a hunt successful. To that end, you fit a logistic regression model to predict whether a hunt will succeed or fail. It would be interesting to consider factors such as pack size and prey type, but to keep it simple you start with only a bias term.</p><p>Let <em>n</em> denote the total number of hunts. If <em>b</em> denotes the value of our<br>bias term, then the likelihood of observing <em>k</em> successful hunts is given by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ktYaxZbGaDyl7WCJrcypaw@2x.png" /></figure><p>Now, let’s suppose that we observe 8 hunts and record no successful hunts. (The hunting success rate for wolves is actually quite low, so such a result would not be surprising.) What would be a good working estimate for <em>b</em>? The maximum likelihood estimate would be -∞, yet we know that wouldn’t make sense: why bother hunting at all if there were no chance of success?</p><p>We can obtain a better working estimate if instead of selecting <em>b</em> to maximize likelihood, we select <em>b</em> to maximize the probability of the posterior distribution</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*bKMqOZYVyLtZn9_2y6U32A@2x.png" /></figure><p>with a suitably chosen prior P(b).</p>
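<p>To make this concrete, here is a small numerical sketch comparing the two estimates for the k = 0, n = 8 example. It jumps ahead a little: assuming P(b) is the Jeffreys-type prior proportional to √(p(1 − p)) discussed below, the MAP estimate works out to the closed form p = (k + ½)/(n + 1), and the sketch simply evaluates that alongside a numerical maximum likelihood fit. Treat it as an illustration under that assumption, not as code from the original post.</p><pre>
import numpy as np
from scipy.optimize import minimize_scalar

n, k = 8, 0  # observed hunts and successful hunts

def negative_log_likelihood(b):
    p = 1.0 / (1.0 + np.exp(-b))
    return -(k * np.log(p) + (n - k) * np.log(1.0 - p))

# Maximum likelihood: the optimizer runs off toward the boundary,
# reflecting the fact that the ML estimate for b diverges to -infinity.
ml = minimize_scalar(negative_log_likelihood, bounds=(-50.0, 50.0), method="bounded")
print("b_ML  =", ml.x)   # pinned near the lower bound

# MAP with a prior proportional to sqrt(p (1 - p)): p_MAP = (k + 1/2) / (n + 1)
p_map = (k + 0.5) / (n + 1)
b_map = np.log(p_map / (1.0 - p_map))
print("b_MAP =", b_map)  # roughly -2.83, a finite and usable estimate
</pre>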
<p>We’ll discuss the process for selecting priors later, but for now consider the prior</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*RcPZossYhsTZ7aNDN3F0lw@2x.png" /></figure><p>The posterior for this prior is</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*SoenVLOYFngc-7SvOdslRg@2x.png" /></figure><p>Put p = (1 + exp(-b))⁻¹. We can find the MAP estimate by taking the logarithm and differentiating</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*h2lVhaAHCNTxfXaunL5sfg@2x.png" /></figure><p>Setting the derivative to zero, we have</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*2fXq9vVpccYeg3fTnWd_2g@2x.png" /></figure><p>In our example, this then gives us the much more reasonable estimate</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*vHYDrAK2fc-O5cAz8NRPZA@2x.png" /></figure><p>The prior we used is an example of a noninformative prior.</p><h4>What Are Noninformative Priors and How Do We Find Them</h4><p>Note: The presentation here of noninformative priors closely follows chapter 2 of Box &amp; Tiao’s book <a href="https://onlinelibrary.wiley.com/doi/book/10.1002/9781118033197">Bayesian Inference in Statistical Analysis</a>.</p><p>Suppose we’re given a likelihood function L(θ|<strong>y</strong>) for a model with parameters θ and data <strong>y</strong>. What’s a good prior to use when we have no particular knowledge about θ?</p><p>Let’s look at a concrete example. Suppose data is generated from a normal distribution with known variance σ² but unknown mean θ. The likelihood function for this model is</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*usqPd4tHm3EfvIA_dOMUgg@2x.png" /></figure><p>And here’s a plot of the likelihood for n=5 and different values of y̅:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*NeHKPdcJRZsXuwmBfoly-g.png" /><figcaption>Likelihood function for the mean of normally distributed data with variance one, <br> n=5, and different values of y̅. Image by author.</figcaption></figure><p>Note how the likelihood curves have the same shape for all values of y̅ and vary only by translation. When the likelihood curve has this property we say that it is <em>data translated</em>.</p><p>Now consider two ranges of the same size [a, b] and [a + t, b + t]. Both would be equivalently placed relative to the likelihood curve if the observed mean were y̅+t instead of y̅. Thus, we should have</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*eMO-OxArFVHuMlbiiT-MaQ@2x.png" /></figure><p>or the uniform prior</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*HXrXQwNYO3lyVKx0zREKzg@2x.png" /></figure><p>Suppose we use an alternative parameterization for a likelihood function L(θ|<strong>y</strong>): we parameterize by <em>u</em> = ϕ(θ) and apply the uniform prior to <em>u</em>, where ϕ is some monotonic injective transformation. The cumulative distribution function is then given by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*KCgKiUTNIaHxIb81rkAdsA@2x.png" /></figure><p>where <em>Z</em> is some normalizing constant. Making the substitution <em>θ = ϕ⁻¹(u)</em> to go back to the original parameterization then gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*CIInDK3my_7u6xMRpXln7A@2x.png" /></figure><p>So we see that a uniform prior over <em>u</em> is equivalent to a prior <em>P(θ) ∝ ϕ’(θ)</em> over <em>θ</em>. 
And finding a noninformative prior for θ is equivalent to finding a transformation <em>ϕ</em> that makes the likelihood function data translated.</p><p>Let’s now return to the likelihood function for a logistic regression model with only a bias term</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*TBWg6XJrOSeUaUDbwYAPpg@2x.png" /></figure><p>Plotting the likelihood for <em>n=8</em> and different values of <em>k</em> gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*MRjEmKb_JmkZrpXOujsIbw.png" /><figcaption>Likelihood function for a logistic regression model with only a bias term, <br> n=8, and different values of k. Image by author.</figcaption></figure><p>It’s pretty clear from looking at the graph that the likelihood function is not data translated. The curve for <em>k=4</em> is more peaked than the others, and the curves for <em>k=1</em> and <em>k=6</em> are skewed.</p><p>Now, for suitably behaved likelihood functions, we can approximate L(θ|<strong>y</strong>) by<br>a Gaussian using a second-order Taylor series expansion of log L(θ|<strong>y</strong>) about<br>θ_ML</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*NaLUGMwaXeuTj7cVuqgGrw@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*HHRHf27B-k19bqu3zefMjQ@2x.png" /></figure><p>As n →∞, the approximation becomes more and more accurate.</p><p>Let’s see how this works for our simplified logistic regression model.</p><p>Differentiating log L(b|k) gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ExQiYn0nfCjpHtiUGARQSw@2x.png" /></figure><p>Differentiating again gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*cNIcV45c1Rvoy179GDYlfg@2x.png" /></figure><p>And here’s what the Gaussian approximation looks like</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*v1K4wKtL5QFVXFc9hEpWaw.png" /><figcaption>Likelihood function for a logistic regression model with only a bias term together<br> with its Gaussian approximation for n=8 and k=1. 
Image by author.</figcaption></figure><p>Now, consider what happens to our Gaussian approximation of L(θ|<strong>y</strong>) when we reparameterize with θ = ϕ⁻¹(u).</p><p>Differentiating gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ADo8qIPG_jc-l741ji-mfw@2x.png" /></figure><p>Differentiating again gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*xVr3Iww6Uiv_JuZm2F7_IA@2x.png" /></figure><p>Evaluating at a maximum u_ML = ϕ(θ_ML), we have</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*-vjdBKcPZQ-v4B2UqWpZrw@2x.png" /></figure><p>where we’ve applied</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*2kcsvxh82p0f8t2SBPDtLA@2x.png" /></figure><p>Returning to logistic regression, put</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*Z-VyKuh1sECeypXVnb34Wg@2x.png" /></figure><p>Then we have</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*u8o-w5mahEcFhEEATUuakQ@2x.png" /></figure><p>Thus, the Gaussian approximation about u_ML becomes</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*HLFFSv9Tdcu7qdRG3b6prA@2x.png" /></figure><p>so that the likelihood function is now approximately data translated.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*vEDnGKDzSR01R7C68uMoDQ.png" /><figcaption>Reparameterized likelihood function for a logistic regression model with only a bias term, <br> n=8, and different values of k. Image by author.</figcaption></figure><h4>Jeffreys Prior</h4><p>Let’s now give a brief sketch of Jeffreys prior for the multiparameter case.<br>Let <strong>θ</strong> = (θ₁, …, θ_k)^T denote the vector of parameters.</p><p>Similar to the single parameter case, the likelihood function can be approximated by a Gaussian about the optimum</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*KD-cFLr2lHI8YG_9vd9rBg@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*jrlDPvAx5HL4S8onXMxywg@2x.png" /></figure><p>Now, suppose we reparameterize with <strong>u</strong> = ϕ(<strong>θ</strong>). Then the updated hessian for the Gaussian approximation becomes</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*gAF64IQy91pxeDCAVVn5pw@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*O70PBuJKI9XLJvZOQqeMMA@2x.png" /></figure><p>In the general multivariable case, we may not be able to reparameterize in such a way that the Gaussian approximation becomes data translated, but suppose we select ϕ so that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*8HcD565cQBZ918PV2cbbVQ@2x.png" /></figure><p>Then</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*FlLl8vXc4MgE7AgYJdjNdw@2x.png" /></figure><p>Having a constant determinant means that the volume of regions will be preserved.</p><p>Let S ∈ ℝᵏ be a region about the origin and <strong>K</strong> be the covariance matrix<br>of the reparameterized Gaussian approximation about <strong>θ</strong>_ML. 
If L is the Cholesky factorization of <strong>K</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*xoYjznAFNIlV96WhRVZG1g@2x.png" /></figure><p>then S corresponds to the region L⁻¹S+<strong>θ</strong>_ML about <strong>θ</strong>_ML that has fixed probability mass and volume</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*oTFJJW7Ot0TLxcqYQM9DYw@2x.png" /></figure><p>The prior on <strong>θ</strong> that corresponds to a uniform prior on <strong>u</strong> is</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*GBJN1ZNj3ojpCMRg2UwiZQ@2x.png" /></figure><p>And this is called <em>Jeffreys prior</em>.</p><p>Let’s apply this to the full binomial logistic regression model with weights. The negative log likelihood function is given by</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*DaNNHFGE1Zhqqws2iV662w@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*5Xq_qtqp4BBpjaYc1dTU7Q@2x.png" /></figure><p>Differentiating gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*-eAnAg3B7z5jxqwBlWz-4g@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*Oh3FlFu4EddA4YWDPv8hqg@2x.png" /></figure><p>Differentiating again gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*XqV3Ex1W5Dajj3piIvQ4yw@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*TrNN1BiqW7wpJB_ubF9KRA@2x.png" /></figure><p>We can write the hessian as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*vEPdQcTggsGhcTogm7V6Fw@2x.png" /></figure><p>where <strong>A</strong> is the diagonal matrix with</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*tVpOs2CqOfrsMM6PoUZemA@2x.png" /></figure><p>The Jeffreys prior is then</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*zlk_0S3TuFVMHlW5WUxJIA@2x.png" /></figure><h4>Fitting Logistic Regression with Jeffreys Prior</h4><p>Finding the MAP model for logistic regression with the Jeffreys prior amounts to solving the optimization problem</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*Pf0-EAxODHGuueEKwm9gIw@2x.png" /></figure><p>Let π denote the log of the Jeffreys prior</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*mYVZpKgj1vH1WxsvvLXa9Q@2x.png" /></figure><p>If we can find equations for the gradient and hessian of π, then we can apply standard second-order optimization algorithms to find <strong>w</strong>_MAP.</p><p>Starting with the gradient, we can apply <a href="https://en.wikipedia.org/wiki/Jacobi%27s_formula">Jacobi’s Formula</a> for differentiating a determinant</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*8xP9teC4NGL_MHky-A_2vQ@2x.png" /></figure><p>to get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*fP4jIuglsUKTCzsL-EKRsA@2x.png" /></figure><p>where the derivative of <strong>A_w</strong> is the diagonal matrix</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*W3tgymtPaaxSnZYzvDq_SQ@2x.png" /></figure><p>For the hessian, we apply the formula for differentiating an inverse matrix</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*2g1jy28fSoXJarN2TskcAw@2x.png" /></figure><p>to get</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*OlTuYJvbH3BUeoJSp4GOZA@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*2ewsrqy_otYaDPMYQjPxEw@2x.png" /></figure>
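<p>Written out in code, the prior and its gradient are straightforward to evaluate. The following is a minimal NumPy transcription of the formulas above, written for clarity rather than efficiency; it is an illustration of the math, not the implementation used in the bbai package.</p><pre>
import numpy as np

def jeffreys_log_prior_and_grad(X, w):
    """Return 0.5 * log det(X^T A_w X) and its gradient with respect to w,
    where A_w = diag(p_i (1 - p_i)) and p_i = sigmoid(x_i . w)."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    a = p * (1.0 - p)                          # diagonal entries of A_w
    M = X.T @ (a[:, None] * X)                 # X^T A_w X
    _, log_det = np.linalg.slogdet(M)
    M_inv = np.linalg.inv(M)
    grad = np.empty(len(w))
    for j in range(len(w)):
        # dA_ii/dw_j = p_i (1 - p_i) (1 - 2 p_i) x_ij
        da = a * (1.0 - 2.0 * p) * X[:, j]
        # Jacobi's formula: d(log det M)/dw_j = trace(M^{-1} dM/dw_j)
        grad[j] = 0.5 * np.trace(M_inv @ (X.T @ (da[:, None] * X)))
    return 0.5 * log_det, grad
</pre>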
src="https://cdn-images-1.medium.com/max/600/1*2ewsrqy_otYaDPMYQjPxEw@2x.png" /></figure><p>By evaluation the gradient and hessian computations in an efficient manner and applying a second-order trust region optimizer, we can then quickly iterate to find <strong>w</strong>_MAP. A Python package using this approach is available at <a href="https://github.com/rnburn/bbai">https://github.com/rnburn/bbai</a>.</p><p>Let’s look at how this works on some examples.</p><h4>Example 1: Simulated Data with Single Variable</h4><p>We’ll start by looking at the prior and posterior for a simulation dataset with<br>a single regressor and no bias.</p><p>We begin by generating the dataset.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ab41baf98cdbb6e1e8fb9cdb42977970/href">https://medium.com/media/ab41baf98cdbb6e1e8fb9cdb42977970/href</a></iframe><p>We next add a function to compute the prior</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c312db6a2e9fdd693362ef93fe87d2e8/href">https://medium.com/media/c312db6a2e9fdd693362ef93fe87d2e8/href</a></iframe><p>And plot it out across a range of values</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/076336cc5d41b3353a2a0c22e4aecf2b/href">https://medium.com/media/076336cc5d41b3353a2a0c22e4aecf2b/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QNtACehncdAG1Q96VwnJtg.png" /><figcaption>Jeffreys prior for logistic regression on a simulation dataset with a single regressor. Image by author.</figcaption></figure><p>Next, we fit a logistic regression MAP model and compare w_MAP to<br>w_true.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/0d11a1b2f761f2497d893a8b8c3bb849/href">https://medium.com/media/0d11a1b2f761f2497d893a8b8c3bb849/href</a></iframe><p>Prints</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*9fEOzyWxILEMbJaEArFD7g@2x.png" /></figure><p>And finally, we plot out the posterior distribution.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b4731ea0202878d473da9cef413a2401/href">https://medium.com/media/b4731ea0202878d473da9cef413a2401/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*k7ZwoJpseJ4h3UtO2ucVTQ.png" /><figcaption>Posterior for logistic regression on a simulation dataset with a single regressor together with w_MAP. 
<p>The full example is available at <a href="https://github.com/rnburn/bbai/blob/master/example/05-jeffreys1.ipynb">github.com/rnburn/bbai/example/05-jeffreys1.ipynb</a>.</p><h4>Example 2: Simulated Data with Two Variables</h4><p>Let’s look next at an example with two variables.</p><p>We generate a data set with two variables.</p><p><a href="https://medium.com/media/50b2db79d99e13fe0fd58eb9251af99f/href">[Code listing]</a></p><p>We compute the prior and plot it across a range of values.</p><p><a href="https://medium.com/media/23faf03e404949cb3bbecb8bdd07b3be/href">[Code listing]</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cL0a2UhlFco7KINdDsYb_g.png" /><figcaption>Jeffreys prior for logistic regression on a simulation dataset with two regressors. Image by author.</figcaption></figure><p>Now, we’ll fit a model to find <strong>w</strong>_MAP.</p><p><a href="https://medium.com/media/d57cd25440d990a20ea11af0707d73ff/href">[Code listing]</a></p><p>This prints</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ow_WWP-a0A9ub8u6gnl-LA@2x.png" /></figure><p>And finally, we plot out the posterior distribution centered at <strong>w</strong>_MAP.</p><p><a href="https://medium.com/media/deb66ddc35bf085339667ed8df7b8ff9/href">[Code listing]</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*loTDDwnkOTZ5sxYO8BUXTw.png" /><figcaption>Posterior for logistic regression with Jeffreys prior on a simulation dataset with two regressors. Image by author.</figcaption></figure>
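<p>Again as a stand-in for the embedded cells (the full notebook is linked just below), here is a short sketch of how the two-regressor prior can be evaluated on a grid for a contour plot like the one above; the simulated design matrix is an arbitrary choice for illustration.</p><pre>
import numpy as np

# Arbitrary simulated design matrix with two regressors, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))

def jeffreys_prior(w):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    a = p * (1.0 - p)
    return np.sqrt(np.linalg.det(X.T @ (a[:, None] * X)))

# Evaluate the (unnormalized) prior on a grid; the result can be fed to
# a contour-plotting routine such as matplotlib's contourf.
grid = np.linspace(-5.0, 5.0, 101)
prior = np.array([[jeffreys_prior(np.array([w1, w2])) for w1 in grid] for w2 in grid])
</pre>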
<p>The full example is available at <a href="https://github.com/rnburn/bbai/blob/master/example/06-jeffreys2.ipynb">github.com/rnburn/bbai/example/06-jeffreys2.ipynb</a>.</p><h4>Example 3: Breast Cancer Data Set</h4><p>Now we’ll fit a model to the real-world breast cancer data set.</p><p>We begin by loading and preprocessing the data set.</p><p><a href="https://medium.com/media/67cadff5492ca9b11ef9fe303cde3105/href">[Code listing]</a></p><p>We then fit a logistic regression MAP model and use the Fisher information matrix to estimate the standard error.</p><p><a href="https://medium.com/media/be593840c91be31e653b112d3c68884e/href">[Code listing]</a></p><p>And we print out the weights along with their standard errors.</p><p><a href="https://medium.com/media/4b60f82c069f46053265658866772a82/href">[Code listing]</a></p><p><a href="https://medium.com/media/4a87420ef5f85011a29736712727674d/href">[Output listing]</a></p><p>The full example is available at <a href="https://github.com/rnburn/bbai/blob/master/example/07-jeffreys-breast-cancer.py">github.com/rnburn/bbai/example/07-jeffreys-breast-cancer.py</a>.</p><h4>Conclusion and Further Reading</h4><p>Unlike Ordinary Least Squares, the likelihood function for logistic regression is not data translated. Searching for a data translated transformation leads us to Jeffreys prior, a natural shrinkage prior.</p><p>Firth 1993 [1] and Kosmidis &amp; Firth 2021 [2] analyzed the statistical properties of the estimator that maximizes likelihood with the Jeffreys prior penalization and found it to have smaller asymptotic bias order than the standard maximum likelihood estimator. The reduced bias property of Jeffreys prior combined with its ability to handle the problem of separation [3] can make it an excellent drop-in replacement for the standard approach to fitting logistic regression models.</p><p><em>Blog post originally published at </em><a href="https://buildingblock.ai/logistic-regression-jeffreys"><em>https://buildingblock.ai/logistic-regression-jeffreys</em></a><em>.</em></p><h4>References</h4><p>[1] Firth, D. (1993). <a href="https://www2.stat.duke.edu/~scs/Courses/Stat376/Papers/GibbsFieldEst/BiasReductionMLE.pdf">Bias reduction of maximum likelihood estimates</a>. <em>Biometrika</em> 80, 27–38.</p><p>[2] Kosmidis, I. &amp; Firth, D. (2021). <a href="https://academic.oup.com/biomet/article/108/1/71/5880219">Jeffreys-prior penalty, finiteness and shrinkage in binomial-response generalized linear models</a>. <em>Biometrika</em> 108(1), 71–82.</p><p>[3] Heinze, G. &amp; Schemper, M. (2002). A solution to the problem of separation in logistic regression. <em>Statist. Med. 
</em>21, 2409–19.</p><hr><p><a href="https://medium.com/data-science/logistic-regression-and-the-missing-prior-7c36039af851">Logistic Regression and the Missing Prior</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Build a Bayesian Ridge Regression Model with Full Hyperparameter Integration]]></title>
            <link>https://medium.com/data-science/how-to-build-a-bayesian-ridge-regression-model-with-full-hyperparameter-integration-f4ac2bdaf329?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/f4ac2bdaf329</guid>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[bayesian-statistics]]></category>
            <category><![CDATA[optimization-and-ml]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Wed, 23 Feb 2022 04:53:23 GMT</pubDate>
            <atom:updated>2022-02-23T04:53:23.830Z</atom:updated>
            <content:encoded><![CDATA[<h4>How do we handle the hyperparameter that controls regularization strength?</h4><p><em>In this blog post, we’ll describe an algorithm for Bayesian ridge regression where the hyperparameter representing regularization strength is fully integrated over. An implementation is available at </em><a href="https://github.com/rnburn/bbai"><em>github.com/rnburn/bbai</em></a><em>.</em></p><p>Let <strong>θ</strong> = (σ², <strong>w</strong>) denote the parameters for a linear regression model with weights <strong>w </strong>and normally distributed errors of variance σ².</p><p>If <strong>X</strong> represents an n×p matrix of full rank with p regressors<br>and n rows, then <strong>θ </strong>specifies a probability distribution over possible<br>target values <strong>y</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*PjgbAebQOcEBqSJuWois9A@2x.png" /></figure><p>Suppose we observe <strong>y</strong> and assume <strong>y</strong> is generated from a linear model of unknown parameters. A Bayesian approach to inference seeks to quantify our belief in the unknown parameters <strong>θ </strong>given the observation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*dx1iOuM9hAHcyDERfC4F0g@2x.png" /></figure><p>Applying Bayes’ theorem, we can rewrite the probability as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*micV_ZRGAv6uAiCwBDIEag@2x.png" /></figure><p>where we refer to</p><ul><li>P(<strong>θ</strong>|<strong>y</strong>) as the posterior distribution</li><li>P(<strong>y</strong>|<strong>θ</strong>) as the likelihood function</li><li>P(<strong>θ</strong>) the prior distribution</li></ul><p>The prior distribution describes our belief of <strong>θ </strong>before observing <strong>y</strong> and the posterior distribution describes our updated belief after observing <strong>y</strong>.</p><p>Suppose the prior distribution can be expressed as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*eXApF2cfAZlRGJ4DcEqwCg@2x.png" /></figure><p>where h(⋅, η)<strong> </strong>denotes a family of probability distributions parameterized<br>by what we call a hyperparameter η.</p><p>Traditional approaches to Bayesian linear regression have used what are called <em>conjugate priors</em>. 
A family of priors h(⋅, η) is conjugate if the posterior also belongs to the family</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*b_bgQ1rvU5kOzbpqHw3o_w@2x.png" /></figure><p>Conjugate priors are mathematically convenient as successive observations can be viewed as making successive updates to the parameters of a family of distributions, but requiring h(⋅, η) to be conjugate is a strong assumption.</p><p>Note: See Appendix A for a more detailed comparison to other Bayesian algorithms.</p><p>We’ll instead describe an algorithm where</p><ol><li>Priors are selected to shrink <strong>w</strong>, reflecting the prior hypothesis that <strong>w</strong> is not<br>predictive, and to be approximately <em>noninformative</em> for other parameters.</li><li>We fully integrate over hyperparameters so that no arbitrary choice of η is required.</li></ol><p>Let’s first consider what it means for a prior to be noninformative.</p><h3>How to Pick Non-informative Priors?</h3><p>Note: The presentation here of non-informative priors closely follows chapter 2 of Box &amp; Tiao’s book <a href="https://onlinelibrary.wiley.com/doi/book/10.1002/9781118033197">Bayesian Inference in Statistical Analysis</a>.</p><p>Suppose <strong>y</strong> is data generated from a normal distribution of mean 0 but unknown variance. Let σ denote the standard deviation and let ℓ denote the likelihood function</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*EAj3oTHDTC4jiLqfgLRy_Q@2x.png" /></figure><p>Suppose we impose a uniform prior over σ so that the posterior is</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*aC-T9TwnhoiHW18U-OyYOg@2x.png" /></figure><p>and the cumulative distribution function is</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ay_BWDAndRtJEQUYcD6k9g@2x.png" /></figure><p>where N is some normalizing constant.</p><p>But now let’s suppose that instead of standard deviation we parameterize over variance. Making the substitution u=σ² into the cumulative distribution function gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*mgZW1HC5kaDNJQWwzZz2UA@2x.png" /></figure><p>Thus, we see that choosing a uniform prior over σ is equivalent to choosing the improper prior</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*JpLixHuIIJZIXqhkk9xNng@2x.png" /></figure><p>over variance. In general, suppose u=φ(θ) is an alternative way of parameterizing the likelihood function, where φ is some monotonic, one-to-one, onto function. 
Then a uniform prior over u is equivalent to choosing a prior</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*zVopJQbQnnJyc1vrudNnGg@2x.png" /></figure><p>over θ.</p><p>So, when does it make sense to use a uniform prior if the choice is sensitive to parameterization?</p><p>Let’s consider how the likelihood is affected by changes in the observed value <strong>y</strong>^⊤<strong>y</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*0LQzX6IhcBqLW_xUTCgASQ.png" /><figcaption>Likelihood function for the standard deviation of normally distributed data with zero mean, <br> n=10, and different values of <strong>y</strong>^⊤<strong>y</strong>.</figcaption></figure><p>As we can see from the figure, as <strong>y</strong>^⊤<strong>y</strong> is increased the shape of the likelihood function changes: it becomes less peaked and more spread out.</p><p>Observe that we can rewrite the likelihood function as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*KWHIvZLfgeB3mMwUApzxzA@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*BbehrckkKk-qgGbUdEiS8Q@2x.png" /></figure><p>Thus, in the parameter log σ, likelihood has the form</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*2jHfORtUhd2ir3LcHWPeZg@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ysAOjksaCW7uLEdLHfLLkA@2x.png" /></figure><p>And we say that the likelihood function is <em>data translated</em> in log σ because everything is known about the likelihood curve except the location and the value of the observation serves only to shift the curve.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*2vKl5BBFY_Jf5L2-9gWm7Q.png" /><figcaption>Likelihood function for the log standard deviation of normally distributed data with zero mean, <br> n=10, and different values of <strong>y</strong>^⊤<strong>y</strong>.</figcaption></figure><p>When the likelihood function is data translated in a parameter, then it makes sense to use a uniform function for a noninformative prior. Before observing data, we have no reason to prefer one range of parameters [a, b] over another range of the same width [t + a, t + b] because they will be equivalently placed relative to the likelihood curve if the observed data translates by t.</p><p>Now, let’s return to picking our priors.</p><h3>Picking Regularized Bayesian Linear Regression Priors</h3><p>For the parameter σ, we use the noninformative prior</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*2mbIBANh4PwSgRkpQngzmg@2x.png" /></figure><p>which is equivalent to using a uniform prior over the parameter log σ. For <strong>w</strong>, we want an informative prior that shrinks the weights, reflecting a prior belief that weights are non-predictive. Let η denote a hyperparameter that controls<br>the degree of shrinkage. Then we use the spherical normal distribution with covariance matrix (σ/λ_η)² <strong>I</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*25guNxURjVHRFLzVS19_9g@2x.png" /></figure><p>Note that we haven’t described yet how η parameterizes λ and we’ll also be integrating over η so we additionally have a prior for η (called a hyperprior) so that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*t8hEWoxM8udPNTz6s69ISQ@2x.png" /></figure><p>Our goal is for the prior P(η) to be noninformative. 
So we want to know: In what parameterization, would P(<strong>y</strong>|η) be data translated?</p><h3>How to Parameterize the shrinkage prior?</h3><p>Before thinking about how to parameterize λ, let’s characterize the likelihood function for λ.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*Hng7CuOIMNGq6f4PCUllFg@2x.png" /></figure><p>Expanding the integrand, we have</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*JKkkc2wsmg6fo_vIJBk17w@2x.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*M0CmOLauoQSzqYTg1ytPGg@2x.png" /></figure><p>and</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*OguPCRyzW-p6XlVN0s9jHg@2x.png" /></figure><p>Observe that #1 is independent of <strong>w</strong>, so the integral with respect to <strong>w</strong> is equivalent to integrating over an unnormalized Gaussian.</p><p>Note: Recall the formula for a normalized Gaussian integral</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*I68Lxpbtzr6iBb3YsWvlsQ@2x.png" /></figure><p>Thus,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*TErm5T17gCZMI6iWBbgK1w@2x.png" /></figure><p>Next let’s consider the integral over σ.</p><p>Put</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ZTaeGYsSeuesWhKpSBAaMQ@2x.png" /></figure><p>Then we can rewrite</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*817N2k5MyAMdRSDTwDv3Sw@2x.png" /></figure><p>After making a change of variables, we can express the integral with respect to σ as a form of the Gamma function.</p><p>Consider an integral</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*Pve3NBj0RvSeiqoTtHLjMg@2x.png" /></figure><p>Put</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*3lf8-jpXLVerHxYcIOpA8A@2x.png" /></figure><p>Then</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*9Xu6rJKFNDyN1kmIXZz8zQ@2x.png" /></figure><p>And</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*7kXGERcRPoAM0WQIUv3Nuw@2x.png" /></figure><p>where Γ denotes the Gamma function</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*b3EcpMAY9gFceJskk-_n5w@2x.png" /></figure><p>Thus, we can write</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*zmgvO_sqcjulEQxdj0-sdw@2x.png" /></figure><p>Let <strong>U</strong>, <strong>Σ</strong>, and <strong>V</strong> denote the singular value decomposition of <strong>X</strong><br>so that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*o9WKwIFiBk1fpoVfDe1WVQ@2x.png" /></figure><p>Let ξ₁, ξ₂, … denote the non-zero diagonal entries of <strong>Σ</strong>. Put <strong>Λ</strong>=<strong>Σ</strong>^⊤<strong>Σ</strong>. Then <strong>Λ </strong>is a diagonal matrix with entries ξ₁², ξ₂², … and</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*8PC5Gv-F5WZge95DWsjnBQ@2x.png" /></figure><p>Note: To implement our algorithm, we will only need the matrix <strong>V</strong> and the nonzero diagonal entries of <strong>Σ</strong>, which can be efficiently computed with a partial singular value decomposition. 
See the LAPACK function <a href="http://www.netlib.org/lapack/explore-html/d1/d7e/group__double_g_esing_gad8e0f1c83a78d3d4858eaaa88a1c5ab1.html">gesdd</a>.</p><p>Put</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*M2xreODZTRCl-BZOAV-j3w@2x.png" /></figure><p>Then we can rewrite h and g as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*RlTTv6Z67YhPrYBSDgIz1Q@2x.png" /></figure><p>And</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*M_V1PUpveNo5v_PwFFYcSg@2x.png" /></figure><p>Here we adopt the terminology</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*6tkzplN3d5e6guozrP1Uyg@2x.png" /></figure><p>Put</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*mhpJuYyVWgVb0OuR5rPsow@2x.png" /></figure><p>Then r is a monotonically increasing function that ranges from 0 to 1,<br>and we can think of r as the average shrinkage factor across the eigenvectors.</p><p>Now, let’s make an approximation by replacing individual eigenvector shrinkages with the average:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*PHssg06BEyfOr49eXkaHFg@2x.png" /></figure><p>Substituting the approximation into the likelihood then gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*5zGmIu1WTZO4772DmoidPA@2x.png" /></figure><p>We see that the approximated likelihood can be expressed as a function of</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*vFctgOMCcdVqdJhOmNhgcQ@2x.png" /></figure><p>and it follows that the likelihood is approximately data translated in log r(λ).</p><p>Thus, we can achieve an approximately noninformative prior if we<br>let η represent the average shrinkage, put</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*o6XWF37ShYFKN58X2-xQKA@2x.png" /></figure><p>and use the hyperprior</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*bSWCjmtqXNb8cQYsPeovJQ@2x.png" /></figure><p>Note: To invert r, we can use a standard root-finding algorithm.</p><p>Differentiating r(λ) gives us</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*hXlYnpYl5otUpoNf2uETvA@2x.png" /></figure><p>Using the derivative and a variant of Newton’s algorithm, we can then quickly iterate to a solution of r⁻¹(η).</p><h3>Making Predictions</h3><p>The result of fitting a Bayesian model is the posterior distribution P(<strong>θ</strong>|<strong>y</strong>). Let’s consider how we can use the distribution to make predictions given a new data point <strong>x</strong>’.</p><p>We’ll start by computing the expected target value</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*TlEmkk3YZK6LihjWywoBAA@2x.png" /></figure><p>And</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*5_Pm13o6VRkAyIU7fgqhPw@2x.png" /></figure><p>Note: To compute expected values of expressions involving η, we need to integrate over the posterior distribution P(η|<strong>y</strong>). 
We won’t have an analytical form for the integrals, but we can efficiently and accurately integrate with an <a href="https://en.wikipedia.org/wiki/Adaptive_quadrature">adaptive quadrature</a>.</p><p>To compute variance, we have</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*s0m8TGG3xm2QLVMQZmaOlw@2x.png" /></figure><p>and</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*7NiKAWkhwmF5tXCD3Sb-rQ@2x.png" /></figure><p>Starting with E[σ²|<strong>y</strong>], recall that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*x8JODEtnAo8IcC44whC2oA@2x.png" /></figure><p>And</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*vCbSXfAFJnkyVTePlhlUSA@2x.png" /></figure><p>Thus,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*KvnfRIBNAyfnuIKU6byxIg@2x.png" /></figure><p>Note: The Gamma function has the property</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*B_q3-1by7MA9rvPBBhHstw@2x.png" /></figure><p>So,</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*yhLLEca48slD1LQcCuIeUg@2x.png" /></figure><p>And</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*rAm647OvbxYAp98Or-yygg@2x.png" /></figure><p>It follows that</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*-Yq3arlaMQ3BwbR4wTF5ww@2x.png" /></figure><p>For E[<strong>w</strong> <strong>w</strong>^T|<strong>y</strong>], we have</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*TR9vL-QKB6p6ggDq2ydUBA@2x.png" /></figure><h3>Outline of Algorithm</h3><p>We’ve seen that computing statistics about predictions involves integrating over the posterior distribution P(η|<strong>y</strong>). We’ll briefly sketch out an algorithm for computing such integrals. We describe it only for computing the expected value of <strong>w</strong>. Other expected values can be computed with straightforward modifications.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*G1Y4nx92kYdehWWDtN8_jA@2x.png" /></figure><p>The procedure SHRINK-INVERSE applies Newton’s algorithm for root-finding with r and r’ to compute r⁻¹.</p><p>To compute the integral over the hyperparameter posterior, we make use of an adaptive quadrature algorithm. Quadratures approximate an integral by a weighted sum at different points of evaluation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*ouuKRns4DobEb3eGj-s2Mg@2x.png" /></figure><p>In general, the more points of evaluation used, the more accurate the approximation will be. Adaptive quadrature algorithms compare the integral approximation at different levels of refinement to estimate the error and increase the number of points of evaluation until a desired tolerance is reached.</p><p>Note: We omit the details of the quadrature algorithm used and describe it only at a high level.</p>
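<p>To illustrate the structure of this step, here is a small sketch that computes a posterior expectation over η on (0, 1) with SciPy's general-purpose adaptive quadrature. The density used in the quick check at the bottom is a stand-in, not the posterior derived above, and the actual implementation relies on the more specialized rules referenced next.</p><pre>
import numpy as np
from scipy.integrate import quad

def posterior_expectation(f, unnormalized_posterior):
    """E[f(eta) | y] for eta in (0, 1), via adaptive quadrature.

    unnormalized_posterior can be any function proportional to P(eta | y)."""
    numerator, _ = quad(lambda eta: f(eta) * unnormalized_posterior(eta), 0.0, 1.0)
    normalizer, _ = quad(unnormalized_posterior, 0.0, 1.0)
    return numerator / normalizer

# Quick check with a Beta(2, 5)-shaped stand-in posterior: E[eta] = 2/7.
print(posterior_expectation(lambda eta: eta,
                            lambda eta: eta * (1.0 - eta) ** 4))
</pre>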
<p>For more information on specific quadrature algorithms, refer to <a href="https://en.wikipedia.org/wiki/Gauss%E2%80%93Hermite_quadrature">Gauss-Hermite Quadratures</a> and <a href="https://en.wikipedia.org/wiki/Tanh-sinh_quadrature">Tanh-sinh Quadratures</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*uPbfMtO6fVBwTl5uKMW8GQ@2x.png" /></figure><h3>Experimental Results</h3><p>To better understand how the algorithm works in practice, we’ll set up a small<br>experimental data set, and we’ll compare a model fit with the Bayesian algorithm to Ordinary Least Squares (OLS) and to a ridge regression model fit so as to minimize error on a Leave-one-out Cross-validation (LOOCV) of the data set.</p><p>Full source code for the experiment is available at <a href="https://github.com/rnburn/bbai/tree/master/example/03-bayesian.py">github.com/rnburn/bbai/example/03-bayesian.py</a>.</p><h4>Generating the Data Set</h4><p>We’ll start by setting up the data set. For the design matrix, we’ll randomly generate a 20-by-10 matrix <strong>X</strong> using a Gaussian with zero mean and covariance matrix <strong>K</strong> where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*HxNHvGQYnK2KeCqzhjMUdg@2x.png" /></figure><p>We’ll generate a weight vector with 10 elements from a spherical Gaussian with unit variance, and we’ll rescale the weights so that the signal variance is equal to 1.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*3k6KTPLaHN83u3x3ox1LcQ@2x.png" /></figure><p>Then we’ll set</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*nmezUB4sF8rtn2iyAJJzIQ@2x.png" /></figure><p>where <strong>ε</strong> is a vector of 20 elements taken from a Gaussian with unit variance.</p><p>Here’s the Python code that sets up the data set.</p><p><a href="https://medium.com/media/f6b0e90a68aef8629029df4e3c5f844e/href">[Code listing]</a></p><h4>Fitting Models</h4><p>Now, we’ll fit a Bayesian ridge regression model, an OLS model, and a ridge regression model with the regularization strength set so that mean squared error is minimized on a LOOCV.</p><p><a href="https://medium.com/media/c0cf705c061dcd57fe537e9af5a0fdcd/href">[Code listing]</a></p><p>We can measure the out-of-sample error variance for each model using the equation</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*xq2wxGURUVuKt4agW8IZrw@2x.png" /></figure><p><a href="https://medium.com/media/3e5f5930b4256374e6bc6c795e5498a9/href">[Code listing]</a></p><p>Doing this gives us</p><p><a href="https://medium.com/media/3129c728a67f00f0ae955eb3f6ef96c0/href">[Output listing]</a></p><p>We see that both the Bayesian and ridge regression models are able to prevent overfitting and achieve better out-of-sample results.</p><p>Finally, we’ll compare the estimated noise variance from the Bayesian model to that from the OLS model.</p><p><a href="https://medium.com/media/22191e4c3854ec41d4542f90fb4785a1/href">[Code listing]</a></p>
href="https://medium.com/media/22191e4c3854ec41d4542f90fb4785a1/href">https://medium.com/media/22191e4c3854ec41d4542f90fb4785a1/href</a></iframe><p>This gives us</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5062fcb761254ffad917d749827a010a/href">https://medium.com/media/5062fcb761254ffad917d749827a010a/href</a></iframe><h3>Conclusion</h3><p>When applying Bayesian methods to ridge regression, we need to address: how do we handle the hyperparameter that controls regularization strength?</p><p>One option is to use a point estimate, where a value of the hyperparameter is chosen to optimize some metric (e.g. likelihood or a cross-validation). But such an approach goes against the Bayesian methodology of using probability distributions expressing belief for parameters, particularly if the likelihood is not strongly peaked about a particular value of the hyperparameter.</p><p>Another option commonly used is to apply a known distribution for the hyperparameter prior (for example a half-Cauchy distribution). Such an approach gives us a posterior distribution for the hyperparameter and can work fairly well in practice, but the choice of prior distribution is somewhat arbitrary.</p><p>In this blog post, we showed a different approach where we selected a hyperparameter prior distribution so as to be approximately noninformative. We developed algorithms to compute standard prediction statistics given the noninformative prior, and we demonstrated on a small example that it compared favorably to using a point estimate of the hyperparameter selected to optimize a leave-one-out cross-validation.</p><h3>Appendix A: Comparison with Other Bayesian Algorithms</h3><p>Here, we’ll give a brief comparison of the algorithm presented to scikit-learn’s algorithm for <a href="https://scikit-learn.org/stable/modules/linear_model.html#bayesian-ridge-regression">Bayesian ridge regression</a>.</p><p>Scikit-learn’s algorithm makes use of conjugate priors and because of that is restricted to use the Gamma prior which requires four hyperparameters chosen arbitrarily to be small values. 
Additionally, it requires initial values for the parameters α and λ that are then updated from the data.</p><p>The algorithm’s performance can be sensitive to the choice of values for these parameters, and scikit-learn’s documentation provides a curve-fitting <a href="https://scikit-learn.org/stable/auto_examples/linear_model/plot_bayesian_ridge_curvefit.html">example</a> where the defaults perform poorly.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d78d5193b1d2ce83bc16c922fd3c44e6/href">https://medium.com/media/d78d5193b1d2ce83bc16c922fd3c44e6/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/1*ApxrpYULBMkGW9Sbd-e9gg.png" /></figure><p>In comparison, the algorithm we presented requires no initial parameter values, and because the hyperparameter is integrated over, poorly performing values contribute little to the posterior probability mass.</p><p>We can see that our approach handles the curve-fitting example without requiring any tweaking.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/d086c11e3eb27ddc16d4b8b63e1c6559/href">https://medium.com/media/d086c11e3eb27ddc16d4b8b63e1c6559/href</a></iframe><figure><img alt="" src="https://cdn-images-1.medium.com/max/360/1*hh7FMoSy5wrLnpfcW3da9g.png" /></figure><p>The full curve-fitting comparison example is available at <a href="https://github.com/rnburn/bbai/blob/master/example/04-curve-fitting.ipynb">github.com/rnburn/bbai/blob/master/example/04-curve-fitting</a>.</p><p><em>This blog post was originally published at </em><a href="https://buildingblock.ai/bayesian-ridge-regression"><em>buildingblock.ai/bayesian-ridge-regression</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=f4ac2bdaf329" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/how-to-build-a-bayesian-ridge-regression-model-with-full-hyperparameter-integration-f4ac2bdaf329">How to Build a Bayesian Ridge Regression Model with Full Hyperparameter Integration</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Why Standard C++ Math Functions Are Slow]]></title>
            <link>https://itnext.io/why-standard-c-math-functions-are-slow-d10d02554e33?source=rss-f55ad0a8217------2</link>
            <guid isPermaLink="false">https://medium.com/p/d10d02554e33</guid>
            <category><![CDATA[cplusplus]]></category>
            <category><![CDATA[performance]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[Ryan Burn]]></dc:creator>
            <pubDate>Sun, 27 Dec 2020 03:37:05 GMT</pubDate>
            <atom:updated>2021-12-09T17:59:23.028Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9URxUNoctIyySKJV6dt4CA.png" /><figcaption>Assembly for a C++ function computing square roots</figcaption></figure><p>Performance has always been a high priority for C++, yet there are many examples both in the language and the standard library where compilers produce code that is significantly slower than what a machine is capable of. In this blog post, I’m going to explore one such example from the standard math library.</p><p>Suppose we’re tasked with computing the square roots of an array of floating point numbers. We might write a function like this to perform the operation:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/c46f885539cec764826566e2ef42edf5/href">https://medium.com/media/c46f885539cec764826566e2ef42edf5/href</a></iframe><p>If we’re using gcc, we can compile the code with</p><pre>g++ -c -O3 -march=native sqrt1.cpp</pre><p>With -O3, gcc will optimize the code heavily but will still produce code that is standard compliant. The -march=native option tells gcc to produce code targeting the native architecture’s instruction set. The resulting binaries may not be portable even between different x86-64 CPUs.</p><p>Now, let’s benchmark the function. We’ll use <a href="https://github.com/google/benchmark">google benchmark</a> to measure how long it takes to compute the square roots of 1,000,000 numbers:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/477ef4b70f73d2bd4773371e3bc2c734/href">https://medium.com/media/477ef4b70f73d2bd4773371e3bc2c734/href</a></iframe><p>Compiling our benchmark and running it, we get</p><pre>g++ -O3 -march=native -o benchmark benchmark.cpp sqrt1.o<br>./benchmark<br>Running ./benchmark<br>Run on (6 X 2600 MHz CPU s)<br>CPU Caches:<br> L1 Data 32 KiB (x6)<br> L1 Instruction 32 KiB (x6)<br> L2 Unified 256 KiB (x6)<br> L3 Unified 9216 KiB (x6)<br>Load Average: 0.17, 0.07, 0.05<br>-----------------------------------------------------------<br>Benchmark Time CPU Iterations<br>-----------------------------------------------------------<br>BM_Sqrt1/1000000 4984457 ns 4946631 ns 115</pre><p>Can we do better? Let’s try this version:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ee9de43635c977c3177bfa21b1d200cd/href">https://medium.com/media/ee9de43635c977c3177bfa21b1d200cd/href</a></iframe><p>and compile with</p><pre>g++ -c -O3 -march=native -fno-math-errno sqrt2.cpp</pre><p>The only difference between compute_sqrt1 and compute_sqrt2 is that we added the extra option -fno-math-errno when compiling. 
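Concretely, both versions look roughly like this (an illustrative sketch; the exact code in the original post may differ slightly): sqrt1.cpp defines compute_sqrt1, and sqrt2.cpp defines compute_sqrt2 with the same body.</p><pre>// Illustrative sketch of sqrt1.cpp / sqrt2.cpp: loop over an array and
// take square roots. Only the function name and compiler flags differ.
#include &lt;cmath&gt;
#include &lt;cstddef&gt;

void compute_sqrt1(const double* x, std::size_t n, double* result) {
  for (std::size_t i = 0; i &lt; n; ++i) {
    result[i] = std::sqrt(x[i]);
  }
}</pre><p>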
I’ll explain later what -fno-math-errno does; but for now, I’ll only point out that the produced code is no longer standard compliant.</p><p>Let’s benchmark compute_sqrt2.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/f74ac7944b974f1a8e1da8a72526298b/href">https://medium.com/media/f74ac7944b974f1a8e1da8a72526298b/href</a></iframe><p>Running</p><pre>g++ -O3 -march=native -o benchmark benchmark.cpp sqrt2.o<br>./benchmark</pre><p>we get</p><pre>Running ./benchmark<br>Run on (6 X 2600 MHz CPU s)<br>CPU Caches:<br> L1 Data 32 KiB (x6)<br> L1 Instruction 32 KiB (x6)<br> L2 Unified 256 KiB (x6)<br> L3 Unified 9216 KiB (x6)<br>Load Average: 0.17, 0.07, 0.05<br>-----------------------------------------------------------<br>Benchmark Time CPU Iterations<br>-----------------------------------------------------------<br>BM_Sqrt2/1000000 1195070 ns 1192078 ns 553</pre><p>Yikes! compute_sqrt2 is <em>more than 4 times faster</em> than compute_sqrt1.</p><p>What’s different? Let’s drill down into the assembly to find out. We can produce the assembly for the code by running</p><pre>g++ -S -c -O3 -march=native sqrt1.cpp<br>g++ -S -c -O3 -march=native -fno-math-errno sqrt2.cpp</pre><p>The result will depend on what architecture you’re using, but looking at <a href="https://github.com/rnburn/cmath-benchmark/blob/main/asm/sqrt1.s">sqrt1.s</a> on my architecture, we see this section</p><pre>.L3:<br> vmovsd (%rdi), %xmm0<br> vucomisd %xmm0, %xmm2<br> vsqrtsd %xmm0, %xmm1, %xmm1<br> ja .L12<br> addq $8, %rdi<br> vmovsd %xmm1, (%rdx)<br> addq $8, %rdx<br> cmpq %r12, %rdi<br> jne .L3</pre><p>Let’s break down the first few instructions:</p><pre>1: vmovsd (%rdi), %xmm0 <br> # Load a value from memory into the register %xmm0<br>2: vucomisd %xmm0, %xmm2<br> # Compare the value of %xmm0 with %xmm2 and set the register<br> # EFLAGS with the result<br>3: vsqrtsd %xmm0, %xmm1, %xmm1 <br> # Compute the square root of %xmm0 and store in %xmm1<br>4: ja .L12 <br> # Inspects EFLAGS and jumps if %xmm2 is above %xmm0</pre><p>What are instructions 2 and 4 for? Recall that for real numbers, sqrt is undefined on negative values. When std::sqrt is passed a negative number, the C++ standard requires that it return the special floating point value NaN and that it set the global variable errno to EDOM. But that error handling ends up being really expensive.</p><p>If we look at <a href="https://github.com/rnburn/cmath-benchmark/blob/main/asm/sqrt2.s">sqrt2.s</a>, we see these instructions for the main loop:</p><pre>.L6:<br> addl $1, %r8d<br> vsqrtpd (%r10,%rax), %ymm0<br> vextractf128 $0x1, %ymm0, 16(%rcx,%rax)<br> vmovups %xmm0, (%rcx,%rax)<br> addq $32, %rax<br> cmpl %r8d, %r11d<br> ja .L6</pre><p>Without the burden of having to do error handling, gcc can produce much faster code. vsqrtpd is what’s known as a Single Instruction Multiple Data (SIMD) instruction. It computes the square root of four double precision floating point numbers at a time. For computationally expensive functions like sqrt, vectorization helps a lot.</p><p>It’s unfortunate that the standard requires such error handling. It’s so much slower to do the error checking that many compilers like Intel’s icc and Apple’s default clang-based compiler opt out of the error handling by default. 
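To see the mandated behavior directly, a small test along the following lines (a sketch, assuming gcc with glibc and the default -fmath-errno behavior) should report a NaN result with errno set to EDOM:</p><pre>// Illustrative check of the standard-mandated error handling.
#include &lt;cerrno&gt;
#include &lt;cmath&gt;
#include &lt;cstdio&gt;

int main() {
  // volatile keeps the compiler from folding the sqrt call away.
  volatile double x = -1.0;
  errno = 0;
  double result = std::sqrt(x);
  std::printf("isnan: %d, errno == EDOM: %d\n",
              std::isnan(result) ? 1 : 0, errno == EDOM ? 1 : 0);
}</pre><p>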
Even if we want std::sqrt to do error handling, we can’t portably rely on major compilers to do so.</p><p><em>The complete benchmark can be found at </em><a href="https://github.com/rnburn/cmath-benchmark"><em>rnburn/cmath-benchmark</em></a><em>. This story was originally published at </em><a href="https://ryanburn.com/2020/12/26/why-c-standard-math-functions-are-slow/">https://ryanburn.com/2020/12/26/why-c-standard-math-functions-are-slow/</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d10d02554e33" width="1" height="1" alt=""><hr><p><a href="https://itnext.io/why-standard-c-math-functions-are-slow-d10d02554e33">Why Standard C++ Math Functions Are Slow</a> was originally published in <a href="https://itnext.io">ITNEXT</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>