16th March, 2026
We analyse an efficient variation on left-shift binary GCD algorithms, and show how floating-point representations can be used to increase throughput for 32-bit and 64-bit values on NVIDIA Ampere and later.
While optimising Greatest Common Divisor for Tenstorrent, I happened across various alternative binary GCD algorithms that involve left-shifting instead of right-shifting.
The earliest reference I could find is R. P. Brent, “Analysis of the binary Euclidean algorithm” (1976). The following is a modern rendition, using clz to compute the exact left-shift amount instead of using a loop:
```python
def clz(x, width=64):
    # count leading zeros in a width-bit word; only differences of
    # clz values are used below, so the choice of width cancels out
    return width - x.bit_length()

def brent_gcd(a, b):
    # invariant: a ≥ b
    if a < b:
        a, b = b, a
    while b != 0:
        e = clz(b) - clz(a)
        t = b << e
        if t > a:
            # e -= 1
            t >>= 1
        a -= t
        if a < b:
            a, b = b, a
    return a
```

Shallit & Sorenson analyse a variant of this in “Analysis of a Left-Shift Binary Greatest Common Divisor (GCD) Algorithm”, with the addition of a min(a - t, (t << 1) - a) step.
```python
def shallit_gcd(a, b):
    # invariant: a ≥ b
    if a < b:
        a, b = b, a
    while b != 0:
        # find e ≥ 0 such that 2**e b ≤ a ≤ 2**(e+1) b
        e = clz(b) - clz(a)
        t = b << e
        if t > a:
            # e -= 1
            t >>= 1
        a, b = b, min(a - t, (t << 1) - a)
        # maintain invariant a ≥ b
        if a < b:
            a, b = b, a
    return a
```

Shallit & Sorenson’s variant has some nice properties, such as a worst-case number of iterations similar to Stein’s (right-shift) binary GCD, i.e. just over k iterations for k-bit inputs, and fewer iterations in the average case.
| Variant | Average Case | Worst Case |
|---|---|---|
| Stein (right-shift) | 0.7060 k | k |
| Brent | 0.8758 k | 1.4404 k |
| Shallit & Sorenson | ~0.67 k | 1.0527 k |
However, it’s considerably more expensive than Stein’s algorithm due to effectively computing and testing three possible shifted values.
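For reference, here is a minimal Python rendition of Stein’s right-shift binary GCD; the function name and the bit-twiddling ctz idiom `(x & -x).bit_length() - 1` are mine:

```python
def stein_gcd(a, b):
    # Stein's right-shift binary GCD for non-negative integers
    if a == 0 or b == 0:
        return a | b
    # strip the common power of two
    shift = ((a | b) & -(a | b)).bit_length() - 1
    a >>= (a & -a).bit_length() - 1      # make a odd
    while b != 0:
        b >>= (b & -b).bit_length() - 1  # make b odd
        if a > b:
            a, b = b, a
        b -= a                           # even result; the next pass shifts it
    return a << shift

assert stein_gcd(12, 18) == 6
```

Each iteration performs a ctz, a shift, a min/max pair, and a subtraction, which is the six-instruction update sequence shown in SASS form later in the post.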
Consider the following variant for the hot loop, which uses only a single shift value to align the MSBs, followed by using the absolute value of the difference.
```python
# assume a ≥ b
e = clz(b) - clz(a)
t = b << e
a = abs(a - t)
b, a = swap_min_max(b, a)  # restore invariant
```

This hot loop is considerably more efficient, but at the expense of requiring slightly more iterations for random input values: an average of ~24 steps compared to ~22 steps for Stein’s algorithm for 32-bit values.
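Packaged as a complete routine that also counts iterations, the variant looks like the following sketch; the function name is mine, and `a.bit_length() - b.bit_length()` stands in for `clz(b) - clz(a)`:

```python
def abs_gcd(a, b):
    # single-shift variant: align MSBs, take the absolute difference.
    # assumes a, b > 0; returns (gcd, number of loop iterations)
    if a < b:
        a, b = b, a
    steps = 0
    while b != 0:
        e = a.bit_length() - b.bit_length()  # = clz(b) - clz(a)
        t = b << e        # align MSB of b with MSB of a
        a = abs(a - t)    # subtracting a multiple of b preserves the gcd
        if a < b:         # restore invariant a ≥ b
            a, b = b, a
        steps += 1
    return a, steps

assert abs_gcd(12, 18)[0] == 6
```

Correctness follows because t is a power-of-two multiple of b, so `abs(a - t)` changes neither divisors common with b; termination follows because the larger operand strictly decreases each iteration.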
Using GPT-5.4, I wrote an exact reverse branch-and-bound solver for the reduced state graph, computing exact maxima for all k ≤ 128. The pruning rule relies on a strong upper bound on the remaining number of steps, which GPT-5.4 also formally verified with Lean.
The worst case is no more than $\left\lfloor \frac{5k - 1}{3} \right\rfloor$ steps for k-bit inputs. This makes the variant unsuitable for Tenstorrent’s vector engine, which has no easy way to exit a loop early: the loop must always run for the worst-case number of iterations.
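The bound can be sanity-checked exhaustively for small widths; this is a brute-force check of loop-iteration counts over all k-bit pairs, not the branch-and-bound solver itself:

```python
def steps(a, b):
    # loop-iteration count of the single-shift abs-difference variant
    if a < b:
        a, b = b, a
    n = 0
    while b != 0:
        t = b << (a.bit_length() - b.bit_length())
        a = abs(a - t)
        if a < b:
            a, b = b, a
        n += 1
    return n

# worst case over all k-bit pairs stays within floor((5k - 1) / 3)
for k in range(1, 9):
    worst = max(steps(a, b)
                for a in range(1, 1 << k)
                for b in range(1, a + 1))
    assert worst <= (5 * k - 1) // 3
```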
Observe that instead of left-shifting integers to align MSBs, we can instead represent values in fp32 or fp64, and set the exponent of the smaller value to that of the larger value. This can be achieved efficiently via a single lop3 operation, since the mantissa field of the smaller value and the exponent field of the larger value occupy disjoint bit ranges.

This only works for integers that are exactly representable in floating point, i.e. u24 for fp32 and u53 for fp64.

The algorithm then performs a floating-point subtraction, followed by fmin/fmax to restore the ordering invariant.
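The exponent splice and the fmin/fmax update can be simulated on the CPU. Below is a sketch of the fp32/u24 core in Python, reinterpreting fp32 bit patterns with struct; the helper and function names are mine:

```python
import math
import struct

def f2u(x):
    # fp32 bit pattern of a float
    return struct.unpack('<I', struct.pack('<f', x))[0]

def u2f(u):
    # float value of an fp32 bit pattern
    return struct.unpack('<f', struct.pack('<I', u))[0]

def fp32_gcd_core(a, b):
    # a, b: floats holding u24 integers with a ≥ b > 0
    while b != 0.0:
        # splice a's sign/exponent with b's mantissa (one lop3 on the GPU);
        # for normalised values this equals the MSB-aligning shift b << e
        t = u2f((f2u(a) & 0xFF800000) | (f2u(b) & 0x007FFFFF))
        d = abs(a - t)               # exact: both operands fit in u24
        a, b = max(b, d), min(b, d)  # fmin/fmax restore a ≥ b
    return int(a)

assert fp32_gcd_core(13371337.0, 4242.0) == math.gcd(13371337, 4242)
```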
On NVIDIA Ampere, the fp32/u24 update lowers to the following four-instruction SASS sequence. The subtraction appears as FADD with a negated source operand, and the absolute value is folded into the FMNMX source modifiers:

```
LOP3.LUT R2, R8, 0x7fffff, R7, ... ; INT: aligned_b = {a.Exp, b.Man}
FADD R3, -R2, R7                   ; FP: d = a - aligned_b
FMNMX R7, |R3|, R8, !PT            ; FP: a = max(b, abs(d))
FMNMX R8, |R3|, R8, PT             ; FP: b = min(b, abs(d))
```

For comparison, this is the corresponding six-instruction Stein update sequence:
```
BREV R2, R4                           ; INT: reverse bits of d
FLO.U32.SH R3, R2                     ; INT: c = ctz(d)
SHF.R.U32.HI R6, RZ, R3, R4           ; INT: v = d >> c
IMNMX.U32 R8, R6.reuse, R7.reuse, !PT ; INT: hi = max(v, b)
IMNMX.U32 R7, R6, R7, PT              ; INT: b = min(v, b)
IMAD.IADD R4, R8, 0x1, -R7            ; INT: d = hi - b
```

The main benefit is that the floating-point formulation requires fewer instructions. However, on NVIDIA architectures where integer and floating-point instructions issue on different execution resources, it can also improve utilisation by splitting the work across both kinds of pipeline. In particular, where FP32 throughput exceeds INT32 throughput, the advantage grows further.
In order to support 32-bit and 64-bit inputs, we use hybrid variants: strip common powers of two as in Stein, run 8 Stein frontend iterations for u32 or 11 for u64, and then switch to the floating-point core once the operands are in the exact u24 or u53 regime.
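A sketch of the u32 hybrid in Python, modelling the floating-point core with exact integer arithmetic (legitimate because fp32 is exact on u24 values); the function name and loop structure are my reading of the description above, and the integer core remains correct even for operands wider than u24:

```python
import math

def hybrid_gcd_u32(a, b):
    # assumes 0 < a, b < 2**32
    # strip the common power of two, as in Stein
    shift = ((a | b) & -(a | b)).bit_length() - 1
    a >>= (a & -a).bit_length() - 1      # make a odd
    # frontend: up to 8 Stein iterations to reach the u24 regime
    for _ in range(8):
        if b == 0:
            break
        b >>= (b & -b).bit_length() - 1  # make b odd
        if a > b:
            a, b = b, a
        b -= a                           # even; the next pass shifts it
    # core: MSB-align-and-subtract, an exact integer model of the fp32 loop
    if a < b:
        a, b = b, a
    while b != 0:
        t = b << (a.bit_length() - b.bit_length())
        d = abs(a - t)
        a, b = max(b, d), min(b, d)
    return a << shift

assert hybrid_gcd_u32(12, 18) == math.gcd(12, 18)
```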
Modal benchmark summary for count=1048576, rounds=1024, launches=20, block_size=256:

| GPU | u24 speedup | u32 speedup | u53 speedup | u64 speedup |
|---|---|---|---|---|
| NVIDIA Tesla T4 | 1.33x | 1.24x | 0.13x | 0.17x |
| NVIDIA A100-SXM4-80GB | 1.48x | 1.39x | 1.43x | 1.36x |
| NVIDIA A100 80GB PCIe | 1.67x | 1.56x | 1.44x | 1.38x |
| NVIDIA H100 80GB HBM3 | 1.71x | 1.50x | 1.58x | 1.40x |
| NVIDIA B200 | 1.83x | 1.44x | 1.55x | 1.40x |
Consumer GeForce cards such as the RTX 3090 and RTX 4090 have severely reduced fp64 throughput (typically 1/64 of fp32). Likewise, the T4 does not benefit for u53 or u64, since its fp64 throughput is also severely rate-limited.
Supporting code is available, including an exact branch-and-bound search for the maximum number of steps for k-bit inputs, and a formally-verified proof of a bound used in the search.
Adam P. Goucher suggested extending the loop predicate with an early exit on b == 1 to optimise for the case where the GCD has no odd prime factors, which happens with probability $8 / \pi^2 \approx 81\%$; I have not done this here, to keep the comparison simpler.
Further reading: Daniel Lemire, “Fastest way to compute the greatest common divisor” and “Greatest common divisor, the extended Euclidean algorithm, and speed”.
Further reading: Algorithmica, “Greatest Common Divisor”.