Floating Point Binary GCD

16th March, 2026

Introduction

We analyse an efficient variation on left-shift binary GCD algorithms, and show how floating-point representations can be used to increase throughput for 32-bit and 64-bit values on NVIDIA Ampere and later.

Left-Shift Binary GCD Variants

While optimising Greatest Common Divisor for Tenstorrent, I happened across various alternative binary GCD algorithms that involve left-shifting instead of right-shifting.

The earliest reference I could find is R. P. Brent, “Analysis of the binary Euclidean algorithm” (1976). The following is a modern rendition, using clz (count leading zeros) to compute the exact left-shift amount instead of finding it with a loop:

def clz(x, width=32):
    # count leading zeros of a width-bit value; only the difference
    # clz(b) - clz(a) matters below, so the choice of width cancels out
    return width - x.bit_length()

def brent_gcd(a, b):
    # invariant: a ≥ b
    if a < b:
        a, b = b, a
    while b != 0:
        e = clz(b) - clz(a)
        t = b << e
        if t > a:
            # e -= 1
            t >>= 1
        a -= t
        if a < b:
            a, b = b, a
    return a

Shallit & Sorenson analyse a variant of this in “Analysis of a Left-Shift Binary Greatest Common Divisor (GCD) Algorithm”, which adds a min(a - t, (t << 1) - a) step.

def shallit_gcd(a, b):
    # invariant: a ≥ b
    if a < b:
        a, b = b, a
    while b != 0:
        # find e ≥ 0 such that 2**e b ≤ a ≤ 2**(e+1) b
        e = clz(b) - clz(a)
        t = b << e
        if t > a:
            # e -= 1
            t >>= 1
        a, b = b, min(a - t, (t << 1) - a)
        # maintain invariant a ≥ b
        if a < b:
            a, b = b, a
    return a

Shallit & Sorenson’s variant has some nice properties, such as a worst-case number of iterations similar to Stein’s (right-shift) binary GCD, i.e. just over k iterations for k-bit inputs, and fewer iterations in the average case.

Variant               Average Case   Worst Case
Stein (right-shift)   0.7060 k       k
Brent                 0.8758 k       1.4404 k
Shallit & Sorenson    ~0.67 k        1.0527 k

However, each iteration is considerably more expensive than Stein’s, since it effectively computes and tests three candidate shifted values.

Single-Shift Absolute-Difference Variant

Consider the following variant for the hot loop, which uses only a single shift value to align the MSBs, followed by using the absolute value of the difference.

# assume a ≥ b
e = clz(b) - clz(a)
t = b << e                    # align the MSB of b with the MSB of a
a = abs(a - t)
a, b = max(a, b), min(a, b)   # restore the invariant a ≥ b
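For reference, here is the hot loop wrapped into a complete function, a sketch in which clz is expressed via Python’s int.bit_length (only the difference of the two counts matters, so the word width cancels):

```python
def single_shift_gcd(a, b):
    # invariant: a ≥ b
    if a < b:
        a, b = b, a
    while b != 0:
        # clz(b) - clz(a) == a.bit_length() - b.bit_length()
        e = a.bit_length() - b.bit_length()
        t = b << e                    # align MSBs; t may exceed a
        a = abs(a - t)                # gcd is preserved since t = b * 2**e
        a, b = max(a, b), min(a, b)   # restore the invariant a ≥ b
    return a
```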

This hot loop is considerably more efficient, but at the expense of requiring slightly more iterations for random input values: an average of ~24 steps compared to ~22 steps for Stein’s algorithm for 32-bit values.

Using GPT-5.4, I wrote an exact reverse branch-and-bound solver for the reduced state graph, computing exact maxima for all k ≤ 128. The pruning rule relies on a strong upper bound on the remaining number of steps, which GPT-5.4 also formally verified with Lean.

The worst case is no more than ⌊(5k − 1)/3⌋ steps for k-bit inputs. That bound makes the variant unsuitable for Tenstorrent’s vector engine, which has no easy way to exit a loop early, so the loop must always run for the worst-case number of iterations.

Floating Point Variants on NVIDIA

Observe that instead of left-shifting integers to align MSBs, we can instead represent values in fp32 or fp64, and set the exponent of the smaller value to that of the larger value. This can be achieved efficiently via a single lop3 operation, since the mantissa field of the smaller value and the exponent field of the larger value occupy disjoint bit ranges.

This only works for integers that are exactly representable in floating point, i.e. u24 for fp32 and u53 for fp64.
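The fp32 boundary is easy to demonstrate by rounding a Python float through binary32 with struct: values are exact up to 2**24, but 2**24 + 1 already rounds away (the u53/fp64 case is analogous at 2**53):

```python
import struct

def round_f32(x):
    # round a Python float (binary64) to the nearest binary32 value
    return struct.unpack('<f', struct.pack('<f', x))[0]

assert round_f32(float(2**24 - 1)) == 2**24 - 1  # largest odd u24: exact
assert round_f32(float(2**24)) == 2**24          # power of two: exact
assert round_f32(float(2**24 + 1)) == 2**24      # rounds: not representable
```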

The algorithm then performs a floating-point subtraction, followed by fmin/fmax to restore the ordering invariant.
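The whole update can be emulated in Python by operating on binary32 bit patterns. The following sketch (helper names are my own) performs the exponent transplant, the subtraction, and the fmin/fmax step for u24 inputs; the subtraction is exact because the two operands share a binade:

```python
import struct

EXP32 = 0xFF800000  # sign + exponent bits of an IEEE-754 binary32
MAN32 = 0x007FFFFF  # mantissa bits

def f32_bits(x):
    # bit pattern of x as a binary32
    return struct.unpack('<I', struct.pack('<f', x))[0]

def bits_f32(u):
    # binary32 value with bit pattern u
    return struct.unpack('<f', struct.pack('<I', u))[0]

def fp32_gcd(a, b):
    # inputs must be exactly representable in fp32, i.e. u24
    assert 0 <= a < 1 << 24 and 0 <= b < 1 << 24
    a, b = float(max(a, b)), float(min(a, b))
    while b != 0.0:
        # transplant a's exponent onto b's mantissa (the lop3 step)
        t = bits_f32((f32_bits(a) & EXP32) | (f32_bits(b) & MAN32))
        d = abs(a - t)               # exact: a and t are in the same binade
        a, b = max(b, d), min(b, d)  # fmax / fmin restore the ordering
    return int(a)
```

Since t equals b scaled by a power of two, each step preserves the gcd, exactly as in the integer single-shift variant.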

On NVIDIA Ampere, the fp32/u24 update lowers to the following four-instruction SASS sequence. The subtraction appears as FADD with a negated source operand, and the absolute value is folded into the FMNMX source modifiers:

LOP3.LUT  R2, R8, 0x7fffff, R7, ... ; INT: aligned_b = {a.Exp, b.Man}
FADD      R3, -R2, R7               ; FP:  d = a - aligned_b
FMNMX     R7, |R3|, R8, !PT         ; FP:  a = max(b, abs(d))
FMNMX     R8, |R3|, R8, PT          ; FP:  b = min(b, abs(d))

For comparison, this is the corresponding six-instruction Stein update sequence:

BREV      R2, R4                      ; INT: reverse bits of d
FLO.U32.SH R3, R2                     ; INT: c = ctz(d)
SHF.R.U32.HI R6, RZ, R3, R4           ; INT: v = d >> c
IMNMX.U32 R8, R6.reuse, R7.reuse, !PT ; INT: hi = max(v, b)
IMNMX.U32 R7, R6, R7, PT              ; INT: b  = min(v, b)
IMAD.IADD R4, R8, 0x1, -R7            ; INT: d  = hi - b

The immediate benefit is that the floating-point formulation requires fewer instructions. On NVIDIA architectures where integer and floating-point instructions can issue on different execution resources, it also improves utilisation by splitting the work across both kinds of pipeline; where FP32 throughput exceeds INT32 throughput, the advantage grows further.

To support 32-bit and 64-bit inputs, we use hybrid variants: strip common powers of two as in Stein’s algorithm, run 8 Stein frontend iterations for u32 (or 11 for u64), and then switch to the floating-point core once the operands are within the exact u24 or u53 range.
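As a rough illustration of the frontend (a sketch with my own helper names; for clarity it loops until the operands fit in u24 rather than running the fixed iteration count used on the GPU):

```python
def ctz(x):
    # count trailing zeros of a positive integer
    return (x & -x).bit_length() - 1

def stein_frontend(a, b, target_bits=24):
    """Reduce (a, b) with Stein steps until both operands fit in
    target_bits bits, returning (shift, a, b) such that
    gcd(original a, original b) == gcd(a, b) << shift."""
    if a == 0 or b == 0:
        return 0, max(a, b), 0
    shift = ctz(a | b)   # common power of two, part of the gcd
    a >>= shift
    b >>= shift
    a >>= ctz(a)         # lone factors of two don't affect the gcd,
    b >>= ctz(b)         # since the other operand is now odd
    if a < b:
        a, b = b, a
    while max(a, b).bit_length() > target_bits:
        d = a - b        # even, because a and b are both odd
        if d == 0:
            break        # a == b: the gcd itself exceeds target_bits
        d >>= ctz(d)     # back to odd
        a, b = max(b, d), min(b, d)
    return shift, a, b
```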

Benchmarks

Modal benchmark summary for count=1048576, rounds=1024, launches=20, block_size=256:

GPU                     u24 speedup   u32 speedup   u53 speedup   u64 speedup
NVIDIA Tesla T4         1.33x         1.24x         0.13x         0.17x
NVIDIA A100-SXM4-80GB   1.48x         1.39x         1.43x         1.36x
NVIDIA A100 80GB PCIe   1.67x         1.56x         1.44x         1.38x
NVIDIA H100 80GB HBM3   1.71x         1.50x         1.58x         1.40x
NVIDIA B200             1.83x         1.44x         1.55x         1.40x

Consumer GeForce cards such as the RTX 3090 and RTX 4090 have severely reduced fp64 throughput (typically 1/64 of fp32). Likewise, T4 does not benefit for u53 or u64, since its fp64 throughput is also severely rate-limited.

Notes