Uniform floating point from unsigned integer
How can I turn uniform samples of unsigned n-bit (n=16/32/64) integers (over the whole range of integers) into uniform samples of n-bit IEEE floats in [0, 1]? By uniform, I mean the continuous uniform distribution. All bounds are inclusive.
It needs to be accurate enough for stats/ML.
I'm using the StableHLO API, but an answer in e.g. C with equivalent maths and bit ops would be trivial to translate. Note this API can bitcast to float (i.e. reinterpret the raw bits as a float) and can also convert to float using the (approximate) numerical value.
NB. I assume I can scale [0, 1] to [a, b] as samples * (a - b) + b since that's what XLA does (though I will have to be careful about overflow).
1 answer
It's not really clear what you are asking, since your concept of "uniform" is poorly defined. I'll take it to mean that you want the difference between any two resulting floating point values to be proportional to their difference in the original integer space. If that's what you mean (and it's not clear it is), it would have been helpful to say so in plain English.
The reason this is hard to fathom is that it makes little sense to convert to floating point when the existing integer representation already gives you exactly what you're looking for. Floating point has its purposes, like a minimum guaranteed resolution over a wide dynamic range, but what you are asking for isn't one of them.
The problem with floating point is that the difference between adjacent values changes with the exponent. To know when exponents change and what the difference between adjacent values is for each exponent, you have to know the floating point representation.
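For instance, a small C sketch using the standard nextafterf function makes the changing spacing visible:

```c
#include <math.h>
#include <stdio.h>

/* Print the gap between a 32 bit float and the next larger float.
   The gap doubles every time the exponent increases by one. */
int main(void)
{
    printf("spacing just above 0.5: %g\n", nextafterf(0.5f, 1.0f) - 0.5f);  /* 2^-24 */
    printf("spacing just above 1.0: %g\n", nextafterf(1.0f, 2.0f) - 1.0f);  /* 2^-23 */
    printf("spacing just above 2.0: %g\n", nextafterf(2.0f, 4.0f) - 2.0f);  /* 2^-22 */
    return 0;
}
```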
For example, the common IEEE "32 bit" floating point format has a 1.23 fixed point number scaled by a power of 2 specified by the exponent. Since the high bit (the integer part) of the 1.23 fixed point number is always 1, it is not stored. It is referred to as the "vestigial one" (also called the implicit or hidden bit). Therefore, this format can have up to 2^23 distinct values in any one "uniform" range.
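As a concrete illustration (a sketch assuming the standard binary32 layout of 1 sign bit, 8 exponent bits, and 23 stored fraction bits), the fields can be pulled apart like this:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Split an IEEE 754 single-precision value into its stored fields. */
static void show_fields(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);              /* reinterpret the bits */
    unsigned sign     = bits >> 31;
    unsigned exponent = (bits >> 23) & 0xFF;     /* biased by 127 */
    unsigned fraction = bits & 0x7FFFFF;         /* leading 1 is implied, not stored */
    printf("%g -> sign %u, exponent %u, fraction 0x%06X\n",
           (double)f, sign, exponent, fraction);
}

int main(void)
{
    show_fields(1.0f);   /* exponent 127, fraction 0 */
    show_fields(1.5f);   /* exponent 127, fraction 0x400000 */
    return 0;
}
```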
To guarantee a uniform range in the IEEE 32 bit format, scale the entire possible input range so that it fits within a single exponent. For example, from binary 1.00000000000000000000000 to 1.11111111111111111111111 (23 fraction bits). You are adjusting the values to fit in what is effectively a 23 bit integer space within the 32 bit FP format.
The above mapping works directly without loss for all input integers up to 23 bits wide. For wider integers, you can either accept some loss, or use a wider floating point representation. Sticking with 32 bit FP, you essentially discard all but the high 23 bits of each input integer. The result is still "uniform", but there will be multiple input values that result in each possible output value.
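A minimal C sketch of that mapping, assuming binary32 (the function name is just for illustration): keep the high 23 bits of the random integer, place them in the fraction field of 1.0, bitcast, and subtract 1.0 to move from [1, 2) to [0, 1):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Map a uniform 32 bit integer to a uniform float in [0, 1) with step 2^-23. */
float uniform_from_u32(uint32_t x)
{
    uint32_t bits = 0x3F800000u | (x >> 9);  /* exponent of 1.0, top 23 bits as fraction */
    float f;
    memcpy(&f, &bits, sizeof f);             /* f is uniform in [1, 2) */
    return f - 1.0f;                         /* exact subtraction; result in [0, 1) */
}

int main(void)
{
    printf("%f\n", uniform_from_u32(0));           /* 0.0 */
    printf("%f\n", uniform_from_u32(UINT32_MAX));  /* 1.0 - 2^-23 */
    return 0;
}
```

Note this produces [0, 1) rather than the closed [0, 1] asked about, and the same shift, or, bitcast, and subtract steps should translate directly to StableHLO's bit operations.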
The common IEEE 64 bit format (often called "double precision") uses a 1.52 fixed point number that the exponent is applied to. That format can represent up to 52 bit input integers without loss.
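The same sketch for binary64, keeping the high 52 bits (again assuming the standard layout):

```c
#include <stdint.h>
#include <string.h>

/* Map a uniform 64 bit integer to a uniform double in [0, 1) with step 2^-52. */
double uniform_from_u64(uint64_t x)
{
    uint64_t bits = 0x3FF0000000000000ull | (x >> 12);  /* exponent of 1.0, top 52 bits */
    double d;
    memcpy(&d, &bits, sizeof d);                         /* d is uniform in [1, 2) */
    return d - 1.0;                                      /* uniform in [0, 1) */
}
```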
Again though, code that does these conversions will be specific to the output floating point format. The world now largely uses either the 32 or 64 bit formats described above on mainstream general computing platforms, with an 80 bit format sometimes used internally for intermediate calculations. However, largely is not the same as always, and there are many processors outside general-purpose computing that do not have a prescribed floating point format.
