Welcome to Software Development on Codidact!


Uniform floating point from unsigned integer


How can I turn uniform samples of unsigned n-bit (n=16/32/64) integers (over the whole range of integers) into uniform samples of n-bit IEEE floats in [0, 1]? By uniform, I mean the continuous uniform distribution. All bounds are inclusive.

It needs to be accurate enough for stats/ML.

I'm using the StableHLO API, but an answer in e.g. C with equivalent maths and bit ops would be trivial to translate. Note that this API can bitcast to float (i.e. reinterpret the bits directly as a float) and can also convert to float by (approximate) numerical value.

NB. I assume I can scale [0, 1] to [a, b] as samples * (a - b) + b since that's what XLA does (though I will have to be careful about overflow).


1 answer


It's not really clear what you are asking, since your concept of "uniform" is poorly defined. I'll take it to mean that you want the difference between any two resulting floating-point values to be proportional to the difference between the corresponding integers. If that's what you mean (and it's not clear it is), it would have been helpful to simply say so in plain English.

The reason this is hard to fathom is that it makes little sense to convert to floating point when the existing integer representation gives you exactly what you're looking for. Floating point has its purposes, such as a guaranteed minimum resolution over a wide dynamic range, but what you are asking for isn't one of them.

The problem with floating point is that the spacing between adjacent values changes with the exponent. To know where the exponent changes, and what the spacing between adjacent values is for each exponent, you have to know the floating-point representation.

For example, the common IEEE "32 bit" floating-point format has a 1.23 fixed-point mantissa (1 integer bit, 23 fraction bits) scaled by a power of 2 given by the exponent. Since the high bit (the integer part) of the 1.23 fixed-point number is always 1, it is not stored; it is sometimes called the "vestigial" (or implicit, or hidden) one. Therefore, this format can have up to 2^23 distinct values in any one "uniform" range.

To guarantee a uniform range in the IEEE 32-bit format, scale the entire possible input range so that it fits within a single exponent. For example, 1.00000000000000000000000 to 1.11111111111111111111111 in binary. You are adjusting the values to fit in what is effectively a 23-bit integer space within the 32-bit FP format.

The above mapping works directly without loss for all input integers up to 23 bits wide. For wider integers, you can either accept some loss, or use a wider floating point representation. Sticking with 32 bit FP, you essentially discard all but the high 23 bits of each input integer. The result is still "uniform", but there will be multiple input values that result in each possible output value.

The common IEEE 64-bit format (often called "double precision") uses a 1.52 fixed-point mantissa (52 fraction bits) that the exponent is applied to. Within a single exponent, that format can represent up to 52-bit input integers without loss.

Again though, code that does these conversions will be specific to the output floating-point format. On mainstream general-purpose computing platforms, the world now largely uses either the 32 or 64 bit formats described above, with an 80-bit format sometimes used internally for intermediate calculations. However, "largely" is not "always", and many processors outside general-purpose computing have no prescribed floating-point format.
