I haven’t read any of the papers or studied the Kahan summation method, but it seems like there should be a fairly straightforward, somewhat parallelizable algorithm for computing the most accurate possible sum.
I will assume a binary floating point representation such as IEEE-754, which consists of a fixed-width bit field containing the mantissa and a fixed-width bit field containing the exponent. For the purpose of illustration I will assume that all numbers are positive, but I don’t think extending this to handle both positive and negative numbers presents any particular challenge.
Let’s consider each number to be summed as having a high-set-bit and a low-set-bit, each with an associated place value. Both of these bits are set, and the difference in their place values cannot exceed the width of the mantissa in the representation. Between the high-set-bit and the low-set-bit of a number there is a collection of other bits, each zero or one, as indicated by the mantissa.
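To make the high-set-bit and low-set-bit idea concrete, here is a minimal sketch in Python, assuming IEEE-754 doubles (53-bit significand) and positive finite inputs. The function name `bit_span` is made up for illustration.

```python
import math

def bit_span(x):
    """Return (mantissa, low_place, high_place) for a positive finite float.

    x == mantissa * 2**low_place exactly, mantissa is odd, and high_place
    is the place value (power of two) of the most significant set bit.
    """
    if not (x > 0.0 and math.isfinite(x)):
        raise ValueError("expected a positive finite float")
    frac, exp = math.frexp(x)              # x == frac * 2**exp, 0.5 <= frac < 1
    mantissa = int(math.ldexp(frac, 53))   # exact: frac has at most 53 significant bits
    low_place = exp - 53
    # Strip trailing zero bits so bit 0 of mantissa is the low-set-bit.
    shift = (mantissa & -mantissa).bit_length() - 1
    mantissa >>= shift
    low_place += shift
    high_place = low_place + mantissa.bit_length() - 1
    return mantissa, low_place, high_place
```

For example, 6.0 is binary 110, so `bit_span(6.0)` returns `(3, 1, 2)`: the low-set-bit is at place value 2**1 and the high-set-bit at 2**2.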
Across the array of all numbers to be summed, determine the highest place value of any high-set-bit and the lowest place value of any low-set-bit. This could be performed in parallel using max and min reductions.
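A sequential stand-in for those two reductions might look like the sketch below. On a GPU the `min` and `max` calls would be parallel reductions; here they are ordinary Python built-ins. The name `place_range` is hypothetical, and positive finite doubles in a non-empty list are assumed.

```python
def place_range(values):
    """Return (lowest low-set-bit place, highest high-set-bit place)."""
    lows = []
    highs = []
    for x in values:
        # x == num / den exactly, in lowest terms; den is a power of two.
        num, den = x.as_integer_ratio()
        den_exp = den.bit_length() - 1                     # den == 2**den_exp
        trailing_zeros = (num & -num).bit_length() - 1     # zero whenever den > 1
        lows.append(trailing_zeros - den_exp)
        highs.append(num.bit_length() - 1 - den_exp)
    # Stand-ins for the min and max reductions.
    return min(lows), max(highs)
```

For example, `place_range([6.0, 0.75, 1024.0])` returns `(-2, 10)`.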
Starting from the lowest place value of the low-set-bit, up to the highest place value of the high-set-bit, do:
- identify which values have a set bit at that place value. The result of this step is a zero or one, for each thread/number (this is trivially parallelizable)
- perform a sum reduction on the output from the previous step (parallelizable)
- add the carry from the previous iteration, if any (that is, the previous iteration's total shifted right by one bit)
- if the total from the previous step is even, assign a zero to the bit of the final answer at the place value for this loop iteration; otherwise assign a one
- repeat for the next iteration/place-value/bit position
Although I have described the loop as running from the lowest set bit position to the highest set bit position, it must in fact continue past the highest bit position until the carry (the previous iteration's total, shifted right by 1) is zero.
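The loop, including the run-on until the carry is exhausted, can be sketched as follows. This is a sequential Python illustration under the same assumptions (positive finite doubles, non-empty input), not a GPU implementation. The names `decompose` and `bitwise_exact_sum` are invented here, and Python's arbitrary-precision integers stand in for the long result word.

```python
def decompose(x):
    """Return (odd mantissa, low place) with x == mantissa * 2**low exactly."""
    num, den = x.as_integer_ratio()
    trailing_zeros = (num & -num).bit_length() - 1
    return num >> trailing_zeros, trailing_zeros - (den.bit_length() - 1)

def bitwise_exact_sum(values):
    """Return (bits, lowest) such that the exact sum equals bits * 2**lowest."""
    parts = [decompose(x) for x in values]
    lowest = min(low for _, low in parts)                          # min reduction
    highest = max(low + m.bit_length() - 1 for m, low in parts)    # max reduction
    # Re-express every number as an integer counted from the lowest place value.
    scaled = [m << (low - lowest) for m, low in parts]

    result = 0
    carry = 0
    pos = 0
    top = highest - lowest
    # Run to the highest place value, then keep going until the carry is zero.
    while pos <= top or carry:
        count = sum((s >> pos) & 1 for s in scaled)   # per-number bit test + sum reduction
        total = count + carry                         # add carry from previous iteration
        result |= (total & 1) << pos                  # odd total -> this answer bit is one
        carry = total >> 1                            # shifted right by one for next round
        pos += 1
    return result, lowest
```

For example, `bitwise_exact_sum([1.0, 2.0, 3.0])` returns `(6, 0)`, and the third loop iteration runs only because a carry is still pending.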
When the final result is thus assembled, it may be a very long binary word. That word will need to be truncated/rounded to fit in the desired result representation.
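One way the final rounding step could look is sketched below, using round-to-nearest with ties to even and targeting an IEEE-754 double. The name `round_to_double` and the `(bits, lowest)` convention (value equals `bits * 2**lowest`) are assumptions made for illustration. The sketch also assumes the result fits in the double range and that `lowest` is at least -1074, which holds when the inputs are doubles.

```python
import math

def round_to_double(bits, lowest):
    """Round the exact value bits * 2**lowest to the nearest double, ties to even."""
    if bits == 0:
        return 0.0
    extra = bits.bit_length() - 53          # bits beyond the 53-bit significand
    if extra > 0:
        kept = bits >> extra                # the 53 most significant bits
        dropped = bits & ((1 << extra) - 1) # everything that will be discarded
        half = 1 << (extra - 1)
        if dropped > half or (dropped == half and kept & 1):
            kept += 1                       # may reach 2**53, still exactly representable
        bits, lowest = kept, lowest + extra
    # bits now fits in a double exactly, so this scaling does not round again.
    return math.ldexp(float(bits), lowest)
```

For example, `round_to_double(2**53 + 1, 0)` is an exact tie and rounds down to the even neighbour `9007199254740992.0`, while `round_to_double(2**53 + 3, 0)` rounds up to `9007199254740996.0`.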
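The above arithmetic is essentially all integer arithmetic. For N numbers, the per-iteration total is the count of set bits (at most N) plus a carry that is always less than N, so it stays below 2N. With an appropriate choice of intermediate integer type (e.g. unsigned long long) it should therefore be able to handle effectively “arbitrary” array sizes.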
The above method could certainly be optimized for performance. For example, sorting the numbers from largest to smallest would allow entire threadblocks to retire early, or to traverse only a fixed subset of the place-value range, as determined by their subset of numbers.
It’s not obvious to me that any method could give a more accurate representation of the true result than this. Furthermore, there should be no variability in the result from run to run or from machine to machine.