bpo-31773: _PyTime_GetPerfCounter() uses _PyTime_t #3983
Conversation
* Rewrite win_perf_counter() to only use integers internally.
* Add _PyTime_MulDiv() which computes "ticks * SEC_TO_NS / perf_freq" in two parts (seconds and nanoseconds) to prevent integer overflow.
* Check the clock frequency at initialization for integer overflow.
* Also enhance pymonotonic() to reduce the precision loss on macOS (mach_absolute_time() clock).

I found a solution to prevent integer overflow with a new _PyTime_MulDiv() which computes the result in two parts. This PR should now avoid any precision loss in _PyTime_GetPerfCounter() on Windows (QueryPerformanceCounter) and Linux (clock_gettime). Only the final conversion from _PyTime_t (int) to double in time.perf_counter() can lose precision, but only after 104 days.
Add another sanity check: timebase.denom < _PyTime_t
methane
left a comment
I don't know much about mach_absolute_time().
But the code looks good and the API looks far better than the current one.
    #if defined(MS_WINDOWS) || defined(__APPLE__)
    Py_LOCAL_INLINE(_PyTime_t)
    _PyTime_MulDiv(_PyTime_t ticks, _PyTime_t mul, _PyTime_t div)
_PyTime_t is a time duration in nanoseconds. Neither of the arguments is a _PyTime_t. I would use long long.
I prefer to use the _PyTime_t type for the 3 parameters so that I don't have to worry about casting or handling integer overflow.
Using the _PyTime_t type doesn't mean that the value must be a duration. _PyTime_t is just a signed integer ;-)
Okay, I see that _PyTime_t is already used not only for time in nanoseconds.
Python/pytime.c
Outdated
    Check also that timebase.numer and timebase.denom can be casted to
    _PyTime_t. */
    if (timebase.numer > _PyTime_MAX / timebase.denom
        || timebase.denom > _PyTime_MAX) {
timebase.denom > _PyTime_MAX is redundant. If timebase.denom > _PyTime_MAX then _PyTime_MAX / timebase.denom == 0.
While it's correct, I prefer to keep the test to be very explicit. The test is only run once, it doesn't matter.
Maybe then add the explicit test timebase.numer > _PyTime_MAX?
The additional timebase.denom > _PyTime_MAX doesn't add clarity for me. It rather makes me puzzle and spend a minute figuring out why this test is added explicitly.
Maybe then add the explicit test timebase.numer > _PyTime_MAX?
Done
Python/pytime.c
Outdated
    compute the number of seconds ("integer part"), and then the remaining
    number of nanoseconds ("floating part").
    The caller has to check that ticks * mul, with ticks < div, cannot
We can support an even larger mul if we represent the fraction mul / div as int + mul2 / div, where int = round(mul/div). But it seems there is no need to do this now.
It may be useful tomorrow if we need to support other weird clocks, but I confirm that I don't see the need for that right now.
CPU frequencies seem to be stuck at 4 GHz for physical reasons, and the maximum mul accepted by win_perf_counter() is (2^63 - 1) // 10^9, which is larger than 9 GHz. Moreover, recent Windows versions now use a fixed frequency of 10^7 Hz (10 MHz): a resolution of 100 ns. QueryPerformanceFrequency() doesn't seem to be related to the CPU frequency anymore.
ncoghlan
left a comment
I like this version. Some readability suggestions inline for the revised integer arithmetic, but I don't consider them blockers.
Python/pytime.c
Outdated
    compute the number of seconds ("integer part"), and then the remaining
    number of nanoseconds ("floating part").
    The caller has to check that ticks * mul, with ticks < div, cannot
This constraint would be more clearly expressed as:
The caller must ensure that "div * mul" cannot overflow, as this ensures that neither 'ticks / div * mul' (the seconds calculation) nor '(ticks % div) * mul / div' (the nanosecond calculation) will overflow.
I rephrased (and simplified) this comment.
Python/pytime.c
Outdated
    ticks %= div;
    nsec = ticks * mul;
    nsec /= div;
    return sec * mul + nsec;
I'm wondering if it might be clearer to spell out the full calculations, and then rely on the compiler being smart enough to figure out whether or not there are any subcalculations it can re-use:
    sec = ticks / div * mul;          /* Avoid overflow in the integer part */
    nsec = (ticks % div) * mul / div; /* Avoid loss of precision in the fractional part */
    return sec + nsec;
If you write it out that way, then the suggested comment above could be simplified to:
The caller must ensure that "div * mul" cannot overflow, as this ensures that neither the seconds calculation nor the nanoseconds calculation will overflow.
I prefer to leave the code as it is. I prefer to write explicit code and don't rely on the computer to be smart.
I just hope that some smart compilers are able to compile "ticks / div" and "ticks % div" as a single CPU instruction ;-) (At least, Intel AMD64 is able to do that.)
Nick's shorter variant looks clearer to me too.
But the name sec looks misleading. This isn't a time in seconds, and if it were, adding seconds and nanoseconds would be wrong.
That's a good point about the names: sec_part and nsec_part would be more accurate (since they're both in units of nanoseconds, it's just that the first one has a zero nanosecond component, while the latter has a zero seconds component).
Regarding "what are compilers and CPUs good at optimising?", the thing you really, really, really want to avoid is pipeline stalls as they wait to touch the same piece of memory multiple times without enough other work in between. Setting out the calculations as distinct operations makes it clear to the compiler that the two calculations are entirely independent of each other, and the only time they need to be synchronised is in the final addition.
So that's why I favour the approach of making the algorithmic intent as clear as possible, for the benefit of both human readers and compilers, and in this case, that means making the two expressions independent of each other rather than requiring the reader to track the local variable mutation and reverse engineer the original expressions from that.
Hand-optimisation might make sense if a profiler indicated this was a hotspot, but even then, I'd be surprised if switching from "two independent subexpressions" to "touch the same local variable multiple times" actually sped things up.
I renamed variables to intpart and remaining since I'm trying to not hardcode the _PyTime_t unit in pytime.c. The API is "unit agnostic".
I also exchanged the operands of the overflow check on mach_timebase_info() to keep the explicit "timebase.denom > _PyTime_MAX" check, even if it's technically redundant.
serhiy-storchaka
left a comment
Besides a few style nitpicks, in general the PR LGTM.
sec/nsec = intpart/remaining
Hum, it's not convenient to work on this PR since the modified code is only compiled on macOS and Windows: I have to use 3 computers to work on it. I fixed the last warnings on macOS and Windows. I tested time.time() manually and ran test_time on macOS and Windows. Everything seems fine (to me). I'm now waiting for Travis CI and AppVeyor before merging this PR.
https://bugs.python.org/issue31773