-
-
Notifications
You must be signed in to change notification settings - Fork 14.7k
performance of saturating_mul can be improved by removing branches #65309
Copy link
Copy link
Closed
Labels
A-LLVMArea: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.C-bugCategory: This is a bug.Category: This is a bug.I-slowIssue: Problems and improvements with respect to performance of generated code.Issue: Problems and improvements with respect to performance of generated code.T-compilerRelevant to the compiler team, which will review and decide on the PR/issue.Relevant to the compiler team, which will review and decide on the PR/issue.
Metadata
Metadata
Assignees
Labels
A-LLVMArea: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.C-bugCategory: This is a bug.Category: This is a bug.I-slowIssue: Problems and improvements with respect to performance of generated code.Issue: Problems and improvements with respect to performance of generated code.T-compilerRelevant to the compiler team, which will review and decide on the PR/issue.Relevant to the compiler team, which will review and decide on the PR/issue.
Type
Fields
Give feedbackNo fields configured for issues without a type.
While playing with
saturating_multo see if it can be made branchless and thus a const fn (sorry @RalfJung ;) ), I found that the performance of signedsaturating_mulcan actually be improved.I tested two implementations:
The
cond_if_elsefunction acts kind of like C's ternary operator, including its constness, but works only for integers.The second case seems to have a reciprocal throughput better by a factor of about 1.5 according to llvm-mca: https://godbolt.org/z/6PnCwB
For unsigned, the second case can become:
In this case, there is no performance improvement and LLVM reduces this to the same IR, so the only advantage here would be constness. https://godbolt.org/z/0Fs0AD
I found some similar improvements for
saturating_sub, even though currently the implementation usesintrinsics::saturating_sub; I found it a bit weird that the intrinsic performs worse. The reciprocal throughput can be improved by a factor of 1.13: https://godbolt.org/z/tcZgeEMy concern is that since the LLVM IR is different, even though llvm-mca indicates the difference is an improvement, there might be some case I don't know about where the performance regresses. Maybe there are other platforms where the saturating intrinsics result better performance? I think @nikic has a better understanding of these concerns.