Fast cuDNN BatchNorm NHWC kernels support #20615
Conversation
Hey @mk-61, thanks for submitting the PR.
CI supported jobs: [sanity, windows-cpu, miscellaneous, website, windows-gpu, unix-gpu, centos-cpu, unix-cpu, clang, edge, centos-gpu]
Force-pushed from 4b30f4b to 44c697f
@mxnet-bot run ci [centos-gpu, unix-cpu, website]
Jenkins CI successfully triggered: [centos-gpu, website, unix-cpu]
@mxnet-bot run ci [unix-cpu]
Jenkins CI successfully triggered: [unix-cpu]
ptrendx left a comment
LGTM. Did you also check the performance of the NCHW case?
You mean compared to the functions without the "Ex" suffix? No, I haven't; I can if you'd like me to. Although I think the logic behind the "Ex" functions is "make it faster in some cases and fall back to the previous implementation otherwise". Specifically, I expected (and verified) a speedup in FP16/NHWC, and assumed it shouldn't regress in other cases, unless there's a bug, which cuDNN would need to fix.
Yeah, it would be good to check that NCHW does not regress.
Verified on RN50 / Volta: no regressions, and the same kernels are used, as far as nsys stats show.
@mxnet-bot run ci [unix-cpu]
Jenkins CI successfully triggered: [unix-cpu]
Thanks for the contribution!
* Fast cuDNN NHWC kernels support
* Fix lint errors
* Get rid of a warning
* Remove CuDNNBatchNorm from AMP lists

Co-authored-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Description
This PR makes the cuDNN-backed BatchNorm operator use newer API calls (cudnnBatchNormalizationForwardTrainingEx / cudnnBatchNormalizationBackwardEx), which bring a significant speedup in some cases (fp16 NHWC / NDHWC layouts).
I also refactored and simplified the code a bit.
I tested the fp16 NHWC speedup with a ResNet50 model on my Layout Management feature branch (not upstreamed yet).
Correctness should be covered by the existing tests.
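For reference, here is a minimal sketch of what a forward-training call through the "Ex" API looks like. The cuDNN entry points, enums, and argument order are the real cuDNN API; everything else (the function and parameter names, the `CHECK_CUDNN` macro, the raw `cudaMalloc` scratch allocation) is illustrative only and is not this PR's actual code, which draws workspace from MXNet's temp-space pool instead.

```cpp
// Sketch only: shows the cudnnBatchNormalizationForwardTrainingEx call path
// for an fp16 NHWC tensor. Names and memory handling are hypothetical.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cudnn.h>

#define CHECK_CUDNN(expr)                                                     \
  do {                                                                        \
    cudnnStatus_t status = (expr);                                            \
    if (status != CUDNN_STATUS_SUCCESS) {                                     \
      std::fprintf(stderr, "cuDNN error: %s\n", cudnnGetErrorString(status)); \
      std::abort();                                                           \
    }                                                                         \
  } while (0)

// The fast kernels are selected for half-precision NHWC input with
// CUDNN_BATCHNORM_SPATIAL_PERSISTENT mode; in other configurations cuDNN is
// expected to fall back to its previous implementations (per the discussion
// above).
void BatchNormForwardTrainingEx(cudnnHandle_t handle, int n, int c, int h, int w,
                                const void* x, void* y,
                                const float* gamma, const float* bias,
                                float* running_mean, float* running_var,
                                float* saved_mean, float* saved_inv_var,
                                double exp_avg_factor, double epsilon) {
  const cudnnBatchNormMode_t mode = CUDNN_BATCHNORM_SPATIAL_PERSISTENT;
  const cudnnBatchNormOps_t ops = CUDNN_BATCHNORM_OPS_BN;  // plain BN, no fused add/activation

  cudnnTensorDescriptor_t x_desc, bn_desc;
  CHECK_CUDNN(cudnnCreateTensorDescriptor(&x_desc));
  CHECK_CUDNN(cudnnSetTensor4dDescriptor(x_desc, CUDNN_TENSOR_NHWC,
                                         CUDNN_DATA_HALF, n, c, h, w));
  CHECK_CUDNN(cudnnCreateTensorDescriptor(&bn_desc));
  CHECK_CUDNN(cudnnDeriveBNTensorDescriptor(bn_desc, x_desc, mode));

  // Unlike the older entry points, the Ex calls may require workspace and
  // reserve space, queried up front.
  size_t workspace_bytes = 0, reserve_bytes = 0;
  CHECK_CUDNN(cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize(
      handle, mode, ops, x_desc, /*zDesc=*/nullptr, /*yDesc=*/x_desc,
      bn_desc, /*activationDesc=*/nullptr, &workspace_bytes));
  CHECK_CUDNN(cudnnGetBatchNormalizationTrainingExReserveSpaceSize(
      handle, mode, ops, /*activationDesc=*/nullptr, x_desc, &reserve_bytes));
  void* workspace = nullptr;
  void* reserve = nullptr;
  if (workspace_bytes != 0) cudaMalloc(&workspace, workspace_bytes);
  if (reserve_bytes != 0) cudaMalloc(&reserve, reserve_bytes);

  const float alpha = 1.0f, beta = 0.0f;  // y = alpha * BN(x) + beta * y
  CHECK_CUDNN(cudnnBatchNormalizationForwardTrainingEx(
      handle, mode, ops, &alpha, &beta,
      x_desc, x, /*zDesc=*/nullptr, /*zData=*/nullptr, x_desc, y,
      bn_desc, gamma, bias,
      exp_avg_factor, running_mean, running_var, epsilon,
      saved_mean, saved_inv_var,  // cached statistics for the backward pass
      /*activationDesc=*/nullptr,
      workspace, workspace_bytes, reserve, reserve_bytes));

  // NOTE: in real training the reserve space must stay alive until
  // cudnnBatchNormalizationBackwardEx consumes it; freed here only because
  // this sketch stops at the forward pass.
  cudaFree(workspace);
  cudaFree(reserve);
  CHECK_CUDNN(cudnnDestroyTensorDescriptor(bn_desc));
  CHECK_CUDNN(cudnnDestroyTensorDescriptor(x_desc));
}
```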
Checklist
Essentials
Changes
@DickJC123