Fast cuDNN BatchNorm NHWC kernels support #20615
Conversation
Hey @mk-61, thanks for submitting the PR.
CI supported jobs: [sanity, windows-cpu, miscellaneous, website, windows-gpu, unix-gpu, centos-cpu, unix-cpu, clang, edge, centos-gpu]
Force-pushed from 4b30f4b to 44c697f
@mxnet-bot run ci [centos-gpu, unix-cpu, website]
Jenkins CI successfully triggered: [centos-gpu, website, unix-cpu]
@mxnet-bot run ci [unix-cpu]
Jenkins CI successfully triggered: [unix-cpu]
ptrendx left a comment
LGTM. Did you also check the performance of the NCHW case?
You mean compared to the functions without the "Ex" suffix? No, I haven't; I can if you'd like me to. Although I think the logic behind the "Ex" functions is "make it faster in some cases and fall back to the previous implementation otherwise". Specifically, I expected (and verified) a speedup in FP16/NHWC, and assumed it shouldn't regress in other cases, unless there's a bug, which cuDNN would need to fix.
Yeah, it would be good to check that NCHW does not regress.
Verified on RN50 / Volta: no regressions, and the same kernels are used, as far as nsys stats show.
@mxnet-bot run ci [unix-cpu]
Jenkins CI successfully triggered: [unix-cpu]
Thanks for the contribution!
* Fast cuDNN NHWC kernels support
* Fix lint errors
* Get rid of a warning
* Remove CuDNNBatchNorm from AMP lists

Co-authored-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
Description
This PR makes the cuDNN-backed BatchNorm operator use newer API calls (cudnnBatchNormalizationForwardTrainingEx / cudnnBatchNormalizationBackwardEx), which bring a significant speedup in some cases (fp16 NHWC / NDHWC layouts).
I also refactored and simplified the code a bit.
I tested the fp16 NHWC speedup with a ResNet50 model on my Layout Management feature branch (not upstreamed yet).
Correctness should be covered by the existing tests.
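For reference, here is a minimal sketch of what a forward-training call through the "Ex" API looks like. The cuDNN entry points, enums, and argument order are the real cuDNN API; everything else (the function and parameter names, the `CHECK_CUDNN` macro, the raw `cudaMalloc` scratch allocation) is illustrative only and is not this PR's actual code, which draws workspace from MXNet's temp-space pool instead.

```cpp
// Sketch only: shows the cudnnBatchNormalizationForwardTrainingEx call path
// for an fp16 NHWC tensor. Names and memory handling are hypothetical.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cudnn.h>

#define CHECK_CUDNN(expr)                                                     \
  do {                                                                        \
    cudnnStatus_t status = (expr);                                            \
    if (status != CUDNN_STATUS_SUCCESS) {                                     \
      std::fprintf(stderr, "cuDNN error: %s\n", cudnnGetErrorString(status)); \
      std::abort();                                                           \
    }                                                                         \
  } while (0)

// The fast kernels are selected for half-precision NHWC input with
// CUDNN_BATCHNORM_SPATIAL_PERSISTENT mode; in other configurations cuDNN is
// expected to fall back to its previous implementations (per the discussion
// above).
void BatchNormForwardTrainingEx(cudnnHandle_t handle, int n, int c, int h, int w,
                                const void* x, void* y,
                                const float* gamma, const float* bias,
                                float* running_mean, float* running_var,
                                float* saved_mean, float* saved_inv_var,
                                double exp_avg_factor, double epsilon) {
  const cudnnBatchNormMode_t mode = CUDNN_BATCHNORM_SPATIAL_PERSISTENT;
  const cudnnBatchNormOps_t ops = CUDNN_BATCHNORM_OPS_BN;  // plain BN, no fused add/activation

  cudnnTensorDescriptor_t x_desc, bn_desc;
  CHECK_CUDNN(cudnnCreateTensorDescriptor(&x_desc));
  CHECK_CUDNN(cudnnSetTensor4dDescriptor(x_desc, CUDNN_TENSOR_NHWC,
                                         CUDNN_DATA_HALF, n, c, h, w));
  CHECK_CUDNN(cudnnCreateTensorDescriptor(&bn_desc));
  CHECK_CUDNN(cudnnDeriveBNTensorDescriptor(bn_desc, x_desc, mode));

  // Unlike the older entry points, the Ex calls may require workspace and
  // reserve space, queried up front.
  size_t workspace_bytes = 0, reserve_bytes = 0;
  CHECK_CUDNN(cudnnGetBatchNormalizationForwardTrainingExWorkspaceSize(
      handle, mode, ops, x_desc, /*zDesc=*/nullptr, /*yDesc=*/x_desc,
      bn_desc, /*activationDesc=*/nullptr, &workspace_bytes));
  CHECK_CUDNN(cudnnGetBatchNormalizationTrainingExReserveSpaceSize(
      handle, mode, ops, /*activationDesc=*/nullptr, x_desc, &reserve_bytes));
  void* workspace = nullptr;
  void* reserve = nullptr;
  if (workspace_bytes != 0) cudaMalloc(&workspace, workspace_bytes);
  if (reserve_bytes != 0) cudaMalloc(&reserve, reserve_bytes);

  const float alpha = 1.0f, beta = 0.0f;  // y = alpha * BN(x) + beta * y
  CHECK_CUDNN(cudnnBatchNormalizationForwardTrainingEx(
      handle, mode, ops, &alpha, &beta,
      x_desc, x, /*zDesc=*/nullptr, /*zData=*/nullptr, x_desc, y,
      bn_desc, gamma, bias,
      exp_avg_factor, running_mean, running_var, epsilon,
      saved_mean, saved_inv_var,  // cached statistics for the backward pass
      /*activationDesc=*/nullptr,
      workspace, workspace_bytes, reserve, reserve_bytes));

  // NOTE: in real training the reserve space must stay alive until
  // cudnnBatchNormalizationBackwardEx consumes it; freed here only because
  // this sketch stops at the forward pass.
  cudaFree(workspace);
  cudaFree(reserve);
  CHECK_CUDNN(cudnnDestroyTensorDescriptor(bn_desc));
  CHECK_CUDNN(cudnnDestroyTensorDescriptor(x_desc));
}
```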
Checklist
Essentials
Changes
@DickJC123