This repository was archived by the owner on Nov 17, 2023. It is now read-only.

Back port optimization to broadcast_axis to MXNet 1.x #18773

Merged
szha merged 2 commits into apache:v1.x from access2rohit:mx17 on Jul 28, 2020

Conversation

@access2rohit (Contributor) commented Jul 23, 2020

Description

Backport optimization to broadcast_axis for CPU #17882 and GPU #18168 to MXNet 1.7.x
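For context, `broadcast_axis` replicates an input along the given axes (which must have extent 1) out to the requested sizes. A minimal NumPy stand-in for the operator's semantics, illustrative only and not the optimized kernel being backported:

```python
import numpy as np

def broadcast_axis(data, axis, size):
    """Replicate `data` along `axis` (extent must be 1) out to `size`.
    NumPy sketch of mx.nd.broadcast_axis semantics, for illustration."""
    axes = axis if isinstance(axis, (list, tuple)) else (axis,)
    sizes = size if isinstance(size, (list, tuple)) else (size,)
    target = list(data.shape)
    for ax, sz in zip(axes, sizes):
        assert data.shape[ax] == 1, "broadcast axis must have extent 1"
        target[ax] = sz
    return np.broadcast_to(data, target).copy()

x = np.arange(3).reshape(1, 3)         # shape (1, 3)
y = broadcast_axis(x, axis=0, size=4)  # shape (4, 3): rows are copies of x
```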

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@mxnet-bot

Hey @access2rohit, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [sanity, unix-cpu, windows-cpu, centos-gpu, clang, website, miscellaneous, centos-cpu, windows-gpu, unix-gpu, edge]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@access2rohit (Contributor, Author)

@leezu @sandeep-krishnamurthy can you help merge? These are cherry-picked from master (already merged there).

@access2rohit access2rohit changed the base branch from v1.7.x to v1.x July 23, 2020 02:13
@access2rohit access2rohit changed the title Back port optimization to broadcast_axis to MXNet1.7.x Back port optimization to broadcast_axis to MXNet1.x Jul 23, 2020
@ChaiBapchya (Contributor)

Looks good since it's a cherry-pick.
For the functionality check, the broadcast_axis-specific tests in unix-cpu should verify that on CI.
For the performance check, do you mind pasting comparisons w/ and w/o this change on the 1.x branch, just to be sure these perf gains hold for 1.x as well?

@access2rohit (Contributor, Author)

Looks good since it's a cherry-pick.
For the functionality check, the broadcast_axis-specific tests in unix-cpu should verify that on CI.
For the performance check, do you mind pasting comparisons w/ and w/o this change on the 1.x branch, just to be sure these perf gains hold for 1.x as well?

@ChaiBapchya
the results are identical to the ones I posted on my original PR. These results haven't changed over time. Do you still want additional runs?

@ChaiBapchya (Contributor) commented Jul 24, 2020

That PR was against master [2.0]. This PR is for the 1.x branch [which forked off from master a while ago]. Running perf before and after this PR and getting the numbers would ensure we do our due diligence before merging.

@lanking520 lanking520 added the pr-awaiting-review (PR is waiting for code review) label Jul 26, 2020
@access2rohit (Contributor, Author)

| Code | SSD 1 epoch time (sec) | % speedup/slowdown w.r.t. Master (large tensor disabled) |
| --- | --- | --- |
| Master (large tensor disabled) | 231 | 0% speedup/slowdown |
| Master (large tensor enabled) | 378 | 63.23% slowdown |
| Master + CPU-optimized broadcast_axis (large tensor disabled) | 237 | 2.6% slowdown |
| Master + CPU-optimized broadcast_axis (large tensor enabled) | 234 | 1.3% slowdown |
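Timings like the epoch numbers above are typically collected by running the same workload before and after the change and averaging wall-clock time. A generic harness sketch, with `np.broadcast_to` as a hypothetical placeholder workload standing in for a broadcast_axis-heavy step (not the actual SSD benchmark):

```python
import time
import numpy as np

def time_op(fn, warmup=2, runs=5):
    """Average wall-clock seconds per call, after a short warmup."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

# Placeholder workload; swap in the real training step to reproduce the table.
x = np.ones((1, 1024))
baseline = time_op(lambda: np.broadcast_to(x, (1024, 1024)).copy())
print(f"avg per run: {baseline:.6f}s")
```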

access2rohit and others added 2 commits July 28, 2020 17:17
* adding separate int32_t kernel for GPU in broadcast_axis/to/like operators

* using structure instead of temp workspace to pass stride and shape

* replacing hardcoded int32_t with generic index_t

* combining CPU and GPU kernels to leverage cached stride calculation and fast access shape data in both

Co-authored-by: Rohit Kumar Srivastava <srivastava.141@buckeyemail.osu.edu>
* adding comments explaining code optimizations

* fixing broadcast_axis kernel to int32

* fixing slice_axis kernel to int32

* combining CPU and GPU implementation method signatures and cleaned up
code

* adding new broadcast_axis to np_matmul

Co-authored-by: Rohit Kumar Srivastava <srivastava.141@buckeyemail.osu.edu>
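The "cached stride calculation" the commit messages refer to amounts to computing row-major strides once and reusing them when mapping each output index back to the input element it reads, with broadcast axes contributing nothing to the input offset. A hedged Python sketch of that indexing scheme (the real kernels are C++/CUDA; names here are illustrative):

```python
def row_major_strides(shape):
    """Strides in elements for a row-major layout, computed once and reused."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def broadcast_index(out_idx, out_shape, in_shape):
    """Map a flat output index to the flat input index it reads from.
    Axes with extent 1 in in_shape are broadcast: they add 0 to the offset."""
    out_strides = row_major_strides(out_shape)
    in_strides = row_major_strides(in_shape)
    in_off = 0
    for dim in range(len(out_shape)):
        coord = (out_idx // out_strides[dim]) % out_shape[dim]
        if in_shape[dim] != 1:
            in_off += coord * in_strides[dim]
    return in_off

# Broadcasting (1, 3) -> (2, 3): both output rows read input row 0.
assert broadcast_index(4, (2, 3), (1, 3)) == 1  # out (1,1) -> in (0,1)
```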
@ChaiBapchya (Contributor)

Why is it master? Shouldn't it be v1.x?

@access2rohit (Contributor, Author) commented Jul 28, 2020

Why is it master? Shouldn't it be v1.x?

1.x is the master for all future 1.x releases.

@ChaiBapchya (Contributor) left a comment

Thanks.

@szha szha merged commit 7bef9cb into apache:v1.x Jul 28, 2020

Labels

pr-awaiting-review PR is waiting for code review


6 participants