[v.1x] Attempt to fix v1.x cd by installing new cuda compt package#19959

Zha0q1 · 2021-02-25T22:25:31Z

We updated the AMI of the restricted-mxnetlinux-gpu nodes to use the new NVIDIA 460 driver, so we also need to install the new CUDA compat package; otherwise the test stage cannot find the GPUs. (The old compat package does not work with the new NVIDIA driver.)
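For context, a CUDA compat package (`cuda-compat-*`) ships the user-mode driver libraries (such as `libcuda.so`) of a particular driver branch. The sketch below is a hypothetical illustration, not the code changed in this PR. It assumes the libraries live in `/usr/local/cuda/compat`, and it assumes the rule that they should only be put on `LD_LIBRARY_PATH` when they come from a driver at least as new as the host's kernel driver (the host version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`). Under that rule, an old compat package on a host upgraded to the 460 driver is the broken case described above. All version strings are placeholders.

```python
import os

# Assumption: where the cuda-compat-* package places its libraries.
COMPAT_DIR = "/usr/local/cuda/compat"


def parse_driver_version(text):
    """Turn a driver version string such as '460.32.03' into (460, 32, 3)."""
    return tuple(int(part) for part in text.strip().split("."))


def should_use_compat(host_driver, compat_driver):
    """Assumed rule: load the compat libraries only if they come from a driver
    at least as new as the kernel driver installed on the host."""
    return parse_driver_version(compat_driver) >= parse_driver_version(host_driver)


def runtime_ld_library_path(host_driver, compat_driver, current=""):
    """Return LD_LIBRARY_PATH with COMPAT_DIR first only when the rule allows it."""
    # Drop empty entries and any existing COMPAT_DIR entry, then re-add it if allowed.
    parts = [p for p in current.split(os.pathsep) if p and p != COMPAT_DIR]
    if should_use_compat(host_driver, compat_driver):
        parts.insert(0, COMPAT_DIR)
    return os.pathsep.join(parts)
```

For instance, `should_use_compat("460.32.03", "440.33.01")` is `False`, which models a stale compat package on an upgraded host, while matching versions put `COMPAT_DIR` at the front of the search path.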

#19948

mxnet-bot · 2021-02-25T22:25:33Z

Hey @Zha0q1 , Thanks for submitting the PR
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

To trigger all jobs: @mxnet-bot run ci [all]
To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-cpu, clang, edge, unix-cpu, centos-cpu, centos-gpu, website, windows-gpu, miscellaneous, unix-gpu, sanity]

Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

mseth10

LGTM

mseth10 · 2021-02-26T16:43:51Z

With this change, are we using CUDA 11.2 inside cu110 and cu102 docker containers?

Zha0q1 · 2021-02-26T18:17:35Z

> With this change, are we using CUDA 11.2 inside cu110 and cu102 docker containers?

No, we essentially just install the latest CUDA compatibility package. This should not affect the build, though.
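The reply above rests on the split between the CUDA toolkit and the driver: a compat package only swaps the driver-side `libcuda`, while `nvcc` and `libcudart` in the cu102/cu110 images keep their versions. One hypothetical way to observe this inside a container is to ask both libraries for their versions through `ctypes`. `cuDriverGetVersion` and `cudaRuntimeGetVersion` are real CUDA entry points, but the library lookup via `ctypes.util.find_library` is an assumption and may need adjusting to the image's library paths.

```python
import ctypes
import ctypes.util


def decode_cuda_version(value):
    """CUDA encodes versions as 1000 * major + 10 * minor, e.g. 11020 -> (11, 2)."""
    return value // 1000, (value % 1000) // 10


def _query(lib_name, func_name):
    """Call an `int f(int *)` CUDA version query; return None if unavailable."""
    path = ctypes.util.find_library(lib_name)
    if path is None:
        return None
    try:
        func = getattr(ctypes.CDLL(path), func_name)
    except (OSError, AttributeError):
        return None
    version = ctypes.c_int(0)
    # Both entry points return 0 on success and write the version through the pointer.
    if func(ctypes.byref(version)) != 0:
        return None
    return decode_cuda_version(version.value)


def driver_and_runtime_versions():
    """(driver API version from libcuda, which the compat package replaces,
    runtime version from libcudart, which comes from the image's toolkit)."""
    return (_query("cuda", "cuDriverGetVersion"),
            _query("cudart", "cudaRuntimeGetVersion"))
```

On a machine without CUDA libraries both entries are `None`; `decode_cuda_version(11020)` returns `(11, 2)`.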

…pache#19959)
* update cude compt for cd
* Update Dockerfile.build.ubuntu_gpu_cu102
* Update Dockerfile.build.ubuntu_gpu_cu102
* Update Dockerfile.build.ubuntu_gpu_cu110
* Update runtime_functions.sh
* Update Dockerfile.build.ubuntu_gpu_cu110
* Update Dockerfile.build.ubuntu_gpu_cu102

* [v1.x] Migrate to use ECR as docker cache instead of dockerhub (#19654)
* [v1.x] Update CI build scripts to install python 3.6 from deadsnakes repo (#19788)
* Install python3.6 from deadsnakes repo, since 3.5 is EOL'd and get-pip.py no longer works with 3.5.
* Set symlink for python3 to point to newly installed 3.6 version.
* Setting symlink or using update-alternatives causes add-apt-repository to fail, so instead just set alias in environment to call the correct python version.
* Setup symlinks in /usr/local/bin, since it comes first in the path.
* Don't use absolute path for python3 executable, just use python3 from path.
Co-authored-by: Joe Evans <joeev@amazon.com>
* Disable unix-gpu-cu110 pipeline for v1.x build since we now build with cuda 11.0 in windows pipelines. (#19828)
Co-authored-by: Joe Evans <joeev@amazon.com>
* [v1.x] For ECR, ensure we sanitize region input from environment variable (#19882)
* Set default for cache_intermediate.
* Make sure we sanitize region extracted from registry, since we pass it to os.system.
Co-authored-by: Joe Evans <joeev@amazon.com>
* [v1.x] Address CI failures with docker timeouts (v2) (#19890)
* Add random sleep only, since retry attempts are already implemented.
* Reduce random sleep to 2-10 sec.
Co-authored-by: Joe Evans <joeev@amazon.com>
* [v1.x] CI fixes to make more stable and upgradable (#19895)
* Test moving pipelines from p3 to g4.
* Remove fallback codecov command - the existing (first) command works and the second always fails a few times before finally succeeding (and also doesn't support the -P parameter, which causes an error.)
* Stop using docker python client, since it still doesn't support latest nvidia 'gpus' attribute. Switch to using subprocess calls using list parameter (to avoid shell injections). See docker/docker-py#2395
* Remove old files.
* Fix comment
* Set default environment variables
* Fix GPU syntax.
* Use subprocess.run and redirect output to stdout, don't run docker in interactive mode.
* Check if codecov works without providing parameters now.
* Send docker stderr to sys.stderr
* Support both nvidia-docker configurations, first try '--gpus all', and if that fails, then try '--runtime nvidia'.
Co-authored-by: Joe Evans <joeev@amazon.com>
* fix cd
* fix cudnn version for cu10.2 buiuld
* WAR the dataloader issue with forked processes holding stale references (#19924)
* skip some tests
* fix ski[
* [v.1x] Attempt to fix v1.x cd by installing new cuda compt package (#19959)
* update cude compt for cd
* Update Dockerfile.build.ubuntu_gpu_cu102
* Update Dockerfile.build.ubuntu_gpu_cu102
* Update Dockerfile.build.ubuntu_gpu_cu110
* Update runtime_functions.sh
* Update Dockerfile.build.ubuntu_gpu_cu110
* Update Dockerfile.build.ubuntu_gpu_cu102
* update command

Co-authored-by: Joe Evans <joseph.evans@gmail.com>
Co-authored-by: Joe Evans <joeev@amazon.com>
Co-authored-by: Joe Evans <github@250hacks.net>
Co-authored-by: Przemyslaw Tredak <ptredak@nvidia.com>
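The #19895 commit messages above describe replacing the docker Python client with `subprocess` calls that take a list of arguments, and trying `--gpus all` before falling back to `--runtime nvidia`. The sketch below illustrates that idea under stated assumptions; it is not the actual CI code. The function names are hypothetical, the probe command `nvidia-smi` is assumed to be available inside the container, and `runner` is injectable only so the logic can be exercised without Docker.

```python
import subprocess

# The two nvidia-docker configurations named in the commit log, in the order tried.
GPU_FLAG_VARIANTS = (["--gpus", "all"], ["--runtime", "nvidia"])


def build_docker_cmd(image, command, gpu_flags):
    """Build a non-interactive `docker run` argument list.

    Passing a list (rather than one string with shell=True) means no shell
    ever parses the arguments, which avoids shell injection.
    """
    return ["docker", "run", "--rm", *gpu_flags, image, *command]


def pick_gpu_flags(image, runner=subprocess.run):
    """Return the first GPU flag variant under which a probe container succeeds."""
    for flags in GPU_FLAG_VARIANTS:
        # Hypothetical probe: assumes nvidia-smi is available in the container.
        probe = runner(build_docker_cmd(image, ["nvidia-smi"], flags),
                       stdout=subprocess.DEVNULL)
        if probe.returncode == 0:
            return list(flags)
    raise RuntimeError("neither '--gpus all' nor '--runtime nvidia' worked")


def run_in_gpu_container(image, command, runner=subprocess.run):
    """Run `command` in `image`; stdout/stderr are inherited from this process."""
    flags = pick_gpu_flags(image, runner)
    return runner(build_docker_cmd(image, command, flags))
```

Probing with a cheap command first is a design choice of this sketch: retrying the full command on any failure would also rerun tests that failed for unrelated reasons.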

update cude compt for cd

c43e7d5

Zha0q1 requested review from aaronmarkham and marcoabreu as code owners February 25, 2021 22:25

lanking520 added the pr-awaiting-testing label Feb 25, 2021

Update Dockerfile.build.ubuntu_gpu_cu102

69bfea9

lanking520 added the pr-work-in-progress label and removed the pr-awaiting-testing label Feb 26, 2021

Zha0q1 closed this Feb 26, 2021

Zha0q1 added 3 commits February 25, 2021 18:42

Update Dockerfile.build.ubuntu_gpu_cu102

9b6a298

Update Dockerfile.build.ubuntu_gpu_cu110

4dba9d8

Update runtime_functions.sh

52d6503

Zha0q1 reopened this Feb 26, 2021

Zha0q1 closed this Feb 26, 2021

Zha0q1 added 2 commits February 25, 2021 18:45

Update Dockerfile.build.ubuntu_gpu_cu110

cedb28a

Update Dockerfile.build.ubuntu_gpu_cu102

4ed6ff8

Zha0q1 reopened this Feb 26, 2021

lanking520 added the pr-awaiting-testing and pr-work-in-progress labels and removed the pr-work-in-progress and pr-awaiting-testing labels Feb 26, 2021

Zha0q1 changed the title ~~[wip] [v.1x] Attempt to fix v1.x cd by installing new cuda compt package~~ [v.1x] Attempt to fix v1.x cd by installing new cuda compt package Feb 26, 2021

lanking520 added the pr-awaiting-review label and removed the pr-awaiting-testing label Feb 26, 2021

mseth10 approved these changes Feb 26, 2021

Zha0q1 merged commit 87d7306 into apache:v1.x Feb 26, 2021

Zha0q1 merged 7 commits into apache:v1.x from Zha0q1:v1.x_clone2

4 participants