This repository was archived by the owner on Nov 17, 2023. It is now read-only.

[master] CI/CD updates to be more stable #20740

Merged
josephevans merged 4 commits into apache:master from josephevans:cd_pipeline_fixes
Dec 16, 2021


@josephevans
Contributor

@josephevans josephevans commented Nov 12, 2021

This PR fixes a few issues with CI/CD:

  1. When multiple processes try to install a pip package at the same time, a race condition causes them to fail intermittently. Since the website S3 push and publish steps do not run inside a container, just use the awscli already installed on the Jenkins slave (which is up to date).
  2. Remove the oneDNN repository after installing oneDNN. Otherwise, if the oneDNN repository becomes corrupt (or has sync issues), every apt call fails, which breaks all CI pipelines.
  3. Update the CUDA architectures built for Windows. Include 7.5 for Turing (the GPUs on g4 instances) so we can migrate Windows CI to these instances.
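
For illustration only, items 1 and 3 can be sketched in Python. This is a minimal sketch under stated assumptions, not the code changed in this PR: the helper names `resolve_cli` and `gencode_flags` and the example architecture list are hypothetical. The first helper looks up a tool that is already on the host's PATH instead of installing it at job time, which sidesteps concurrent pip installs. The second expands compute capabilities such as 7.5 into nvcc `-gencode` flags.

```python
import shutil


def resolve_cli(name: str) -> str:
    """Return the path of an already-installed CLI instead of installing it.

    Hypothetical helper: looking the tool up on PATH means parallel jobs
    never run `pip install` against the same environment at the same time.
    """
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            f"{name} was not found on PATH; install it on the host image "
            "rather than at job time"
        )
    return path


def gencode_flags(archs):
    """Expand compute capabilities like "7.5" into nvcc -gencode flags.

    Hypothetical helper: the real build may pass architectures differently.
    """
    flags = []
    for arch in archs:
        major, minor = arch.split(".")
        sm = f"{major}{minor}"
        flags.append(f"-gencode arch=compute_{sm},code=sm_{sm}")
    return flags


if __name__ == "__main__":
    # The architecture list here is an assumption for the example;
    # only 7.5 (Turing) is named in this PR.
    for flag in gencode_flags(["7.0", "7.5"]):
        print(flag)
```

Failing fast when the tool is missing is a design choice in this sketch: it surfaces a misconfigured host immediately instead of falling back to an install step that could race again.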

@mxnet-bot

Hey @josephevans, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [centos-cpu, sanity, windows-gpu, unix-cpu, windows-cpu, miscellaneous, website, unix-gpu, centos-gpu, clang, edge]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress labels Nov 12, 2021
@mseth10 mseth10 added pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress labels Nov 13, 2021
@josephevans
Contributor Author

> I think the right approach would be to move the stuff into the container rather than adding another bandaid. The fact that anything runs on the host in the first place is already an issue.

Thanks for the suggestions. At this point, we do not have the resources to refactor this. If you would like to propose a solution and file a PR, we would greatly appreciate it.

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Nov 15, 2021
@josephevans josephevans changed the title from "[master] Fix flaky CD pipeline for website builds." to "[master] CI/CD updates to be more stable" on Dec 2, 2021
@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review pr-work-in-progress PR is still work in progress and removed pr-awaiting-review PR is waiting for code review pr-awaiting-testing PR is reviewed and waiting CI build and test labels Dec 2, 2021
… use the awscli installed in the jenkins slave (which is updated.) When multiple processes are attempting to install a pip package at the same time, there is a race condition that causes them to fail often.
Contributor

@waytrue17 waytrue17 left a comment


LGTM, thanks


Labels

pr-awaiting-review PR is waiting for code review


5 participants