Skip to content

Comments

Add tests for the NVIDIA GPU tests#433

Merged
sayanchowdhury merged 2 commits intoflatcar-masterfrom
sayan/add-nvidia-test
Jun 16, 2023
Merged

Add tests for the NVIDIA GPU tests#433
sayanchowdhury merged 2 commits intoflatcar-masterfrom
sayan/add-nvidia-test

Conversation

@sayanchowdhury
Copy link
Member

@sayanchowdhury sayanchowdhury commented May 3, 2023

Add tests for the NVIDIA GPU tests

Testing done

  • Changelog entries added in the respective changelog/ directory (user-facing change, bug fix, security fix, update)
  • Inspected CI output for image differences: /boot and /usr size, packages, list files for any missing binaries, kernel modules, config files, kernel modules, etc.

}

func verifyNvidiaInstallation(c cluster.TestCluster) {
m := c.Machines()[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With

Suggested change
m := c.Machines()[0]
if kola.AzureOptions.Size != "Standard_NC6s_v3" {
c.Skip("skipping due to wrong instance size")
}
m := c.Machines()[0]

I think we could do this in scripts:

diff --git a/ci-automation/vendor-testing/azure.sh b/ci-automation/vendor-testing/azure.sh
index 17081b3598..c57210d651 100755
--- a/ci-automation/vendor-testing/azure.sh
+++ b/ci-automation/vendor-testing/azure.sh
@@ -74,7 +74,7 @@ query_kola_tests() {
     kola list --platform=azure --filter "${@}"
 }
 
-other_instance_types=()
+other_instance_types=('Standard_NC6s_v3')
 if [[ "${CIA_ARCH}" = 'amd64' ]]; then
     other_instance_types+=('V1')
 fi
@@ -85,6 +85,6 @@ run_kola_tests_on_instances \
     "${CIA_FIRST_RUN}" \
     "${other_instance_types[@]}" \
     '--' \
-    'cl.internet' \
+    'cl.internet' 'cl.misc.nvidia' \
     '--' \
     "${@}"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to keep this check, we could even move it to the SkipFunc (as kola.AzureOptions.Size should be accessible)

@sayanchowdhury sayanchowdhury force-pushed the sayan/add-nvidia-test branch 2 times, most recently from 811dc81 to 3def6fb Compare June 13, 2023 12:46
@sayanchowdhury sayanchowdhury changed the title wip: add tests for the nvidia gpu tests Add tests for the nvidia gpu tests Jun 16, 2023
@sayanchowdhury sayanchowdhury changed the title Add tests for the nvidia gpu tests Add tests for the NVIDIA GPU tests Jun 16, 2023
@sayanchowdhury sayanchowdhury requested review from a team, jepio, pothos and tormath1 June 16, 2023 02:52
// This test is to test the NVIDIA installation, limited to AZURE for now
Platforms: []string{"azure"},
Architectures: []string{"amd64"},
Flags: []register.Flag{register.NoEnableSelinux},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance to track the avc message? (for issue tracking and/or documentation purpose?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! I will need to search the logs, if it's not there then I can trigger a local/Jenkins build to track the avc messages for tracking on a issue.

Copy link
Contributor

@tormath1 tormath1 Jun 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should see it in the journal logs (before disabling the SELinux flags)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I meant searching the journal logs if they are not deleted on bincache.

Comment on lines +41 to +44
out, err := c.SSH(m, "systemctl is-active nvidia.service")
if !bytes.Contains(out, []byte("inactive")) {
return fmt.Errorf("nvidia.service: %q: %v", out, err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI, you could do this way:

Suggested change
out, err := c.SSH(m, "systemctl is-active nvidia.service")
if !bytes.Contains(out, []byte("inactive")) {
return fmt.Errorf("nvidia.service: %q: %v", out, err)
}
_, err := c.SSH(m, "systemctl --quiet is-active nvidia.service")
return err

we just want to assert if the unit is active or not, if the unit is active with --quiet it will simply return 0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also _ = c.MustSSH(…) which fails the test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct but in this case we don't want to fail the test as we are in a retry loop no?

Signed-off-by: Sayan Chowdhury <schowdhury@microsoft.com>
Signed-off-by: Sayan Chowdhury <schowdhury@microsoft.com>
@sayanchowdhury sayanchowdhury force-pushed the sayan/add-nvidia-test branch from d04c5d0 to 4cd37d7 Compare June 16, 2023 13:16
@sayanchowdhury sayanchowdhury merged commit 4771cb7 into flatcar-master Jun 16, 2023
@sayanchowdhury sayanchowdhury deleted the sayan/add-nvidia-test branch June 16, 2023 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants