This ebuild probably should get a bump to -r1.
That makes sense in general, but would you be fine making an exception in this case since the kernel version is bumped every couple of days automatically?
Force-pushed from 5e8b137 to 21665e5.
Turns out we need the depmod information for NVIDIA GPU operator, so I removed this commit.
Build action triggered: https://github.com/flatcar/scripts/actions/runs/13854194966
I'm still iterating on testing this PR with this new kola test: flatcar/mantle#583
Azure test on NC8as_T4_v3: green http://jenkins.infra.kinvolk.io:8080/job/container/job/test/32700/console Testing used this mantle build:
...r/src/third_party/coreos-overlay/x11-drivers/nvidia-drivers/nvidia-drivers-535.230.02.ebuild
...er/src/third_party/coreos-overlay/x11-drivers/nvidia-drivers/nvidia-drivers-570.86.15.ebuild
Force-pushed from 76585b7 to 3e2f797.
Use `uname -m` to fetch the correct driver installer for aarch64 or x86_64. Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
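A minimal sketch of the architecture lookup described in this commit message. The helper name `installer_name` is hypothetical, and the `NVIDIA-Linux-<arch>-<version>.run` file naming is an assumption based on NVIDIA's public installer names, not something this PR confirms.

```shell
#!/bin/sh
# Hypothetical helper: build the installer file name for a driver version.
# The second argument defaults to the running machine's architecture.
installer_name() {
    version="$1"
    arch="${2:-$(uname -m)}"
    case "$arch" in
        x86_64|aarch64)
            # Assumed naming scheme: NVIDIA-Linux-<arch>-<version>.run
            printf 'NVIDIA-Linux-%s-%s.run\n' "$arch" "$version"
            ;;
        *)
            echo "unsupported architecture: $arch" >&2
            return 1
            ;;
    esac
}
```

Calling `installer_name 570.86.15` without a second argument uses whatever `uname -m` reports on the host.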
Users have reported that in some cases the nvidia.service fails because /opt/nvidia/current is a directory and the symbolic link gets created inside it. I have no idea how we get there, but to make the service robust in the face of this kind of issue: - remove the directory if it exists - use `-T` with ln to ensure that symbolic link creation fails if `current` is a directory Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
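The two steps in this commit message can be sketched as follows. This is an illustration only: the helper name `relink` is hypothetical, the real unit may structure this differently, and `-T` assumes GNU coreutils `ln`.

```shell
#!/bin/sh
# Hypothetical helper: point LINK at TARGET, replacing whatever is there.
relink() {
    target="$1"
    link="$2"
    # If a real directory (not a symlink) sits at the link path, remove it;
    # otherwise ln would create the new link inside that directory.
    if [ -d "$link" ] && [ ! -L "$link" ]; then
        rm -rf "$link"
    fi
    # -s: symbolic link, -f: replace an existing link,
    # -T: always treat LINK as the link name, so ln fails instead of
    #     descending into LINK if it is still a directory.
    ln -sfT "$target" "$link"
}
```

With `-T`, a leftover directory at the link path makes `ln` fail loudly rather than silently creating the link one level too deep.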
This saves space at runtime. Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Force-pushed from 3e2f797 to 096d688.
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Installers for 570 sometimes default to Open drivers, which we can't support properly at this time. Force proprietary drivers. There are also additional options that suppress certain worrisome error strings - enable those if supported too. Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
The nspawn container runs in its own scope, and journal output is then associated with that scope. By passing `--keep-unit` we can guarantee that all log output will stay associated with the nvidia.service and can be viewed by running `journalctl -u nvidia.service`. Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
So that we can pick up kmods contained in sysexts (like zfs) and generate complete module dependency information. I thought we could skip running depmod for nvidia drivers because we manually insmod them, but NVIDIA's GPU operator driver validation expects to be able to run modprobe, so we have to generate the dependency information. Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
The R535 driver branch, which is LTS, does not compile on arm64 with GCC 14/kernel 6.6. Keep amd64 on R535 and switch arm64 to R570 by default. R570 is the first driver version that I found that is currently supported and works for arm64. Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
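A rough sketch of a per-architecture default, assuming the version numbers from the ebuild file names in this PR (535.230.02 and 570.86.15) are the defaults in question. The function name `default_driver_version` is hypothetical and the actual selection logic in nvidia.service may differ.

```shell
#!/bin/sh
# Hypothetical helper: choose a default driver version per architecture.
# Version numbers are assumed from the ebuild file names in this PR.
default_driver_version() {
    case "${1:-$(uname -m)}" in
        x86_64)
            # amd64 stays on the R535 LTS branch.
            echo "535.230.02"
            ;;
        aarch64)
            # arm64 moves to R570, since R535 does not build there.
            echo "570.86.15"
            ;;
        *)
            echo "no default driver version for this architecture" >&2
            return 1
            ;;
    esac
}
```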
Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Force-pushed from 096d688 to e313934.
Added changelog. Ran through final testing last week on amd64 (azure,aws) and arm64 (aws) and tests were successful:
Mantle PR changes to testing are independent of this, and will follow. Merging for next week's release.
nvidia.service arm64 support & fixes Signed-off-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Cherry-picked to beta as well.
nvidia.service arm64 support & fixes
Add support for arm64 to nvidia.service and fix other related issues. Here is a brief overview of all changes:
- The `current` symlink may end up created as a directory, which breaks the unit silently. No idea how this can happen but handle this case.
- Remove depmod generation. We didn't depend on this being done and this was broken because those parts of the filesystem are readonly in the nspawn container. See also: nvidia-driver sysext hides zfs modules Flatcar#1576
How to use
[ describe what reviewers need to do in order to validate this PR ]
Testing done
Tested on g5g.xlarge instances in AWS during development, but need to script this or repeat with the final PR result.
Jenkins build (covers Azure GPU instances) running here: http://jenkins.infra.kinvolk.io:8080/job/container/job/packages_all_arches/5457/cldsv/
- `changelog/` directory (user-facing change, bug fix, security fix, update)
- `/boot` and `/usr` size, packages, list files for any missing binaries, kernel modules, config files, etc.