We did a quick check today with your guide above, and here’s the failure output we observed. We really appreciate your continued guidance and support on this!
Preparing to unpack …/nvidia-fs-dkms_2.17.3-1_amd64.deb …
Unpacking nvidia-fs-dkms (2.17.3-1) …
Setting up nvidia-fs-dkms (2.17.3-1) …
Deprecated feature: CLEAN (/usr/src/nvidia-fs-2.17.3/dkms.conf)
Deprecated feature: REMAKE_INITRD (/usr/src/nvidia-fs-2.17.3/dkms.conf)
Deprecated feature: CLEAN (/etc/dkms/nvidia-fs.conf)
Deprecated feature: REMAKE_INITRD (/etc/dkms/nvidia-fs.conf)
Creating symlink /var/lib/dkms/nvidia-fs/2.17.3/source → /usr/src/nvidia-fs-2.17.3
Deprecated feature: CLEAN (/var/lib/dkms/nvidia-fs/2.17.3/source/dkms.conf)
Deprecated feature: REMAKE_INITRD (/var/lib/dkms/nvidia-fs/2.17.3/source/dkms.conf)
Deprecated feature: CLEAN (/etc/dkms/nvidia-fs.conf)
Deprecated feature: REMAKE_INITRD (/etc/dkms/nvidia-fs.conf)
Sign command: /usr/bin/kmodsign
Signing key: /var/lib/shim-signed/mok/MOK.priv
Public certificate (MOK): /var/lib/shim-signed/mok/MOK.der
Error! Aborting build of module nvidia-fs/2.17.3 for kernel 5.15.0-1091-nvidia (x86_64) due to missing BUILD_DEPENDS: nvidia.
You may override by specifying --force.
modprobe: FATAL: Module nvidia-fs not found in directory /lib/modules/5.15.0-1091-nvidia
*** Reboot your computer and verify that the NVIDIA filesystem driver ***
*** can be loaded.***
We’re currently running the optimized NVIDIA kernel with the proprietary GPU driver, and we’ve pinned the cuFile version to 2.17, following the DGX OS 6 guide:
Step 1, with the config in `/etc/apt/preferences.d/nvidia-fs`
As far as I understand, it shouldn’t require any DKMS modules—but please correct me if I’m wrong. I’m wondering if there’s a mismatch between the cuFile/GDS/nvidia-fs versions and our DGX OS kernel + proprietary GPU driver version.
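For anyone reading the log above: as far as I know, `BUILD_DEPENDS` in a `dkms.conf` names other DKMS-managed modules that must already be registered for the target kernel, so the error suggests DKMS does not see a DKMS-built `nvidia` module on this system. Here is a minimal sketch that pulls the missing dependency out of the error line and checks it against `dkms status`. The helper name `missing_build_depends` is my own, and the `grep` pattern assumes the usual `name/version` or `name, version` layout of `dkms status` output.

```shell
#!/bin/sh
# Hypothetical helper: read DKMS output on stdin and print the module
# named after "missing BUILD_DEPENDS:".
missing_build_depends() {
    sed -n 's/.*missing BUILD_DEPENDS: \([A-Za-z0-9_-]*\).*/\1/p'
}

# The error line copied from the install log above.
log_line='Error! Aborting build of module nvidia-fs/2.17.3 for kernel 5.15.0-1091-nvidia (x86_64) due to missing BUILD_DEPENDS: nvidia.'

dep=$(printf '%s\n' "$log_line" | missing_build_depends)
echo "missing DKMS dependency: $dep"

# If dkms is available, check whether that module is registered with it.
# Assumption: status lines start with "<name>/" or "<name>,".
if command -v dkms >/dev/null 2>&1; then
    dkms status 2>/dev/null | grep -E "^$dep[/,]" \
        || echo "no DKMS-managed '$dep' module registered"
fi
```

If the last line reports that nothing is registered, the GPU driver is most likely installed as prebuilt kernel modules rather than through DKMS, which would explain why the nvidia-fs DKMS build refuses to start.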
—
I also appended the full output of gdscheck, taken before rebuilding the nvidia-fs kernel module:
This is the core contradiction I’ve identified:
The DGX OS 6 GDS guide assumes no DKMS is used.
The nvidia-fs driver’s build process depends on DKMS.
Do you have any suggestions on how to resolve this contradiction? From your message, I gather that the P2P-change may have caused the package manager to install an invalid nvidia-fs driver. I’m wondering if it’s possible to rebuild the driver without relying on DKMS.
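One way to see which side of this contradiction a given machine is on is to look at where each module file lives. This is only a sketch under an assumption: on Ubuntu-based systems DKMS normally installs its builds under `/lib/modules/<kernel>/updates/dkms/`, so any other location is treated as a prebuilt (packaged) module. The function name `classify_module_path` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: classify a module by the path modinfo reports.
# Assumption: DKMS builds land under .../updates/dkms/ on Ubuntu.
classify_module_path() {
    case "$1" in
        "")               echo "missing" ;;
        */updates/dkms/*) echo "dkms" ;;
        *)                echo "prebuilt" ;;
    esac
}

for mod in nvidia nvidia-fs; do
    path=""
    if command -v modinfo >/dev/null 2>&1; then
        # -F filename prints only the path of the module file.
        path=$(modinfo -F filename "$mod" 2>/dev/null || true)
    fi
    echo "$mod: $(classify_module_path "$path") $path"
done
```

If `nvidia` shows up as prebuilt while `nvidia-fs` is missing, that matches the failure in the log: the nvidia-fs DKMS package expects a DKMS-built GPU driver that is not there.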
Thanks. I think you are suggesting that I build the kernel driver from GitHub (NVIDIA/gds-nvidia-fs: NVIDIA GPUDirect Storage Driver)? However, I couldn’t find version 2.17.3 in any of the GitHub branches or releases, so I didn’t proceed with the build, mainly because I’m unsure which version is correct.
Given our setup (DGX-A100 running OS 6 with the optimized NVIDIA kernel 5.15.0-1091-nvidia and the proprietary GPU driver 580.65.06), I’m wondering which branch of nvidia-fs I should actually build. (I’m fine with building from a different branch, but for a new release of GDS the driver would need to be switched to the open-source one.)
—
The header is present, but in a different folder:
$ ls /usr/src/nvidia-srv-580.95.05/nvidia/nv-p2p.h
/usr/src/nvidia-srv-580.95.05/nvidia/nv-p2p.h
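In case the header sits somewhere else on another system, a small search like the one below can locate it. This is just a sketch: the function name `find_nv_p2p` is made up, and `/usr/src` as the default search root is an assumption based on the path shown above.

```shell
#!/bin/sh
# Hypothetical helper: list every nv-p2p.h under a source root.
# Defaults to /usr/src, where the driver sources were found above.
find_nv_p2p() {
    find "${1:-/usr/src}" -type f -name nv-p2p.h 2>/dev/null
}

find_nv_p2p /usr/src || true
```

More than one hit usually means several driver source packages are installed side by side, so it is worth checking that the directory matches the driver version that is actually loaded.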
We have `nvidia-kernel-source-580-server` installed instead of the package you listed. We also have the proprietary GPU driver on DGX OS 6, so I’m not sure the open one fits here:
$ sudo apt install nvidia-dkms-${NVIDIA_DRV_VERSION}-server --dry-run
Reading package lists… Done
Building dependency tree… Done
Reading state information… Done
The following NEW packages will be installed:
nvidia-dkms-580-server
0 upgraded, 1 newly installed, 0 to remove and 39 not upgraded.
Inst nvidia-dkms-580-server (580.95.05-0ubuntu0.22.04.2 Ubuntu:22.04/jammy-updates, Ubuntu:22.04/jammy-security [amd64])
Conf nvidia-dkms-580-server (580.95.05-0ubuntu0.22.04.2 Ubuntu:22.04/jammy-updates, Ubuntu:22.04/jammy-security [amd64])
Switched the GPU kernel driver from the proprietary variant to the open-source variant, staying on the same R580 version.
Re-ran mlnxofedinstall.
Installed nvidia-fs-dkms. However, `apt install -y --reinstall nvidia-fs-dkms` completed very quickly, so I think no recompilation or rebuild occurred.
After that, it works with the new version of nvidia-fs and GDS.
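To double-check the end state after steps like these, something along the following lines may help. It is a sketch, not an official procedure: the function `classify_nvidia_license` is hypothetical, and it assumes the open kernel modules report the license string `Dual MIT/GPL` while the proprietary ones report `NVIDIA`. The `dkms status` call at the end shows whether nvidia-fs was actually built for the running kernel, which is useful when a reinstall finishes suspiciously fast.

```shell
#!/bin/sh
# Hypothetical helper: map the nvidia module's license string to a flavor.
# Assumption: open modules report "Dual MIT/GPL", proprietary "NVIDIA".
classify_nvidia_license() {
    case "$1" in
        "Dual MIT/GPL") echo "open" ;;
        NVIDIA)         echo "proprietary" ;;
        *)              echo "unknown" ;;
    esac
}

if command -v modinfo >/dev/null 2>&1; then
    license=$(modinfo -F license nvidia 2>/dev/null || true)
    echo "nvidia flavor: $(classify_nvidia_license "$license")"
    echo "nvidia-fs version: $(modinfo -F version nvidia-fs 2>/dev/null || echo 'not found')"
fi

# List DKMS-registered modules and their build state per kernel.
if command -v dkms >/dev/null 2>&1; then
    dkms status || true
fi
```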