Hi,
I am currently trying to set up a lab with 4 H100 NVL GPUs and get GPUDirect Storage (GDS) working against local and remote NVMe drives.
I have followed the CUDA and NVIDIA driver installation guides:
$ nvidia-smi
Thu Sep 4 22:33:04 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 NVL On | 00000000:03:00.0 Off | 0 |
| N/A 39C P0 60W / 400W | 0MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 NVL On | 00000000:0B:00.0 Off | 0 |
| N/A 38C P0 61W / 400W | 0MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 NVL On | 00000000:61:00.0 Off | 0 |
| N/A 39C P0 62W / 400W | 0MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 NVL On | 00000000:69:00.0 Off | 0 |
| N/A 40C P0 61W / 400W | 0MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
But running anything that requires CUDA results in an error:
$ ./build/Samples/1_Utilities/deviceQuery/deviceQuery
./build/Samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 3
-> initialization error
Result = FAIL
$ ./tools/gdscheck -p
cuInit Failed, error CUDA_ERROR_NOT_INITIALIZED
cuFile initialization failed
Platform verification error :
CUDA Driver API error
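To isolate whether cuInit itself is failing (independent of the samples and gdscheck), I use a minimal driver-API probe along these lines (just a sketch of my own; the file name and build line are mine, compiled with nvcc and linked against libcuda):

// cuinit_check.c -- minimal CUDA driver API probe
// build (assumption): nvcc -o cuinit_check cuinit_check.c -lcuda
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    // Same entry point gdscheck reports as failing
    CUresult rc = cuInit(0);
    if (rc != CUDA_SUCCESS) {
        const char *name = NULL;
        cuGetErrorName(rc, &name);
        printf("cuInit failed: %s (%d)\n", name ? name : "unknown", (int)rc);
        return 1;
    }
    int count = 0;
    rc = cuDeviceGetCount(&count);
    printf("cuDeviceGetCount -> %d device(s), rc=%d\n", count, (int)rc);
    return 0;
}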
Switching to the proprietary NVIDIA drivers actually solves this issue!
However, I then cannot use GDS: nvidia-fs has been open source since version 2.17.5 (I am currently on 2.26), so it cannot interface with the symbols exported by the proprietary kernel modules.
I am therefore reaching out here in the hope of finding a solution.
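For context, what I ultimately need working on these machines is roughly the following cuFile sequence (a minimal sketch with error handling trimmed; the file path is just a placeholder for one of the NVMe-backed mounts):

// gds_read_sketch.c -- read the first 1 MiB of a file directly into GPU memory
// build (assumption): nvcc -o gds_read_sketch gds_read_sketch.c -lcufile
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main(void)
{
    const char *path = "/mnt/nvme0/testfile";   // placeholder path
    const size_t size = 1 << 20;                 // 1 MiB

    int fd = open(path, O_RDONLY | O_DIRECT);    // GDS needs O_DIRECT
    if (fd < 0) { perror("open"); return 1; }

    // Open the cuFile driver (true GDS needs a usable nvidia-fs module)
    CUfileError_t st = cuFileDriverOpen();
    if (st.err != CU_FILE_SUCCESS) { fprintf(stderr, "cuFileDriverOpen: %d\n", st.err); return 1; }

    CUfileDescr_t descr = {0};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t fh;
    st = cuFileHandleRegister(&fh, &descr);
    if (st.err != CU_FILE_SUCCESS) { fprintf(stderr, "cuFileHandleRegister: %d\n", st.err); return 1; }

    void *devbuf = NULL;
    cudaMalloc(&devbuf, size);
    cuFileBufRegister(devbuf, size, 0);

    // DMA from file offset 0 into the GPU buffer at offset 0
    ssize_t n = cuFileRead(fh, devbuf, size, 0, 0);
    printf("cuFileRead returned %zd\n", n);

    cuFileBufDeregister(devbuf);
    cudaFree(devbuf);
    cuFileHandleDeregister(fh);
    cuFileDriverClose();
    close(fd);
    return 0;
}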
$ dmesg | grep nvidia
[ 10.297171] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[ 10.300673] nvidia 0000:61:00.0: enabling device (0000 -> 0002)
[ 10.311994] nvidia 0000:69:00.0: enabling device (0000 -> 0002)
[ 10.330656] nvidia 0000:03:00.0: enabling device (0000 -> 0002)
[ 10.351994] nvidia 0000:0b:00.0: enabling device (0000 -> 0002)
[ 10.487855] Backport based on https://:@git-nbu.nvidia.com/r/a/mlnx_ofed/mlnx-ofa_kernel-4.0.git eb6cb58
[ 10.488744] compat.git: https://:@git-nbu.nvidia.com/r/a/mlnx_ofed/mlnx-ofa_kernel-4.0.git
[ 10.664830] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64 580.82.07 Release Build (dvs-builder@U22-I3-B07-03-2) Wed Aug 27 18:06:05 UTC 2025
[ 21.293809] nvidia_fs: no symbol version for nvidia_p2p_dma_unmap_pages
[ 21.300609] [drm] [nvidia-drm] [GPU ID 0x00006100] Loading driver
[ 21.309358] nvidia_fs: Initializing nvfs driver module
[ 21.309863] nvidia_fs: registered correctly with major number 511
[ 23.681891] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:61:00.0 on minor 1
[ 23.681915] nvidia 0000:61:00.0: [drm] No compatible format found
[ 23.681919] nvidia 0000:61:00.0: [drm] Cannot find any crtc or sizes
[ 23.681942] [drm] [nvidia-drm] [GPU ID 0x00006900] Loading driver
[ 25.917517] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:69:00.0 on minor 2
[ 25.917539] nvidia 0000:69:00.0: [drm] No compatible format found
[ 25.917542] nvidia 0000:69:00.0: [drm] Cannot find any crtc or sizes
[ 25.917573] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[ 28.156453] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 3
[ 28.156476] nvidia 0000:03:00.0: [drm] No compatible format found
[ 28.156478] nvidia 0000:03:00.0: [drm] Cannot find any crtc or sizes
[ 28.156508] [drm] [nvidia-drm] [GPU ID 0x00000b00] Loading driver
[ 30.398077] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0b:00.0 on minor 4
[ 30.398090] nvidia 0000:0b:00.0: [drm] No compatible format found
[ 30.398092] nvidia 0000:0b:00.0: [drm] Cannot find any crtc or sizes
$ lsmod | grep nvidia
nvidia_uvm 2158592 0
nvidia_peermem 16384 0
ib_uverbs 200704 2 nvidia_peermem,mlx5_ib
nvidia_fs 274432 0
nvidia_drm 139264 0
nvidia_modeset 1744896 1 nvidia_drm
video 77824 1 nvidia_modeset
nvidia 14368768 21 nvidia_uvm,nvidia_peermem,nvidia_fs,nvidia_modeset
ecc 45056 1 nvidia
$ dmesg | grep -i iommu
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.0-79-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro amd_iommu=off
[ 0.559183] Kernel command line: BOOT_IMAGE=/vmlinuz-6.8.0-79-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro amd_iommu=off
[ 6.013241] iommu: Default domain type: Translated
[ 6.013241] iommu: DMA domain TLB invalidation policy: lazy mode