Problem with nvidia-fs: driver not loaded, symbol not found

Hello everyone,

After a recent (necessary) kernel upgrade on one of our servers, we are experiencing problems with NVIDIA GPUDirect Storage.
The server is running Ubuntu 22.04 with a 6.6.5 kernel.

The CUDA Installation Guide for Linux notes that there are special package version restrictions for servers not running the NVIDIA open kernel driver. As I understand it, the GPUs on this server (V100s) are not supported by the NVIDIA open kernel driver, so we installed nvidia-gds-12-1. The NVIDIA driver version is 535 (packages nvidia-dkms-535, nvidia-driver-535).

Since the nvidia-fs version pulled in by apt is too new (2.18.3), we manually installed the nvidia-fs 2.17.4 DKMS module from https://github.com/NVIDIA/gds-nvidia-fs/archive/refs/tags/v2.17.4.zip. We confirmed that the active kernel module is version 2.17.4.

In this setup, we can load the nvidia-fs module …

# in dmesg:
nvidia_fs: Initializing nvfs driver module
nvidia_fs: registered correctly with major number 509

… but we cannot run, for example, the gdscheck tool:

$ /usr/local/cuda-12.1/gds/tools/gdscheck -p
 Platform verification error :
nvidia-fs driver is not loaded
# in dmesg:
failing symbol_get of non-GPLONLY symbol nvidia_p2p_dma_unmap_pages.
nvidia-fs:Unable to find symbol: nvidia_p2p_dma_unmap_pages
nvidia-fs:Could not load nvidia_p2p* symbols

Here are the symbols included in the running nvidia.ko, which seem to include the reported missing symbol:

$ nm -a nvidia.ko | grep nvidia_p2p_dma
0000000000000134 r __crc_nvidia_p2p_dma_map_pages
0000000000000138 r __crc_nvidia_p2p_dma_unmap_pages
0000000000000070 r __export_symbol_nvidia_p2p_dma_map_pages
0000000000000080 r __export_symbol_nvidia_p2p_dma_unmap_pages
00000000000000d8 r __kstrtabns_nvidia_p2p_dma_map_pages
00000000000000f4 r __kstrtabns_nvidia_p2p_dma_unmap_pages
00000000000000bf r __kstrtab_nvidia_p2p_dma_map_pages
00000000000000d9 r __kstrtab_nvidia_p2p_dma_unmap_pages
000000000000039c r __ksymtab_nvidia_p2p_dma_map_pages
00000000000003a8 r __ksymtab_nvidia_p2p_dma_unmap_pages
000000000000eb10 T nvidia_p2p_dma_map_pages
000000000000d9c0 T nvidia_p2p_dma_unmap_pages
000000000000eb00 T __pfx_nvidia_p2p_dma_map_pages
000000000000d9b0 T __pfx_nvidia_p2p_dma_unmap_pages
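The dmesg line above complains about a "non-GPLONLY" symbol rather than a missing one, so it may be more telling to look at how the symbol is exported than at whether it exists. The following is only a rough sketch: `export_section` is a hypothetical helper (not part of any NVIDIA tooling), and it assumes that `objdump -t` lists each `__ksymtab_<symbol>` entry with its section name in the third-from-last column, with `__ksymtab` meaning a plain export and `__ksymtab_gpl` a GPL-only export.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `objdump -t <module>.ko` output on stdin and
# prints the section that holds the __ksymtab_<symbol> entry.
# Assumption: "__ksymtab" = plain export, "__ksymtab_gpl" = GPL-only export.
export_section() {
    local sym="$1"
    # The symbol name is the last field, its size the one before it,
    # and the section name the one before that.
    awk -v s="__ksymtab_${sym}" '
        $NF == s { print $(NF-2); found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

It could be used as `objdump -t nvidia.ko | export_section nvidia_p2p_dma_unmap_pages`. If this prints `__ksymtab`, that would be consistent with the "non-GPLONLY" wording in the kernel log.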

Any idea why nvidia_fs is not working as expected?
Let me know if I can provide any additional information, and thanks in advance!

Possibly related post:

In the meantime, we could not find a suitable solution for this and switched back to an older kernel.

We do not support nvidia-fs 2.17.4 with Linux kernel versions 6.6.x. To get nvidia-fs 2.17.4 working with V100, you would need to move to kernel version 6.2 or below.
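A quick pre-flight check along these lines could catch the mismatch before installing the module. This is a minimal sketch, assuming GNU `sort -V` is available; `kernel_at_most` is a hypothetical helper name, and the 6.2 threshold is simply the value given in this reply, not something read from the nvidia-fs sources.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the kernel release in $1 (for example
# the output of `uname -r`) is at or below the major.minor version in $2.
kernel_at_most() {
    local release="${1%%-*}"   # drop a suffix such as "-39-generic"
    local limit="$2"
    local major_minor
    major_minor=$(printf '%s\n' "$release" | cut -d. -f1,2)
    # With a version sort, the limit must end up last (or equal).
    [ "$(printf '%s\n%s\n' "$major_minor" "$limit" | sort -V | tail -n 1)" = "$limit" ]
}
```

For example, `kernel_at_most "$(uname -r)" 6.2 || echo "kernel is newer than 6.2"` would print the warning on the 6.6.5 kernel from the original post.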

Thanks, we switched back to a 6.2.0 kernel.
