Hi GDS team,
This is essentially a copy of an issue in the nvidia-fs driver repo (GPUDirect Storage read fails with kernel error · Issue #66 · NVIDIA/gds-nvidia-fs · GitHub), but I wanted to reach out here as well, since I believe the main support team is more active on this channel.
—
The HW setup:
- NVIDIA DGX A100
- NVIDIA DGX Server Version 6.3.2 (GNU/Linux 5.15.0-1091-nvidia x86_64)
- NVIDIA Driver 580.65.06 (non-open driver) / CUDA Toolkit 13.0 / libcufile(-dev)-13-0
- nvidia-fs: 2.17.4
Dmesg:
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: nvidia-fs:nvfs_pin_gpu_pages:1341 Incompatible page table version 0x00020000
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: nvidia-fs:nvfs_pin_gpu_pages:1341 Incompatible page table version 0x00020000
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: nvidia-fs:nvfs_pin_gpu_pages:1341 Incompatible page table version 0x00020000
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: ------------[ cut here ]------------
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: kernel BUG at /build/linux-nvidia-wAkuFl/linux-nvidia-5.15.0/debian/build/build-nvidia/_______________________dkms/build/nvidia-fs/2.17.5-4/build/nvfs-stat.c:407!
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: invalid opcode: 0000 [#1] SMP NOPTI
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: CPU: 64 PID: 43651 Comm: gdsio Tainted: P OE 5.15.0-1091-nvidia #92-Ubuntu
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: Hardware name: NVIDIA DGXA100 920-23687-2530-000/DGXA100, BIOS 1.13 03/21/2022
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: RIP: 0010:nvfs_update_free_gpustat+0xbc/0xc0 [nvidia_fs]
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: Code: f0 48 29 43 28 5b 41 5c 41 5d 5d e9 ee 6d bf f5 e9 e9 6d bf f5 f0 48 29 43 20 5b 41 5c 41 5d 5d e9 d9 6d bf f5 e8 c4 a6 d5 f4 <0f> 0b 66 90 0f 1f 44 00 00 48 8b 07 48 85 c0 0f 84 41 01 00 00 48
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: RSP: 0018:ffffb21e2c1e3d70 EFLAGS: 00010246
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: RAX: 99474efd828c7361 RBX: 0000000000000000 RCX: 0000000000000000
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: RDX: 000000000000000e RSI: ffff99c98f620580 RDI: ffff996a40dcb440
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: RBP: ffffb21e2c1e3d88 R08: 0000000000000000 R09: 6c62617420656761
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: R10: 3030303032303030 R11: 323030307830206e R12: ffff996a40dcb440
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: R13: ffff996a40dcb5a0 R14: ffff996a40dcb5a0 R15: 00007fdb2c000000
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: FS: 00007fdb85711000(0000) GS:ffff99c98f600000(0000) knlGS:0000000000000000
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: CR2: 00007fdb939069e0 CR3: 000000808f756000 CR4: 0000000000350ee0
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: Call Trace:
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel:
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: nvfs_unpin_gpu_pages+0xeb/0x130 [nvidia_fs]
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: nvfs_ioctl.cold+0x2a6/0xa16 [nvidia_fs]
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: __x64_sys_ioctl+0x95/0xd0
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: x64_sys_call+0x1e5f/0x1fa0
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: do_syscall_64+0x56/0xb0
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: ? srso_return_thunk+0x5/0x10
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: ? arch_exit_to_user_mode_prepare.constprop.0+0x1e/0xc0
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: ? srso_return_thunk+0x5/0x10
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: ? syscall_exit_to_user_mode+0x41/0x80
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: ? srso_return_thunk+0x5/0x10
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: ? do_syscall_64+0x63/0xb0
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: entry_SYSCALL_64_after_hwframe+0x6c/0xd6
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: RIP: 0033:0x7fdb931639bf
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: RSP: 002b:00007fdb8570bf50 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: RAX: ffffffffffffffda RBX: 00007fdb994fdce0 RCX: 00007fdb931639bf
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: RDX: 00007fdb8570c010 RSI: 0000000040047403 RDI: 0000000000000015
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: RBP: 00007fdb2c000000 R08: 00007fdb99454710 R09: 0000000000000000
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 00007fdb99462d60
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: R13: 0000000000000000 R14: 0000000010000000 R15: 00007fdb78002090
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel:
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: Modules linked in: mst_pciconf(OE) rpcsec_gss_krb5 nfsv4 nfs fscache netfs nvme_fabrics(OE) veth xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xt_addrtype cuse overlay rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) nft_counter nft_compat nf_tables nfnetlink intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd sch_fq_codel binfmt_misc nvidia_uvm(PO) kvm_amd ib_uverbs(OE) kvm nvidia_drm(PO) nvidia_modeset(PO) rapl nvidia_peermem(PO) ipmi_ssif nls_iso8859_1 mlx5_core(OE) mlxdevm(OE) psample mlxfw(OE) acpi_ipmi joydev input_leds ccp tls k10temp pci_hyperv_intf ptdma ipmi_si mac_hid nvidia(PO) dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ib_core(OE) nvidia_fs(O) knem(OE) br_netfilter bridge stp llc ipmi_devintf ipmi_msghandler msr efi_pstore nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear drm_vram_helper drm_ttm_helper hid_generic ttm cdc_ether crct10dif_pclmul ses usbhid uas enclosure crc32_pclmul drm_kms_helper usbnet ghash_clmulni_intel hid usb_storage mii raid1 sha256_ssse3 sha1_ssse3 syscopyarea aesni_intel sysfillrect crypto_simd sysimgblt fb_sys_fops cryptd cec ixgbe nvme(OE) rc_core igb mpt3sas xfrm_algo nvme_core(OE) raid_class dca i2c_algo_bit xhci_pci drm mdio scsi_transport_sas i2c_piix4 mlx_compat(OE) xhci_pci_renesas [last unloaded: mst_pci]
Nov 04 16:37:48 dgx01.<dgx_url_removed> kernel: ---[ end trace 406c682961d9c3ba ]---
Terminal output when I run gdsio:
Message from syslogd@dgx01 at Nov 4 16:40:21 ...
kernel:[ 1032.355747] watchdog: BUG: soft lockup - CPU#72 stuck for 82s! [gdsio:43568]
(and repeating)
I hope to get some support either here or through the GitHub issue. I’m happy to provide any additional information if needed. Thanks!
The issue happens because of a P2P page-table version change in the NVIDIA R580.xx driver; nvidia-fs is not rebuilt against the newer symbols, which results in the crash.
Please rebuild the nvidia-fs kernel module using the DKMS package nvidia-fs-dkms:
Reboot the node, then unload the stale module:
sudo rmmod nvidia_fs
# cat > /etc/dkms/nvidia-fs.conf << 'EOF'
BUILD_DEPENDS=nvidia
BUILD_DEPENDS_REBUILD=yes
EOF
apt install -y --reinstall nvidia-fs-dkms
Make sure the driver actually gets rebuilt during the reinstall.
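To confirm the rebuild picked up the new driver symbols, a minimal sanity check (assuming the standard DGX OS paths; adjust names if yours differ):
$ dkms status nvidia-fs                          # should show "installed" for the running kernel
$ modinfo nvidia_fs | grep ^version              # version of the rebuilt module
$ cat /proc/driver/nvidia/version                # running GPU driver build
$ sudo modprobe nvidia_fs && sudo dmesg | tail   # should load with no "Incompatible page table version" errors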
Thanks for the quick response! I will mirror this to the GitHub issue and give it a try tomorrow.
Just to clarify: this issue is unrelated to whether the `--dkms` flag is enabled in mlnxofedinstall, right?
No, it is not related to the `--dkms` flag in mlnxofedinstall.
@kmodukuri Thanks for the clarification!
We did a quick check today with your guide above, and here’s the failure output we observed. We really appreciate your continued guidance and support on this!
Preparing to unpack .../nvidia-fs-dkms_2.17.3-1_amd64.deb ...
Unpacking nvidia-fs-dkms (2.17.3-1) ...
Setting up nvidia-fs-dkms (2.17.3-1) ...
Deprecated feature: CLEAN (/usr/src/nvidia-fs-2.17.3/dkms.conf)
Deprecated feature: REMAKE_INITRD (/usr/src/nvidia-fs-2.17.3/dkms.conf)
Deprecated feature: CLEAN (/etc/dkms/nvidia-fs.conf)
Deprecated feature: REMAKE_INITRD (/etc/dkms/nvidia-fs.conf)
Creating symlink /var/lib/dkms/nvidia-fs/2.17.3/source -> /usr/src/nvidia-fs-2.17.3
Deprecated feature: CLEAN (/var/lib/dkms/nvidia-fs/2.17.3/source/dkms.conf)
Deprecated feature: REMAKE_INITRD (/var/lib/dkms/nvidia-fs/2.17.3/source/dkms.conf)
Deprecated feature: CLEAN (/etc/dkms/nvidia-fs.conf)
Deprecated feature: REMAKE_INITRD (/etc/dkms/nvidia-fs.conf)
Sign command: /usr/bin/kmodsign
Signing key: /var/lib/shim-signed/mok/MOK.priv
Public certificate (MOK): /var/lib/shim-signed/mok/MOK.der
Error! Aborting build of module nvidia-fs/2.17.3 for kernel 5.15.0-1091-nvidia (x86_64) due to missing BUILD_DEPENDS: nvidia.
You may override by specifying --force.
modprobe: FATAL: Module nvidia-fs not found in directory /lib/modules/5.15.0-1091-nvidia
*** Reboot your computer and verify that the NVIDIA filesystem driver ***
*** can be loaded.***
We're currently running the optimized NVIDIA kernel with the proprietary GPU driver, and we've pinned the nvidia-fs (cuFile) version to 2.17, per the DGX OS 6 guide:
As far as I understand, it shouldn't require any DKMS modules, but please correct me if I'm wrong. I'm wondering if there's a mismatch between the cuFile/GDS/nvidia-fs versions and our DGX OS kernel + proprietary GPU driver version.
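For completeness, here is how we're reading off the installed versions on our side (plain dpkg/modinfo/nvidia-smi queries, nothing DGX-specific assumed):
$ dpkg -l | grep -E 'nvidia-fs|cufile'                           # packaged nvidia-fs / libcufile versions
$ modinfo nvidia_fs | grep ^version                              # kernel module version on disk
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader    # running GPU driver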
—
In the GitHub issue I also appended the full output of gdscheck, taken before rebuilding the nvidia-fs kernel module:
$ /usr/local/cuda-13.0/gds/tools/gdscheck -p
GDS release version: 1.15.1.6
nvidia_fs version: 2.17 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
CUFILE_ENV_PATH_JSON : /home/USER/cufile.json
=====================
DRIVER CONFIGURATION:
=====================
NVMe P2PDMA : Unsupported
NVMe : Supported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
ScaTeFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Enabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_pci_p2pdma : false
properties.use_compat_mode : false
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : true
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 1048576
properties.per_buffer_cache_size_kb : 1024
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 64
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.scatefs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
fs.gpfs.gds_async_support: true
profile.nvtx : false
profile.cufile_stats : 1
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 0
execution.max_io_queue_depth : 128
execution.parallel_io : false
execution.min_io_threshold_size_kb : 1024
execution.max_request_parallelism : 0
properties.force_odirect_mode : false
properties.prefer_iouring : false
=========
GPU INFO:
=========
GPU index 0 NVIDIA A100-SXM4-40GB bar:1 bar size (MiB):65536 supports GDS, IOMMU State: Pass-through or Enabled
==============
PLATFORM INFO:
==============
IOMMU: Pass-through or enabled
Nvidia Driver Info Status: Supported only on (nvidia-fs version <= 2.17.4)
Cuda Driver Version Installed: 13000
Platform: DGXA100 920-23687-2530-000, Arch: x86_64(Linux 5.15.0-1091-nvidia)
Platform verification succeeded
After rebuilding, the module can no longer be loaded:
$ sudo modprobe nvidia-fs.ko
modprobe: FATAL: Module nvidia-fs.ko not found in directory /lib/modules/5.15.0-1091-nvidia
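For the record, modprobe expects the bare module name without the .ko suffix, and it only searches the running kernel's module tree, so a better check would be:
$ sudo modprobe nvidia_fs                             # bare module name, no .ko suffix
$ find /lib/modules/$(uname -r) -name 'nvidia-fs*'    # is the .ko installed at all?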
Hi @kmodukuri ,
I still haven’t been able to proceed. The main issue is a contradiction regarding DKMS.
Our DGX A100 running DGX OS 6 doesn't have any DKMS modules registered:
$ dkms status                            # empty
$ ls -l /var/lib/dkms | grep -i nvidia   # empty
$ apt list --installed | grep dkms
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
dkms/unknown,now 1:3.2.1-1ubuntu2 all [installed]
$ uname -r
5.15.0-1091-nvidia
Now, regarding the nvidia-fs driver:
So this is the core contradiction I've identified, and the dilemma is as follows:
Do you have any suggestions on how to resolve this contradiction? From your message, I gather that the P2P change may have caused the package manager to install an incompatible nvidia-fs driver. I'm wondering if it's possible to rebuild the driver without relying on DKMS.
(However, we're not using the generic kernel, so installing DKMS shouldn't be necessary in our setup either. Or is it? See Managing and Upgrading Software — NVIDIA DGX OS 6 User Guide.)
$ git clone https://github.com/NVIDIA/gds-nvidia-fs.git
$ cd gds-nvidia-fs/src
$ export CONFIG_MOFED_VERSION=$(ofed_info -s | cut -d '-' -f 2)
$ sudo make
$ sudo insmod nvidia-fs.ko
This will build the kernel module, but it will not be registered with DKMS.
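If insmod complains about unknown symbols, the nvidia module (which exports the nv-p2p symbols) needs to be loaded first. And as a sketch, assuming the conventional updates/dkms path, you can install the manually built .ko so that modprobe finds it by name:
$ sudo modprobe nvidia                                  # exports the nv-p2p symbols nvidia-fs links against
$ sudo mkdir -p /lib/modules/$(uname -r)/updates/dkms
$ sudo cp nvidia-fs.ko /lib/modules/$(uname -r)/updates/dkms/
$ sudo depmod -a
$ sudo modprobe nvidia_fs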
Can you check if you already have the NVIDIA header files at
/usr/src/nvidia-580.65.06/nvidia/nv-p2p.h
If not present, install the headers:
$ sudo apt install nvidia-kernel-source-580-open -y
Then install the nvidia-fs DKMS module:
$ sudo dkms install --force -m nvidia-fs/2.17.3 -k 5.15.0-1091-nvidia
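Then a quick check that the module built for the running kernel and loads cleanly (standard dkms/modprobe checks):
$ dkms status nvidia-fs
$ sudo modprobe nvidia_fs
$ dmesg | tail    # should show no "Incompatible page table version" errors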
@kmodukuri
Thanks. I think you're suggesting that I build the kernel driver from the GitHub repo (GitHub - NVIDIA/gds-nvidia-fs: NVIDIA GPUDirect Storage Driver)? But I couldn't find version 2.17.3 in any of the branches or releases, so I didn't proceed with the build, mainly because I'm unsure which version is correct.
Given our setup (DGX A100 running DGX OS 6 with the optimized NVIDIA kernel 5.15.0-1091-nvidia and the proprietary GPU driver 580.65.06), I'm wondering which branch of nvidia-fs I should actually build. (I'm fine with building from a different branch, but for newer GDS releases, wouldn't the driver need to be switched to the open-source one?)
—
The header is present, but in a different folder:
$ ls /usr/src/nvidia-srv-580.95.05/nvidia/nv-p2p.h
/usr/src/nvidia-srv-580.95.05/nvidia/nv-p2p.h
We have `nvidia-kernel-source-580-server` installed instead of the one you listed. And since we're on the proprietary GPU driver on DGX OS 6, I'm not sure the open variant fits here:
$ apt list | grep nvidia-kernel-source-580
nvidia-kernel-source-580-open/unknown 580.95.05-0ubuntu1 amd64
nvidia-kernel-source-580-server-open/jammy-updates,jammy-security 580.95.05-0ubuntu0.22.04.2 amd64
nvidia-kernel-source-580-server/jammy-updates,jammy-security,now 580.95.05-0ubuntu0.22.04.2 amd64 [installed,automatic]
nvidia-kernel-source-580/unknown 580.95.05-0ubuntu1 amd64
$ export NVIDIA_DRV_VERSION=580
$ sudo apt install nvidia-dkms-${NVIDIA_DRV_VERSION}-server --dry-run
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
nvidia-dkms-580-server
0 upgraded, 1 newly installed, 0 to remove and 39 not upgraded.
Inst nvidia-dkms-580-server (580.95.05-0ubuntu0.22.04.2 Ubuntu:22.04/jammy-updates, Ubuntu:22.04/jammy-security [amd64])
Conf nvidia-dkms-580-server (580.95.05-0ubuntu0.22.04.2 Ubuntu:22.04/jammy-updates, Ubuntu:22.04/jammy-security [amd64])