DGX Spark GPUDirect RDMA

My dmesg was also clean, and modinfo for my peermem module looks the same. The interesting part is the depends: line. It is empty, whereas on a normal system with a working peermem it looks like this:

depends: nvidia,ib_uverbs
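For anyone who wants to compare their own system, here is a minimal Python sketch that pulls the depends: field out of modinfo output. The function names are my own, and it assumes the usual "field: value" layout that modinfo prints.

```python
import subprocess


def parse_depends(modinfo_text):
    """Return the module names listed on the 'depends:' line of modinfo output."""
    for line in modinfo_text.splitlines():
        field, sep, value = line.partition(":")
        if sep and field.strip() == "depends":
            # An empty value (as on the Spark) yields an empty list.
            return [name for name in value.strip().split(",") if name]
    return []


def read_modinfo(module):
    """Run modinfo for a module name or .ko path (assumes modinfo is on PATH)."""
    result = subprocess.run(
        ["modinfo", module], capture_output=True, text=True, check=True
    )
    return result.stdout


# Illustrative inputs: the first mirrors a normal system, the second the Spark.
print(parse_depends("depends:        nvidia,ib_uverbs"))
print(parse_depends("depends:        "))
```

On a live system, parse_depends(read_modinfo("nvidia-peermem")) would run the same check against the installed module.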

This led me to investigate the module further, as I don’t understand why the Spark module wouldn’t also depend on the nvidia and ib_uverbs modules. So, I ran the following:

objdump -d /lib/modules/6.11.0-1016-nvidia/kernel/nvidia-580-open/nvidia-peermem.ko 

And got:

/lib/modules/6.11.0-1016-nvidia/kernel/nvidia-580-open/nvidia-peermem.ko:     file format elf64-littleaarch64


Disassembly of section .init.text:

0000000000000000 <init_module-0x8>:
   0:	d503201f 	nop
   4:	d503201f 	nop

0000000000000008 <init_module>:
   8:	d503201f 	nop
   c:	d503201f 	nop
  10:	128002a0 	mov	w0, #0xffffffea            	// #-22
  14:	d65f03c0 	ret

Disassembly of section .exit.text:

0000000000000000 <cleanup_module>:
   0:	d65f03c0 	ret

Disassembly of section .plt:

0000000000000000 <.plt>:
	...

Disassembly of section .text.ftrace_trampoline:

0000000000000000 <.text.ftrace_trampoline>:
	...

This is why modprobe reports ‘invalid argument’: the module is in essence empty, and its init_module does nothing but return -22, which is -EINVAL. However, the code for the peermem module does seem to be present on the Spark in the /usr/src/nvidia-580.95.05/nvidia-peermem folder. Looking through the code, it seems the module expects NV_MLNX_IB_PEER_MEM_SYMBOLS_PRESENT to be defined; otherwise the module’s init function just returns -EINVAL, which becomes the ‘invalid argument’ we see when running modprobe. See here:

https://github.com/NVIDIA/open-gpu-kernel-modules/blob/2b436058a616676ec888ef3814d1db6b2220f2eb/kernel-open/nvidia-peermem/nvidia-peermem.c#L641
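To double-check that the mov w0, #-22 in the disassembly really is that EINVAL return, the instruction word 128002a0 can be decoded by hand. This is a rough Python sketch of the AArch64 MOVN (wide immediate) encoding, covering only the 32-bit form; the function name is made up for illustration.

```python
import errno


def decode_movn_w(word):
    """Decode a 32-bit AArch64 MOVN instruction word.

    Returns (destination register number, signed 32-bit result).
    """
    # Bits 31..23 must be sf=0, opc=00, 100101 for the 32-bit MOVN form.
    if (word >> 23) & 0x1FF != 0b000100101:
        raise ValueError("not a 32-bit MOVN instruction")
    hw = (word >> 21) & 0x3        # shift amount in units of 16 bits
    if hw > 1:
        raise ValueError("invalid shift for the 32-bit form")
    imm16 = (word >> 5) & 0xFFFF   # 16-bit immediate
    rd = word & 0x1F               # destination register
    value = ~(imm16 << (16 * hw)) & 0xFFFFFFFF  # MOVN writes the bitwise NOT
    if value & 0x80000000:
        value -= 1 << 32           # reinterpret as signed 32-bit
    return rd, value


rd, value = decode_movn_w(0x128002A0)
print(f"w{rd} = {value} -> -{errno.errorcode[-value]}")
```

The immediate field is 0x15 (21), and its bitwise NOT is 0xffffffea, i.e. -22, which matches what objdump prints.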

Also, NV_MLNX_IB_PEER_MEM_SYMBOLS_PRESENT seems to get defined here when building the module:

https://github.com/NVIDIA/open-gpu-kernel-modules/blob/2b436058a616676ec888ef3814d1db6b2220f2eb/kernel-open/conftest.sh#L3277

I assume it was not defined when the Spark’s module was built, because there is no /usr/src/ofa_kernel folder or DKMS source for OFED on the Spark (or, presumably, on whatever machine the module was built on). That is why I think the module is empty and does not load.
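If I am reading the linked conftest.sh correctly, the define is only emitted when ib_register_peer_memory_client and ib_unregister_peer_memory_client are found in a Module.symvers file (the kernel's own or the one OFED installs). Here is a minimal sketch of that kind of check, in Python rather than the shell the real script uses; the function name and the sample lines are hypothetical.

```python
PEER_MEM_SYMBOLS = (
    "ib_register_peer_memory_client",
    "ib_unregister_peer_memory_client",
)


def has_peer_memory_symbols(symvers_text):
    """True if every peer-memory symbol is an exported name in Module.symvers text."""
    exported = set()
    for line in symvers_text.splitlines():
        # Module.symvers lines are tab-separated: CRC, symbol, module, export type, ...
        fields = line.split("\t")
        if len(fields) >= 2:
            exported.add(fields[1])
    return all(symbol in exported for symbol in PEER_MEM_SYMBOLS)


# Made-up sample lines, not taken from a real system.
with_ofed = (
    "0x00000000\tib_register_peer_memory_client\tdrivers/infiniband/core/ib_core\tEXPORT_SYMBOL\t\n"
    "0x00000000\tib_unregister_peer_memory_client\tdrivers/infiniband/core/ib_core\tEXPORT_SYMBOL\t\n"
)
without_ofed = (
    "0x00000000\tib_register_device\tdrivers/infiniband/core/ib_core\tEXPORT_SYMBOL\t\n"
)

print(has_peer_memory_symbols(with_ofed))
print(has_peer_memory_symbols(without_ofed))
```

Running this over the text of whichever Module.symvers the Spark build used would show whether those symbols were visible at build time.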

So maybe the DOCA-OFED drivers need to be installed, and the module needs to be rebuilt or replaced? However, the DGX OS user guide specifically states that the Spark doesn’t require DOCA-OFED.

So, I’m not really sure what is going on. Maybe someone from NVIDIA could respond and say whether GPUDirect RDMA/peermem should work, will be supported at some point, isn’t intended to work, or whether some other method should be used as an alternative.