GPUDirect RDMA - Module cannot be inserted into kernel

This is a follow-up to the topic “PCIe DMA driver can not be loaded”.
I did a fresh install of JetPack 5.0.2 on the Jetson Orin.
The file /etc/nv_tegra_release has the following content: # R35 (release), REVISION: 1.0, GCID: 31346300, BOARD: t186ref, EABI: aarch64, DATE: Thu Aug 25 18:41:45 UTC 2022.
I built my custom kernel module, which uses direct DMA transfers from the PCIe card into the memory space of the GPU (GPUDirect RDMA).
However, the module cannot be inserted; the kernel reports the following errors:

[  473.515741] my_dma: module verification failed: signature and/or required key missing - tainting kernel
[  473.525670] my_dma: disagrees about version of symbol nvidia_p2p_dma_unmap_pages
[  473.533324] my_dma: Unknown symbol nvidia_p2p_dma_unmap_pages (err -22)
[  473.540224] my_dma: disagrees about version of symbol nvidia_p2p_get_pages
[  473.547323] my_dma: Unknown symbol nvidia_p2p_get_pages (err -22)
[  473.553652] my_dma: disagrees about version of symbol nvidia_p2p_put_pages
[  473.560750] my_dma: Unknown symbol nvidia_p2p_put_pages (err -22)
[  473.567050] my_dma: disagrees about version of symbol nvidia_p2p_dma_map_pages
[  473.574510] my_dma: Unknown symbol nvidia_p2p_dma_map_pages (err -22)
[  473.581172] my_dma: disagrees about version of symbol nvidia_p2p_free_page_table
[  473.588813] my_dma: Unknown symbol nvidia_p2p_free_page_table (err -22)
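For anyone hitting the same log, here is a minimal sketch for pulling the affected symbol names out of the kernel log. The helper name `mismatched_symbols` is hypothetical; it only matches the "disagrees about version of symbol" wording shown above.

```shell
# Print each symbol the kernel rejected because of a version (CRC) mismatch.
# Reads kernel log text on stdin and prints one unique symbol name per line.
mismatched_symbols() {
    sed -n 's/.*disagrees about version of symbol \([A-Za-z0-9_]*\).*/\1/p' | sort -u
}

# Typical use on the device (not run here):
#   sudo dmesg | mismatched_symbols
```

The resulting list is a convenient starting point for checking which module exports each of these symbols.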

The very same errors are produced when I try to insert the example kernel module from GitHub - NVIDIA/jetson-rdma-picoevb: Minimal HW-based demo of GPUDirect RDMA on NVIDIA Jetson AGX Xavier running L4T

As @vandev noticed in the other topic, the header files of the toolchain on the device did not match the header files in public_sources.tbz2 for JetPack DP 5.0.1. This is no longer the case for JetPack 5.0.2, but it is still not possible to load the kernel modules.

@kayccc, as you mentioned in the other topic, we should open a new topic if the issue is still present in JetPack 5.0.2, and it is.

I have the same issue when trying to adopt gdrcopy (GitHub - NVIDIA/gdrcopy: A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology).

$ sudo insmod gdrdrv.ko
insmod: ERROR: could not insert module gdrdrv.ko: Invalid parameters

...
[ 3963.739146] gdrdrv: disagrees about version of symbol nvidia_p2p_get_pages
[ 3963.739384] gdrdrv: Unknown symbol nvidia_p2p_get_pages (err -22)
[ 3963.739593] gdrdrv: disagrees about version of symbol nvidia_p2p_put_pages
[ 3963.739808] gdrdrv: Unknown symbol nvidia_p2p_put_pages (err -22)
[ 3963.740025] gdrdrv: disagrees about version of symbol nvidia_p2p_free_page_table
[ 3963.740254] gdrdrv: Unknown symbol nvidia_p2p_free_page_table (err -22)

Now I have also tried downgrading to R34.1.1 and JetPack 5.0.1.
It has exactly the same problem.

Hi,

How did you rebuild this .ko file? Did you use the same toolchain as the original kernel?

I used the toolchain on the Jetson.
I have successfully built other kernel modules this way, but they did not depend on other modules…
So in this case, must I cross-compile with the Bootlin Toolchain gcc 9.3, since this module depends on a built-in module?

I now did the following:
I flashed the 35.1 release and installed JetPack.

I compiled the kernel on my PC as described here:
https://docs.nvidia.com/jetson/archives/r35.1/DeveloperGuide/text/SD/Kernel/KernelCustomization.html
with the Driver Package (BSP) Sources and Bootlin Toolchain gcc 9.3 from:
https://developer.nvidia.com/embedded/jetson-linux

I successfully cross-compiled my kernel module against the built kernel with:

export CROSS_COMPILE_AARCH64_PATH=~/jetson/l4t-gcc/
export CROSS_COMPILE_AARCH64=~/jetson/l4t-gcc/bin/aarch64-buildroot-linux-gnu-
export TEGRA_KERNEL_DIR=~/jetson/kernel/35.1/Linux_for_Tegra/source/public/kernel/
export CROSS_COMPILE=~/jetson/l4t-gcc/bin/aarch64-buildroot-linux-gnu-

make ARCH=arm64 -C $TEGRA_KERNEL_DIR../kernel_out M=$PWD

But I still get the “disagrees about version of symbol” error on the Jetson AGX Orin :-(

What am I missing?
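One way to narrow this down is to compare the CRC the module was built against with the CRC recorded by whatever exports the symbol. The sketch below assumes the kernel uses modversions and that both inputs list `<crc> <symbol>` in their first two columns, as Module.symvers and the output of `modprobe --dump-modversions` do. The helper names `crc_of` and `compare_crc` are hypothetical.

```shell
# Look up the CRC recorded for one symbol.
# Works on any text whose first two whitespace-separated columns are
# "<crc> <symbol>", which is the layout of Module.symvers and of the
# output of `modprobe --dump-modversions <module>.ko`.
crc_of() {
    # $1 = symbol name, $2 = file to search
    awk -v sym="$1" '$2 == sym { print $1; exit }' "$2"
}

# Compare the CRC a module expects with the CRC an exporter provides.
# $1 = symbol, $2 = dump of the module's versions, $3 = exporter's Module.symvers
compare_crc() {
    want=$(crc_of "$1" "$2")
    have=$(crc_of "$1" "$3")
    if [ -n "$want" ] && [ "$want" = "$have" ]; then
        echo "$1: match ($want)"
    else
        echo "$1: MISMATCH (module wants ${want:-none}, exporter has ${have:-none})"
    fi
}

# Typical use (not run here; the Module.symvers location depends on your build):
#   modprobe --dump-modversions my_dma.ko > my_dma.versions
#   compare_crc nvidia_p2p_get_pages my_dma.versions path/to/Module.symvers
```

A mismatch here would suggest the module was built against a different definition of the symbol than the one loaded on the device, independent of which toolchain was used.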

Hi,

I think this driver has not been validated on JetPack 5 before, and its dependency has a problem too.

For example, nvidia_p2p_get_pages does not seem to really exist.

Some more observations. One problem seems to be that nvidia-p2p is not loaded. Trying to load this module manually fails with “exports duplicate symbol”, with the symbol owned by the nvidia module. As an experiment, I unloaded the nvidia module. It is used by the graphical system, so that must be disabled first.

sudo systemctl set-default multi-user.target
sudo reboot
*LOGIN AFTER REBOOT*
sudo modprobe -r nvidia
sudo modprobe nvidia-p2p
sudo insmod gdrdrv.ko

And hey, I can load my module! Even the module built locally on the Jetson can be loaded. I have no means to actually verify p2p functionality at this stage.

WARNING! Doing this seems to kill the DisplayPort output, and you can only access the device over ssh, even after a reboot!
You can restore the DisplayPort output with:

sudo systemctl set-default graphical.target
sudo reboot
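To check which loaded module currently owns the nvidia_p2p_* symbols before and after swapping modules, something like the following sketch may help. It assumes the usual /proc/kallsyms layout (`<address> <type> <name>` with an optional `[module]` column); the helper name `symbol_owner` is hypothetical.

```shell
# Report which module owns a symbol, given /proc/kallsyms-style text on stdin.
# Each line is "<address> <type> <name>" with an optional "[module]" column;
# symbols without that column belong to the kernel image itself.
symbol_owner() {
    # $1 = symbol name to look up
    awk -v sym="$1" '
        $3 == sym {
            if (NF >= 4) print substr($4, 2, length($4) - 2)  # strip the [ ]
            else         print "(built-in)"
        }'
}

# Typical use on the device (not run here):
#   sudo cat /proc/kallsyms | symbol_owner nvidia_p2p_get_pages
```

If the owner is reported as nvidia rather than nvidia-p2p, that would be consistent with the duplicate-symbol message described above.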

Hi,

We are checking this issue with our internal team.
Will share more information with you later.

Thanks

For jetson-rdma-picoevb, how are you compiling the kernel module? I mean, for iGPU or dGPU?