Jetson Orin Developer Kit - RDMA not working

b190033 · December 10, 2024, 3:27pm

I received a Jetson Orin Developer Kit on which someone had already installed JetPack and CUDA.

$ sudo apt-cache show nvidia-jetpack
Package: nvidia-jetpack
Version: 5.1-b147
Architecture: arm64
Maintainer: NVIDIA Corporation
Installed-Size: 194
Depends: nvidia-jetpack-runtime (= 5.1-b147), nvidia-jetpack-dev (= 5.1-b147)
Homepage: http://developer.nvidia.com/jetson

$ uname -a
Linux eol-agx 5.10.104-tegra #1 SMP PREEMPT Tue Jan 24 15:09:44 PST 2023 aarch64 aarch64 aarch64 GNU/Linux

Our code was working well, so I didn’t want to change anything until I started getting “Orin does not support RDMA” errors. I first built jetson-rdma-picoevb from the GitHub repository NVIDIA/jetson-rdma-picoevb (a minimal HW-based demo of GPUDirect RDMA on NVIDIA Jetson AGX Xavier running L4T). Specifically:

$ sudo apt install build-essential bc
$ cd jetson-rdma-picoevb/kernel-module/
$ ./build-for-jetson-drive-igpu-native.sh
$ sudo insmod /lib/modules/5.10.120-tegra/kernel/drivers/nv-p2p/nvidia-p2p.ko
$ sudo insmod ./picoevb-rdma.ko

I get this error at insmod nvidia-p2p.ko:

insmod: ERROR: could not insert module /lib/modules/5.10.104-tegra/kernel/drivers/nv-p2p/nvidia-p2p.ko: Invalid module format
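
One common cause of “Invalid module format” is a module built for a different kernel release than the one running. Note that the insmod path above names 5.10.120-tegra while the error names 5.10.104-tegra. A minimal sketch of that comparison is below; the helper name vermagic_matches is hypothetical and the sample strings are illustrative, not real modinfo output.

```shell
# Hypothetical helper: succeed when a module's vermagic string names the
# given kernel release. The release is the first space-separated field.
vermagic_matches() {
    built_for=${1%% *}
    [ "$built_for" = "$2" ]
}

# On the target the two inputs would come from:
#   modinfo -F vermagic <path-to-module>.ko
#   uname -r
# The strings below are illustrative samples only.
sample_vermagic="5.10.120-tegra SMP preempt mod_unload aarch64"
sample_release="5.10.104-tegra"

if vermagic_matches "$sample_vermagic" "$sample_release"; then
    echo "vermagic matches the kernel release"
else
    echo "vermagic mismatch: built for ${sample_vermagic%% *}, running $sample_release"
fi
```

A mismatch here is only one possible cause; the dmesg output is still the place to look for the actual reason.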

I followed the “solution” too:

Following that, I changed all occurrences of

nvidia_p2p_cap_persistent_pages
nvidia_p2p_init_mapping
nvidia_p2p_destroy_mapping
nvidia_p2p_get_pages
nvidia_p2p_free_page_table
nvidia_p2p_put_pages
nvidia_p2p_dma_map_pages
nvidia_p2p_dma_unmap_pages
nvidia_p2p_free_dma_mapping
nvidia_p2p_register_rsync_driver
nvidia_p2p_unregister_rsync_driver
nvidia_p2p_get_rsync_registers
nvidia_p2p_put_rsync_registers

to

nvidia_p2p_cap_persistent_pages_old
nvidia_p2p_init_mapping_old
nvidia_p2p_destroy_mapping_old
nvidia_p2p_get_pages_old
nvidia_p2p_free_page_table_old
nvidia_p2p_put_pages_old
nvidia_p2p_dma_map_pages_old
nvidia_p2p_dma_unmap_pages_old
nvidia_p2p_free_dma_mapping_old
nvidia_p2p_register_rsync_driver_old
nvidia_p2p_unregister_rsync_driver_old
nvidia_p2p_get_rsync_registers_old
nvidia_p2p_put_rsync_registers_old

respectively in nv-p2p.c and nv-p2p.h. (My doubt is about the first one, nvidia_p2p_cap_persistent_pages: should it be renamed at all, given that it is an int value rather than a function?)
I ran make and replaced the /lib/modules/5.10.104-tegra/extra/opensrc-disp/nvidia.ko file.
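
A bulk rename like this is easy to get subtly wrong by hand, so here is a minimal sketch of how it could be scripted. It assumes GNU sed (for the \< and \> word-boundary anchors); the function name rename_p2p_symbols is hypothetical, and it only filters text from stdin to stdout, so no source file is modified.

```shell
# Hypothetical helper (assumes GNU sed): append _old to the thirteen
# nvidia_p2p_* identifiers listed above. Whole identifiers only, so names
# that already end in _old, or longer names sharing a prefix, are untouched.
rename_p2p_symbols() {
    names='cap_persistent_pages|init_mapping|destroy_mapping|get_pages'
    names="$names|free_page_table|put_pages|dma_map_pages|dma_unmap_pages"
    names="$names|free_dma_mapping|register_rsync_driver"
    names="$names|unregister_rsync_driver|get_rsync_registers|put_rsync_registers"
    sed -E "s/\\<nvidia_p2p_($names)\\>/&_old/g"
}

# Illustrative input line, not copied from nv-p2p.c.
printf '%s\n' 'EXPORT_SYMBOL(nvidia_p2p_get_pages);' | rename_p2p_symbols
```

Running it twice on the same text leaves the result unchanged, which makes it safer to re-apply than a plain search and replace.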

After that, I tried to insmod nvidia-p2p.ko again, but I get the same “Invalid module format” error, and dmesg shows:


[ 4468.458651] nvidia_p2p: exports duplicate symbol nvidia_p2p_dma_map_pages (owned by nvidia)
[10767.592543] nvidia_p2p: exports duplicate symbol nvidia_p2p_dma_map_pages (owned by nvidia)
[10781.789977] nvidia_p2p: exports duplicate symbol nvidia_p2p_dma_map_pages (owned by nvidia)
[11219.215248] nvidia_p2p: exports duplicate symbol nvidia_p2p_dma_map_pages (owned by nvidia)
[11651.348943] nvidia_p2p: exports duplicate symbol nvidia_p2p_dma_map_pages (owned by nvidia)
[12790.267525] picoevb_rdma: module verification failed: signature and/or required key missing - tainting kernel
[12790.277973] picoevb_rdma: disagrees about version of symbol nvidia_p2p_dma_unmap_pages
[12790.286175] picoevb_rdma: Unknown symbol nvidia_p2p_dma_unmap_pages (err -22)
[12790.293600] picoevb_rdma: disagrees about version of symbol nvidia_p2p_get_pages
[12790.301225] picoevb_rdma: Unknown symbol nvidia_p2p_get_pages (err -22)
[12790.308057] picoevb_rdma: disagrees about version of symbol nvidia_p2p_put_pages
[12790.315675] picoevb_rdma: Unknown symbol nvidia_p2p_put_pages (err -22)
[12790.322505] picoevb_rdma: disagrees about version of symbol nvidia_p2p_dma_map_pages
[12790.330477] picoevb_rdma: Unknown symbol nvidia_p2p_dma_map_pages (err -22)
[12790.337671] picoevb_rdma: disagrees about version of symbol nvidia_p2p_free_page_table
[12790.345828] picoevb_rdma: Unknown symbol nvidia_p2p_free_page_table (err -22)
[13270.242122] nvidia_p2p: exports duplicate symbol nvidia_p2p_dma_map_pages (owned by nvidia)
[13460.444084] nvidia_p2p: exports duplicate symbol nvidia_p2p_dma_map_pages (owned by nvidia)
[13622.548524] nvidia_p2p: exports duplicate symbol nvidia_p2p_dma_map_pages (owned by nvidia)
[13842.213783] nvidia_p2p: exports duplicate symbol nvidia_p2p_dma_map_pages (owned by nvidia)
[13887.854074] nvidia_p2p: exports duplicate symbol nvidia_p2p_dma_map_pages (owned by nvidia)
[14359.759132] nvidia_p2p: exports duplicate symbol nvidia_p2p_dma_map_pages (owned by nvidia)
[16656.209245] nvidia_p2p: exports duplicate symbol nvidia_p2p_dma_map_pages (owned by nvidia)
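
With this many repeated lines it can help to condense the log. The following is a minimal sketch that summarizes “exports duplicate symbol” messages as module, symbol, current owner, and count; the function name summarize_duplicates is hypothetical, and the two sample lines are copied from the log above.

```shell
# Hypothetical helper: read dmesg-style text on stdin and print one line per
# distinct (module, symbol, owner) triple, followed by how often it occurred.
summarize_duplicates() {
    sed -n 's/.*\] \([^:]*\): exports duplicate symbol \([^ ]*\) (owned by \([^)]*\)).*/\1 \2 \3/p' |
        sort | uniq -c | awk '{print $2, $3, $4, $1}'
}

# On the target: sudo dmesg | summarize_duplicates
summarize_duplicates <<'EOF'
[ 4468.458651] nvidia_p2p: exports duplicate symbol nvidia_p2p_dma_map_pages (owned by nvidia)
[10767.592543] nvidia_p2p: exports duplicate symbol nvidia_p2p_dma_map_pages (owned by nvidia)
EOF
```

The “owned by” field names the module that already exports the symbol, which here is nvidia in every line.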

Here is the log from when I modified and rebuilt nvidia.ko:
ManualChangeAndKernelInsertion.log (257.1 KB)

What am I doing wrong? Please help.

AastaLLL · December 11, 2024, 7:10am

Hi,

You should be able to get the RDMA sample working just by:

Setup device with JetPack 6.1
Checkout rel-36+ and build

No manual patch is required.
Could you give it a try?

Thanks.

b190033 · December 17, 2024, 11:36am

I have set up the device with JetPack 6.1:

$ cat /etc/nv_tegra_release
# R36 (release), REVISION: 4.0, GCID: 37537400, BOARD: generic, EABI: aarch64, DATE: Fri Sep 13 04:36:44 UTC 2024
# KERNEL_VARIANT: oot
TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia
$ uname -a
Linux ubuntu 5.15.148-tegra #1 SMP PREEMPT Thu Sep 12 21:01:54 PDT 2024 aarch64 aarch64 aarch64 GNU/Linux

But we still get the “Orin does not support RDMA” error.

So I tried building jetson-rdma-picoevb again, and now I get an error at build time:

$ ./build-for-jetson-drive-igpu-native.sh 
make -C "/lib/modules/5.15.148-tegra/build" "M=$PWD" "modules"
make[1]: Entering directory '/usr/src/linux-headers-5.15.148-tegra-ubuntu22.04_aarch64/3rdparty/canonical/linux-jammy/kernel-source'
  CC [M]  /home/eol/Downloads/jetson-rdma-picoevb-master/kernel-module/picoevb-rdma.o
/home/eol/Downloads/jetson-rdma-picoevb-master/kernel-module/picoevb-rdma.c:30:10: fatal error: linux/nv-p2p.h: No such file or directory
   30 | #include <linux/nv-p2p.h>
      |          ^~~~~~~~~~~~~~~~
compilation terminated.
make[2]: *** [scripts/Makefile.build:295: /home/eol/Downloads/jetson-rdma-picoevb-master/kernel-module/picoevb-rdma.o] Error 1
make[1]: *** [Makefile:1912: /home/eol/Downloads/jetson-rdma-picoevb-master/kernel-module] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-5.15.148-tegra-ubuntu22.04_aarch64/3rdparty/canonical/linux-jammy/kernel-source'
make: *** [Makefile:18: modules] Error 2

b190033 · December 18, 2024, 2:50pm

Continuing from the last error: since it was an include error, I could solve it by adding "-I<path-to-library>" in build-for-jetson-drive-igpu-native.sh:

eol@ubuntu:~/Downloads/jetson-rdma-picoevb-master/kernel-module$ cat ~/Downloads/jetson-rdma-picoevb-master/kernel-module/build-for-jetson-drive-igpu-native.sh
#!/bin/s
...
exec make EXTRA_CFLAGS="-I/usr/src/nvidia/nvidia-oot/include"
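
Rather than hard-coding the include path, the directory holding linux/nv-p2p.h could be looked up. This is a minimal sketch under the assumption that the header lives somewhere below a known source root; the function name include_flag_for is hypothetical.

```shell
# Hypothetical helper: print a -I flag for the first directory under root ($1)
# that contains the header ($2, default linux/nv-p2p.h). Returns 1 if none.
include_flag_for() {
    root=$1
    header=${2:-linux/nv-p2p.h}
    hit=$(find "$root" -type f -path "*/$header" 2>/dev/null | sort | head -n 1)
    [ -n "$hit" ] || return 1
    # Strip the trailing /<header> to get the include directory itself.
    printf '%s\n' "-I${hit%/"$header"}"
}

# The root below is the one used in the script above; adjust as needed.
include_flag_for /usr/src/nvidia/nvidia-oot ||
    echo "linux/nv-p2p.h not found under /usr/src/nvidia/nvidia-oot"
```

The printed flag could then be passed as EXTRA_CFLAGS to make, in the same way as the edited script does.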

That resolved the previous error, but now the build reports that the nvidia_p2p* symbols are missing.
I thought that rebuilding /usr/lib/modules/5.15.148-tegra/updates/opensrc-disp/nvidia.ko
would fix it, so I went back to my kernel sources,
and I am currently getting a compiler mismatch error:

 cd ~/Downloads/Linux_for_Tegra/source/kernel/nvdisplay/
eol@ubuntu:~/Downloads/Linux_for_Tegra/source/kernel/nvdisplay$ make
make -C src/nvidia
make[1]: Entering directory '/home/eol/Downloads/Linux_for_Tegra/source/kernel/nvdisplay/src/nvidia'
make[1]: Nothing to be done for 'default'.
make[1]: Leaving directory '/home/eol/Downloads/Linux_for_Tegra/source/kernel/nvdisplay/src/nvidia'
cd kernel-open/nvidia/ && ln -sf ../../src/nvidia/_out/Linux_aarch64/nv-kernel.o nv-kernel.o_binary
make -C src/nvidia-modeset
make[1]: Entering directory '/home/eol/Downloads/Linux_for_Tegra/source/kernel/nvdisplay/src/nvidia-modeset'
make[1]: Nothing to be done for 'default'.
make[1]: Leaving directory '/home/eol/Downloads/Linux_for_Tegra/source/kernel/nvdisplay/src/nvidia-modeset'
cd kernel-open/nvidia-modeset/ && ln -sf ../../src/nvidia-modeset/_out/Linux_aarch64/nv-modeset-kernel.o nv-modeset-kernel.o_binary
make -C kernel-open modules
make[1]: Entering directory '/home/eol/Downloads/Linux_for_Tegra/source/kernel/nvdisplay/kernel-open'
make[2]: Entering directory '/usr/src/linux-headers-5.15.148-tegra-ubuntu22.04_aarch64/3rdparty/canonical/linux-jammy/kernel-source'
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  You are using:           cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Warning: Compiler version check failed:

The major and minor number of the compiler used to
compile the kernel:

aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2022.08) 11.3.0, GNU ld (GNU Binutils) 2.38

does not match the compiler used here:

cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


It is recommended to set the CC environment variable
to the compiler that was used to compile the kernel.

To skip the test and silence this warning message, set
the IGNORE_CC_MISMATCH environment variable to "1".
However, mixing compiler versions between the kernel
and kernel modules can result in subtle bugs that are
difficult to diagnose.

*** Failed CC version check. ***

make[2]: Leaving directory '/usr/src/linux-headers-5.15.148-tegra-ubuntu22.04_aarch64/3rdparty/canonical/linux-jammy/kernel-source'
make[1]: Leaving directory '/home/eol/Downloads/Linux_for_Tegra/source/kernel/nvdisplay/kernel-open'

I am not sure whether I should change the compiler, or whether the build would work if I ignored the check. Please let me know if you know.
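
The check in the log compares the major and minor compiler version. As a minimal sketch, that comparison can be reproduced on the two banner strings from the output above; the function names first_version and same_major_minor are hypothetical, and grep -o is assumed to be available (GNU or BSD grep).

```shell
# Hypothetical helper: print the first x.y.z version found in a banner string.
first_version() {
    printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Hypothetical helper: succeed when both banners share major.minor.
same_major_minor() {
    a=$(first_version "$1")
    b=$(first_version "$2")
    [ -n "$a" ] && [ "${a%.*}" = "${b%.*}" ]
}

# Banner strings copied from the build output above.
kernel_cc='aarch64-buildroot-linux-gnu-gcc.br_real (Buildroot 2022.08) 11.3.0, GNU ld (GNU Binutils) 2.38'
local_cc='cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0'

if same_major_minor "$kernel_cc" "$local_cc"; then
    echo "major.minor match"
else
    echo "major.minor differ: $(first_version "$kernel_cc") vs $(first_version "$local_cc")"
fi
```

The two ways forward named in the message itself are setting CC to the compiler that built the kernel, or setting IGNORE_CC_MISMATCH to "1" to skip the test, with the caveat the message gives about mixing compiler versions.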

Another approach that I thought of was reconfiguring the kernel and building my own kernel image and modules. I downloaded the driver source packages for rel36.4, since I’m on rel36.4, which came with JetPack 6.1.
To make the modules:

tar -xvf public_sources.tbz2 #the file that i downloaded
tar -xvf Linux_for_Tegra/source/public/kernel_src.tbz2 -C /usr/src/ 
cd /usr/src/kernel/kernel-jammy-src/
sudo make menuconfig

I navigated to

Device Drivers  ---> 
    <M> InfiniBand support

(Not sure whether this will serve my case of RDMA over PCIe.)
then,

make -j$(nproc)
sudo make modules_install
sudo make install
reboot

Being completely new to all of this, I was surprised to find that an image had been produced at /usr/src/kernel/kernel-jammy-src/arch/arm64/boot/Image. Taking that as a sign that I should boot from it, I did so (not without backing things up):

sudo cp /boot/Image /boot/Image.backup
sudo cp arch/arm64/boot/Image /boot/Image

I also made changes to my /boot/extlinux/extlinux.conf according to the instructions provided in the file:

$ cat /boot/extlinux/extlinux.conf
TIMEOUT 30
DEFAULT primary

MENU TITLE L4T boot options

LABEL primary
      MENU LABEL primary kernel
      LINUX /boot/Image
      INITRD /boot/initrd
      APPEND ${cbootargs} root=PARTUUID=4e2ed61a-0689-406b-8b81-b178b1ec7fcf rw rootwait rootfstype=ext4 mminit_loglevel=4 console=ttyTCU0,115200 console=ttyAMA0,115200 firmware_class.path=/etc/firmware fbcon=map:0 nospectre_bhb video=efifb:off console=tty0 nv-auto-config 

# When testing a custom kernel, it is recommended that you create a backup of
# the original kernel and add a new entry to this file so that the device can
# fallback to the original kernel. To do this:
#
# 1, Make a backup of the original kernel
#      sudo cp /boot/Image /boot/Image.backup
#
# 2, Copy your custom kernel into /boot/Image
#
# 3, Uncomment below menu setting lines for the original kernel
#
# 4, Reboot

LABEL CustomKernel
      MENU LABEL Custom Kernel r36.4
      LINUX /boot/Image
      INITRD /boot/initrd
      APPEND ${cbootargs} root=PARTUUID=4e2ed61a-0689-406b-8b81-b178b1ec7fcf rw rootwait rootfstype=ext4 mminit_loglevel=4 console=ttyTCU0,115200 console=ttyAMA0,115200 firmware_class.path=/etc/firmware fbcon=map:0 nospectre_bhb video=efifb:off console=tty0 nv-auto-config


LABEL backup
      MENU LABEL backup kernel
      LINUX /boot/Image.backup
      INITRD /boot/initrd
      APPEND ${cbootargs}

After booting, I found that lsmod | grep rdma showed nothing and my Wi-Fi adapter was not working either.
Should I just have run ./install.sh?

eol@ubuntu:/usr/src/kernel/kernel-jammy-src/arch/arm64/boot$ ll
total 54880
drwxr-xr-x  3 25011 dip      4096 Dec 17 20:05 ./
drwxr-xr-x 14 25011 dip      4096 Dec 17 17:50 ../
drwxr-xr-x 33 25011 dip      4096 Sep 13 09:29 dts/
-rw-r--r--  1 root  root 42172928 Dec 17 20:04 Image
-rw-r--r--  1 root  root      124 Dec 17 20:04 .Image.cmd
-rw-r--r--  1 root  root 14125099 Dec 17 20:05 Image.gz
-rw-r--r--  1 root  root      101 Dec 17 20:05 .Image.gz.cmd
-rw-r--r--  1 25011 dip      1562 Sep 13 09:29 install.sh
-rw-r--r--  1 25011 dip       960 Sep 13 09:29 Makefile

Please help.

AastaLLL · December 19, 2024, 4:55am

Hi,

Have you checked out the rel-36+ branch?
Please also check the below comment for the changes required for JetPack 6:

Thanks.

b190033 · December 31, 2024, 11:10am

I used the rel-36 branch, and I could run make successfully following your steps. Thank you.

$ lsmod | grep rdma
picoevb_rdma           24576  0
nvidia_p2p             20480  1 picoevb_rdma
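
The same check can be scripted. This is a minimal sketch with a hypothetical function name, modules_loaded, that reads lsmod-style text on stdin and succeeds only if every named module appears in the first column; the sample input is the lsmod output above.

```shell
# Hypothetical helper: succeed when every module named in the arguments
# appears in the first column of the lsmod-style text read from stdin.
modules_loaded() {
    listing=$(cat)
    for m in "$@"; do
        printf '%s\n' "$listing" |
            awk -v m="$m" '$1 == m { found = 1 } END { exit !found }' || return 1
    done
}

# On the target: lsmod | modules_loaded picoevb_rdma nvidia_p2p
if modules_loaded picoevb_rdma nvidia_p2p <<'EOF'
picoevb_rdma           24576  0
nvidia_p2p             20480  1 picoevb_rdma
EOF
then
    echo "both modules are loaded"
else
    echo "at least one module is missing"
fi
```

This only confirms that the modules are loaded; it says nothing about whether a particular application or third-party driver actually uses them.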

Does this confirm that RDMA will work, or does it still require reconfiguring and rebuilding the kernel using make menuconfig?
I ask because my RDMA application, which should fetch data from a Spectrum card using its API, is still not working.

$ ./rdma_fifo_fft
Found: M4i.4480-x8 sn 22609
Used sample rate: 200000000
Detected 1 CUDA Capable device(s).

Using device 0: "Orin"
Call: (SPC_COMMAND, SPC_RESET) -> Some error occurred in the kernel driver. On Linux check output of dmesg for details
$ sudo dmesg
[16722.540956] IPv6: ADDRCONF(NETDEV_CHANGE): wlP1p1s0: link becomes ready
[20266.885983] ***** BuildSGList: no support for CUDA RDMA in kernel module
[20706.029163] ***** BuildSGList: no support for CUDA RDMA in kernel module
[20914.286469] ***** BuildSGList: no support for CUDA RDMA in kernel module

I tried the test programs, but it seems they require either an RHS Research PicoEVB or a HiTech Global HTG-K800, whereas I have a Spectrum card. I tried replacing /dev/picoevb with /dev/spcm0, but some flags were not working.
Sorry for the late reply. I had to move on with plain DMA only, but I still can’t deny how much better RDMA could be.

AastaLLL · January 2, 2025, 7:06am

Hi,

Have you tried the application on Jetson with the previous BSP?
If so, which device and BSP did you use?

There are some changes in the RDMA API.
Could you check if there is any update required in your use case?

Thanks.

system · January 29, 2025, 2:45am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
