GPUDirect RDMA - Module cannot be inserted into kernel cont'd

I’m using JetPack r35.4.1.

I built the picoevb kernel module but can’t insmod it:

insmod: ERROR: could not insert module ./picoevb-rdma.ko: Unknown symbol in module

In the dmesg log, the errors all show -2 (err -2).

I read this post:

The thing is that some of the exported functions mentioned in the quick fix do not appear in nv-p2p.c, for example nvidia_p2p_dma_unmap_pages.

(I used the public sources for my revision)
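
In case it helps with debugging, the missing symbols can be narrowed down with something like the following (a rough sketch, run from the module build directory on the Jetson):

# Symbols the module expects the kernel to provide
$ nm -u ./picoevb-rdma.ko | grep nvidia_p2p
# nvidia_p2p symbols the running kernel currently exports
# (empty output usually means nvidia-p2p.ko has not been loaded yet)
$ sudo grep nvidia_p2p /proc/kallsyms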

Hi,

GPUDirect RDMA should work by default on r35.4.1.
Could you share your detailed steps and more error output (if any) with us?

Thanks.

Hi,
I followed the instructions from the Git repository.
I downloaded the zip and built the kernel module on the Jetson itself.
log.log (75.5 KB)

I attached a shell log showing the JetPack version, the kernel module build, the insmod error, and the dmesg output.

Hi,

Please insert the nvidia-p2p.ko module first.

$ sudo apt install build-essential bc
$ cd jetson-rdma-picoevb/kernel-module/
$ ./build-for-jetson-drive-igpu-native.sh
$ sudo insmod /lib/modules/5.10.120-tegra/kernel/drivers/nv-p2p/nvidia-p2p.ko
$ sudo insmod ./picoevb-rdma.ko
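
If both insmod commands succeed, a quick sanity check could look like this:

# Both modules should now be listed
$ lsmod | grep -E 'nvidia_p2p|picoevb'
# and dmesg should show no more "Unknown symbol" errors
$ dmesg | tail -n 20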

Thanks.

Thank you.
I have another problem now:
After inserting picoevb-rdma.ko, I don’t see the module being used as the kernel driver for the FPGA.

The repository instructions state the following:

To load the kernel module, execute:

sudo insmod ./picoevb-rdma.ko

Once the module is loaded, executing lspci -v should show that the module is in use as the kernel driver for the FPGA board:

$ lspci -v
...
0003:01:00.0 Memory controller: NVIDIA Corporation Device 0001
	Subsystem: NVIDIA Corporation Device 0001
	Flags: bus master, fast devsel, latency 0, IRQ 36
	Memory at 34210000 (32-bit, non-prefetchable) [size=4K]
	Memory at 34200000 (32-bit, non-prefetchable) [size=64K]
	Capabilities: <access denied>
	Kernel driver in use: picoevb-rdma

I don’t see an NVIDIA memory controller.
This is my lspci output:

0001:00:00.0 PCI bridge: NVIDIA Corporation Device 229e (rev a1) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 64
Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
I/O behind bridge: 00001000-00001fff [size=4K]
Memory behind bridge: 40000000-400fffff [size=1M]
Prefetchable memory behind bridge: [disabled]
Capabilities:
Kernel driver in use: pcieport

0001:01:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8822CE 802.11ac PCIe Wireless Network Adapter
Subsystem: AzureWave RTL8822CE 802.11ac PCIe Wireless Network Adapter
Flags: bus master, fast devsel, latency 0, IRQ 312
I/O ports at 1000 [size=256]
Memory at 20a8000000 (64-bit, non-prefetchable) [size=64K]
Capabilities:
Kernel driver in use: rtl88x2ce
Kernel modules: rtl8822ce

0005:00:00.0 PCI bridge: NVIDIA Corporation Device 229a (rev a1) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 68
Bus: primary=00, secondary=01, subordinate=ff, sec-latency=0
I/O behind bridge: [disabled]
Memory behind bridge: 40000000-401fffff [size=2M]
Prefetchable memory behind bridge: [disabled]
Capabilities:
Kernel driver in use: pcieport

0005:01:00.0 Serial controller: Xilinx Corporation Device 8022 (prog-if 01 [16450])
Subsystem: Xilinx Corporation Device 0007
Flags: bus master, fast devsel, latency 0, IRQ 68
Memory at 2b28000000 (32-bit, non-prefetchable) [size=1M]
Memory at 2b28100000 (32-bit, non-prefetchable) [size=64K]
Capabilities:
Kernel driver in use: xdma
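
The Xilinx endpoint at 0005:01:00.0 is currently claimed by xdma. A rough way to let picoevb-rdma try to claim it instead (this assumes xdma was loaded as a module, and that the board’s vendor/device ID is actually in picoevb-rdma’s PCI match table, which is not guaranteed for a custom XDMA design):

# Show the numeric [vendor:device] ID of the FPGA endpoint
$ sudo lspci -nn -s 0005:01:00.0
# Release the device from the xdma driver
$ sudo rmmod xdma
# Loading picoevb-rdma re-probes unbound devices that match its ID table
$ sudo insmod ./picoevb-rdma.ko
# Check which driver is now in use
$ sudo lspci -k -s 0005:01:00.0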

Hi,

Could you share which FPGA board you are using?
Below is the list of supported boards:

  • RHS Research PicoEVB.
  • HiTech Global HTG-K800.

Thanks.

Thank you for your response.
I’m using a Xilinx Kintex UltraScale FPGA. Our engineer said it uses an IP identical to the boards you mentioned; the difference is that it uses the Xilinx XDMA driver.

Hi,

Let us check with our internal team and get back to you with more info later.

Thanks.

Hi,
Anything new by any chance?

Thanks

Hi,

We are still waiting for our internal team to check this.
We will share more info once we get a response.

Thanks.

Meanwhile, I tried using the HiTech Global FPGA.
I programmed it with the demo files and ran the rdma-cuda test.
I got many data errors; see the following log:
rdma-cuda.log (24.9 MB)
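
A rough way to get an overview of the failures, assuming the test prints one line per mismatch containing the word "error":

# Count the mismatches and look at the first few
$ grep -ci error rdma-cuda.log
$ grep -i error rdma-cuda.log | head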

What could be the reason for that?

Thanks

Hi,

We will check this with our internal team.
Thanks.

Hi,

Thanks for your patience.
We need more info to check the issue. Please help provide the following:

The output of sudo lspci, sudo lspci -tv, and sudo lspci -vv.

Thanks.

I’ve done some fixes and now it works.

May I ask a few quick questions regarding the demo itself?

  1. I want to verify that the DMA controller performing the copy is the one that belongs to the FPGA and not to the Jetson (i.e., the FPGA’s DMA engine is what copies data to and from memory in the GPU’s address space).

  2. Are there any “porting” or “migration” instructions for applying this demo to another DMA core?

  3. In the GPUDirect mechanism, does the GPU allocate the buffers in its own private RAM, or is it the same system RAM the CPU also uses?

  • If it’s the same RAM, the GPU still uses its own address space (different from the CPU’s). So what I’m actually saving is the need to translate addresses from the GPU address space into the CPU address space? Is that what makes GPUDirect faster?

Thanks

Hi,

Would you mind sharing how you fixed it?
This will help other users who are facing a similar issue.

1. Our memory controller is only used for the FPGA to read from and write to the Jetson’s memory.
3. Jetson is a shared-memory system, so the GPU and CPU use the same physical memory.
If you want to access the buffer with the GPU, you will need GPUDirect,
since a CPU buffer (even one that supports DMA) is not accessible to the GPU.

Thanks.

Hi,
I had some mistakes in the FPGA design: I was using an older DDR model. After updating it, everything worked.

I apologize, but I didn’t understand your answers:

  1. Does the FPGA’s DMA controller do all the work in the demo (copying data from the Jetson’s memory into the FPGA’s memory and back)?

  2. If the CPU and GPU share the same physical RAM, why is using GPUDirect RDMA faster than regular RDMA (not GPUDirect)?

Thanks

Hi,

1. Yes.
2. These are different RDMA solutions.
We don’t have a benchmark table comparing the two.
However, the buffer used by regular RDMA is usually only accessible to the CPU.

Thanks.

@AastaLLL thank you for your answer. Sorry for the late reply; we had holidays.
