GPUDirect RDMA on Jetson Orin (nvidia_p2p_dma_map_pages)

Hi,

I have followed the link below and resolved the nvidia-p2p library issue.

When I perform GPUDirect RDMA to iGPU memory with a size of 33177600 bytes, I find that nvidia_p2p_dma_map_pages returns only 1 entry (page), as shown below:

[27983.043212] xdma:pevb_get_userbuf_cuda: before nvidia_p2p_dma_map_pages
[27983.043749] xdma:pevb_get_userbuf_cuda: ubuf->map->entries = 1
[27983.043751] xdma:pevb_get_userbuf_cuda: cusurf->offset = 0
[27983.043752] xdma:pevb_get_userbuf_cuda: cusurf->len = 33177600

When I perform GPUDirect RDMA (x86 PC) to RTX 4000 Quadro memory with the same size of 33177600 bytes, nvidia_p2p_dma_map_pages returns the correct number of entries (pages). Everything works fine on the x86 system using the GPUDirect RDMA functions.
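
For context, this is roughly what my pevb_get_userbuf_cuda path does (a simplified sketch; pin_and_map and free_cb are illustrative names, error paths are trimmed, and the struct fields should be checked against the nv-p2p.h shipped with the BSP):

#include <linux/dma-mapping.h>
#include <linux/printk.h>
#include "nv-p2p.h"

/* Called by the driver if the pinned pages are revoked underneath us. */
static void free_cb(void *data)
{
}

/* Pin a CUDA allocation and DMA-map it (Jetson/Tegra API flavor). */
static int pin_and_map(struct device *dev, u64 vaddr, u64 len)
{
	struct nvidia_p2p_page_table *pt = NULL;
	struct nvidia_p2p_dma_mapping *map = NULL;
	u32 i;
	int ret;

	ret = nvidia_p2p_get_pages(vaddr, len, &pt, free_cb, NULL);
	if (ret)
		return ret;

	ret = nvidia_p2p_dma_map_pages(dev, pt, &map, DMA_BIDIRECTIONAL);
	if (ret) {
		nvidia_p2p_put_pages(pt);
		return ret;
	}

	/* On Orin this reports entries == 1 for the 33177600-byte buffer. */
	pr_info("entries = %u\n", map->entries);
	for (i = 0; i < map->entries; i++)
		pr_info("seg %u: addr=%pad len=%llu\n",
			i, &map->hw_address[i], map->hw_len[i]);
	return 0;
}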

Any idea how to resolve this?

Regards
YE

Hi,

Have you tried the same on Xavier before?
The APIs between Jetson and desktop GPUs are slightly different.
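
From memory, the prototypes differ roughly as below; please verify against the nv-p2p.h header in your kernel sources, as the arguments may not match your release exactly:

/* Desktop dGPU flavor: */
int nvidia_p2p_get_pages(uint64_t p2p_token, uint32_t va_space,
                         uint64_t virtual_address, uint64_t length,
                         struct nvidia_p2p_page_table **page_table,
                         void (*free_callback)(void *data), void *data);
int nvidia_p2p_dma_map_pages(struct pci_dev *peer,
                             struct nvidia_p2p_page_table *page_table,
                             struct nvidia_p2p_dma_mapping **dma_mapping);

/* Jetson (Tegra) flavor: */
int nvidia_p2p_get_pages(u64 vaddr, u64 size,
                         struct nvidia_p2p_page_table **page_table,
                         void (*free_callback)(void *data), void *data);
int nvidia_p2p_dma_map_pages(struct device *dev,
                             struct nvidia_p2p_page_table *page_table,
                             struct nvidia_p2p_dma_mapping **dma_mapping,
                             enum dma_data_direction direction);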

Thanks.

Hi,

I do not have a Xavier board, only an Orin.

I changed the API to the Tegra variant, as shown below.

I followed the code closely using this link.

One thing I noticed is that once the hardware changes to Jetson, the GPU page size changes to 4K, as shown below:

#ifdef NV_BUILD_DGPU
#define GPU_PAGE_SHIFT 16 /* dGPU: 64 KiB GPU pages */
#else
#define GPU_PAGE_SHIFT 12 /* Jetson iGPU: 4 KiB pages */
#endif
#define GPU_PAGE_SIZE (((u64)1) << GPU_PAGE_SHIFT)
#define GPU_PAGE_OFFSET (GPU_PAGE_SIZE - 1)
#define GPU_PAGE_MASK (~GPU_PAGE_OFFSET)
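
For example, these macros round the pinning request to GPU page boundaries (a sketch; vaddr and len stand for the user pointer and length passed in via the ioctl):

/* Round the start down and the end up to GPU page boundaries. */
u64 aligned_vaddr = vaddr & GPU_PAGE_MASK;
u64 aligned_end = (vaddr + len + GPU_PAGE_OFFSET) & GPU_PAGE_MASK;
u64 aligned_len = aligned_end - aligned_vaddr;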

Regards
YE

Hi,

Thanks for the details.

Let us check with the dev team about this issue.
Will share more information with you later.

Hi,

It’s expected that RDMA works on Orin just as it does on Xavier.
Since there are some differences in the API between dGPU and iGPU, please check whether you have applied all the requirements shown in the porting document below:

Thanks.

Hi,

Ok will check again.

Can I confirm that dma_mapping->entries (the entries field of the struct nvidia_p2p_dma_mapping returned by nvidia_p2p_dma_map_pages) cannot be 1 for a 32 MB transfer size?

Thanks

Regards
YE

Hi,

The mapping size should be a multiple of 4K.
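For a 33177600-byte buffer that works out to exactly 33177600 / 4096 = 8100 iGPU pages (with the 64 KiB dGPU page size the same buffer spans about 507 pages). Also note that on Jetson the pages are mapped through the SMMU, so physically scattered pages may be coalesced into fewer IOVA-contiguous entries; it is safer to walk hw_len[] per entry than to assume one entry per page.
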
You can find some discussion in the topic below:

Thanks.

Hi,

I managed to perform RDMA on the Orin board with an FPGA sending a 4K RGBA image. Here are some preliminary performance numbers.

Both directions show the same performance: around 21 ms per frame (about 47 FPS).

Allocation of GPU buffer passed: 0
cuPointerSetAttribute(buf) passed: 0
ioctl(PIN_CUDA buf) passed: ret=0 errno=17
Allocation of GPU buffer passed: 0
cuPointerSetAttribute(buf) passed: 0
ioctl(PIN_CUDA buf) passed: ret=0 errno=17
c2h Bytes:33177600 usecs:20897 MB/s:1587.672872
h2c Bytes:33177600 usecs:20799 MB/s:1595.153613
ioctl(UNPIN_CUDA buf) passed: 0
ioctl(UNPIN_CUDA buf) passed: 0

When executing RDMA on the dGPU, the value is 13.8 ms in both directions.

I also used zero copy/unified memory on the Orin hardware, and the transfer rate is also around 47 FPS.
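
For reference, the two non-RDMA cases were allocated roughly like this (a minimal sketch using standard CUDA runtime calls; error checking omitted):

#include <cuda_runtime.h>

int main(void)
{
	size_t len = 33177600; /* 3840 x 2160 x 4 bytes (4K RGBA) */
	void *host_ptr, *dev_ptr, *managed_ptr;

	/* Zero copy: pinned host memory mapped into the GPU address space. */
	cudaHostAlloc(&host_ptr, len, cudaHostAllocMapped);
	cudaHostGetDevicePointer(&dev_ptr, host_ptr, 0);

	/* Unified memory: one pointer usable from both CPU and iGPU. */
	cudaMallocManaged(&managed_ptr, len, cudaMemAttachGlobal);

	/* ... run the same transfer benchmark on dev_ptr / managed_ptr ... */

	cudaFreeHost(host_ptr);
	cudaFree(managed_ptr);
	return 0;
}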

Can I conclude that RDMA, zero copy, and unified-memory data transfers on Orin all give the same performance?

Thank you for your help and assistance.

Regards
YE

Hi,

We are double-checking this with the internal team.

Thanks.

Hi,

I tried loading nvidia.ko and nvidia-p2p.ko together, using the suggested modification, so that the display kernel driver is available as well.

I ran the RDMA test again. This time it takes ~13.8 ms, comparable to the dGPU on x86.

Allocation of GPU buffer passed: 0
cuPointerSetAttribute(buf) passed: 0
ioctl(PIN_CUDA buf) passed: ret=0 errno=17
Allocation of GPU buffer passed: 0
cuPointerSetAttribute(buf) passed: 0
ioctl(PIN_CUDA buf) passed: ret=0 errno=17
c2h Bytes:33177600 usecs:13895 MB/s:2387.736596
h2c Bytes:33177600 usecs:13814 MB/s:2401.737368
ioctl(UNPIN_CUDA buf) passed: 0
ioctl(UNPIN_CUDA buf) passed: 0

Thanks again for your help

Regards
YE

Hi,

Good to know you can get comparable performance now.

We also confirmed this with our internal team.
Since the Orin iGPU has no dedicated VRAM, RDMA, zero copy, and unified memory all end up in the same system memory, so similar performance is expected.

Thanks.

Hi,

Ok thanks for the confirmation.

I need some clarification on the iGPU and dGPU options on Jetson Orin hardware.

  1. Orin can connect a dGPU over the PCIe interface, correct?
  2. If a dGPU is used on Jetson Orin, the iGPU cannot be used, right?
  3. If we have an external PCIe device with video traffic accessing GPU memory via RDMA, which is better: iGPU or dGPU?
  4. Most dGPUs expose only a 256 MB PCIe BAR1 aperture, which may not be enough for performing RDMA on GPU memory. Can the BAR1 size be increased?

Thank you.

Regards
YE

Hi,

Please check the topic below for more information:

Thanks
