[Jetson AGX Orin] Is cudaHostAlloc() the official way to achieve GPUDirect RDMA on Jetson?

Hi,

I’m testing GPUDirect RDMA between two Jetson AGX Orin dev kits, each equipped with a ConnectX-6 Lx 25 GbE NIC (PCIe) and directly connected with an SFP28 cable (no switch).

Both systems are running JetPack 6.2.1.

The goal is to achieve low-latency and low-CPU-overhead real-time camera video transfer between GPUs.


➀ About cudaHostAlloc()

According to the official NVIDIA documentation:
GPUDirect RDMA Guide

On Tegra platforms, applications must replace cudaMalloc() with cudaHostAlloc() when using GPUDirect RDMA.

Since cudaHostAlloc() (with the cudaHostAllocMapped flag) allocates pinned host memory (system RAM), I understand this is not true GPU device memory.

In this case, is this method considered the official and correct way to achieve GPUDirect RDMA on Jetson AGX Orin?

Or should it be regarded as a different mechanism (like zero-copy transfer using pinned host memory) rather than true GPUDirect RDMA?
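
For concreteness, the kind of allocation I am describing looks roughly like this (a simplified sketch; the frame size is a placeholder and error handling is trimmed):

#include <cuda_runtime.h>
#include <cstdio>

int main(void)
{
	const size_t frame_size = 4 * 1024 * 1024;	/* placeholder for one camera frame */
	void *host_ptr = NULL;
	void *dev_ptr  = NULL;

	/* Pinned, GPU-mapped host memory (system RAM, not dedicated GPU memory) */
	cudaError_t ce = cudaHostAlloc(&host_ptr, frame_size, cudaHostAllocMapped);
	if (ce != cudaSuccess) {
		fprintf(stderr, "cudaHostAlloc failed: %s\n", cudaGetErrorString(ce));
		return 1;
	}

	/* Device-side alias of the same physical pages (zero-copy access from kernels) */
	ce = cudaHostGetDevicePointer(&dev_ptr, host_ptr, 0);
	if (ce != cudaSuccess) {
		fprintf(stderr, "cudaHostGetDevicePointer failed: %s\n", cudaGetErrorString(ce));
		return 1;
	}

	/* host_ptr is what I register with the NIC; dev_ptr is what kernels would use */
	cudaFreeHost(host_ptr);
	return 0;
}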

➁ About cudaMalloc() and nvidia-peermem

When I use ibv_reg_mr() to register memory allocated by cudaMalloc(), the call fails with “Bad address”, while the same registration works fine when the memory is allocated with cudaHostAlloc().

So I’d like to confirm:

  • On Jetson, is the intended/official approach to use cudaHostAlloc() (pinned host memory) for GPUDirect RDMA without relying on drivers like nvidia-peermem?
  • In other words, for Jetson AGX Orin, should we avoid trying to use cudaMalloc() device memory with verbs and instead follow the cudaHostAlloc() path as the correct model?
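
For reference, the registration path that works for me looks roughly like the sketch below (protection-domain and context setup omitted; register_pinned_buffer is just an illustrative helper name):

#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdio.h>

/* Sketch: register a cudaHostAlloc() buffer with the verbs stack.
 * "pd" is an already-created protection domain; device/context setup omitted. */
static struct ibv_mr *register_pinned_buffer(struct ibv_pd *pd, size_t len)
{
	void *buf = NULL;

	/* Pinned host memory -- this registration succeeds for me */
	if (cudaHostAlloc(&buf, len, cudaHostAllocMapped) != cudaSuccess)
		return NULL;

	struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
			IBV_ACCESS_LOCAL_WRITE |
			IBV_ACCESS_REMOTE_READ |
			IBV_ACCESS_REMOTE_WRITE);
	if (!mr) {
		perror("ibv_reg_mr");	/* with cudaMalloc() memory this is where I see "Bad address" */
		cudaFreeHost(buf);
		return NULL;
	}
	return mr;
}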

Any clarification about the officially recommended approach on Jetson AGX Orin for GPUDirect RDMA would be greatly appreciated.

Thanks!

Hi,

You can find an example of GPUDirect RDMA for Orin in the jetson-rdma-picoevb sample: https://github.com/NVIDIA/jetson-rdma-picoevb

More precisely, the difference in allocation can be found in the line below:
https://github.com/NVIDIA/jetson-rdma-picoevb/blob/rel-36%2B/client-applications/rdma-cuda.cu#L101

#ifdef NV_BUILD_DGPU
	/* Discrete GPU build: allocate real device memory (VRAM) */
	ce = cudaMalloc(&dst_d, SURFACE_SIZE * sizeof(*dst_d));
#else
	/* Tegra (Jetson) build: allocate pinned host memory shared with the GPU */
	ce = cudaHostAlloc(&dst_d, SURFACE_SIZE * sizeof(*dst_d),
		cudaHostAllocDefault);
#endif

That’s because the memory on Jetson is shared between the CPU and the GPU.
On a desktop platform, by contrast, the GPU has its own dedicated memory, so the mechanism is different.
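
As a rough illustration of that shared memory (a minimal sketch, not taken from the sample; error checking omitted), a kernel can write into the cudaHostAlloc() buffer and the CPU can read the result back without any cudaMemcpy():

#include <cuda_runtime.h>
#include <cstdio>

__global__ void fill(unsigned int *buf, unsigned int n)
{
	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		buf[i] = i;
}

int main(void)
{
	const unsigned int n = 1024;
	unsigned int *host_ptr = NULL;
	unsigned int *dev_ptr = NULL;

	/* Pinned host memory -- on Jetson this lives in the same LPDDR the CPU uses */
	cudaHostAlloc((void **)&host_ptr, n * sizeof(*host_ptr), cudaHostAllocMapped);
	cudaHostGetDevicePointer((void **)&dev_ptr, host_ptr, 0);

	/* The GPU writes straight into the pinned pages */
	fill<<<(n + 255) / 256, 256>>>(dev_ptr, n);
	cudaDeviceSynchronize();

	/* ...and the CPU reads the very same pages */
	printf("host_ptr[42] = %u\n", host_ptr[42]);
	cudaFreeHost(host_ptr);
	return 0;
}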

Thanks.

Thanks for the helpful explanation and reference — that was very clear.
I’d like to confirm a couple of additional points:

  1. Terminology
    Just to reconfirm: on Jetson, is it correct to call the RDMA transfer using memory allocated by cudaHostAlloc() (pinned host memory) “GPUDirect RDMA”?

  2. cudaHostAllocDefault vs cudaHostAllocMapped
    In the sample code, cudaHostAllocDefault is used, but I was under the impression that zero-copy would not be enabled without cudaHostAllocMapped.
    Could you clarify the behavioral difference between these two flags on Jetson, and which one is recommended for RDMA use? (I have sketched below the small probe I was planning to run.)
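
For what it's worth, this is the probe I intend to run on both boards to compare the two flags (just an illustrative test, not taken from any sample):

#include <cuda_runtime.h>
#include <cstdio>

/* Try cudaHostGetDevicePointer() on an allocation made with the given flag
 * and report what happens. */
static void probe(unsigned int flags, const char *name)
{
	void *h = NULL, *d = NULL;

	cudaError_t ce = cudaHostAlloc(&h, 1 << 20, flags);
	if (ce != cudaSuccess) {
		printf("%s: cudaHostAlloc failed (%s)\n", name, cudaGetErrorString(ce));
		return;
	}

	ce = cudaHostGetDevicePointer(&d, h, 0);
	printf("%s: host=%p dev=%p (%s)\n", name, h, d, cudaGetErrorString(ce));
	cudaFreeHost(h);
}

int main(void)
{
	probe(cudaHostAllocDefault, "cudaHostAllocDefault");
	probe(cudaHostAllocMapped,  "cudaHostAllocMapped");
	return 0;
}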

Hi,

1. Yes.
2. cudaHostAllocDefault is supported.
Please see below for more information:

Thanks.