Hi,
I’m testing GPUDirect RDMA between two Jetson AGX Orin dev kits, each equipped with a ConnectX-6 Lx 25 GbE NIC (PCIe) and directly connected with an SFP28 cable (no switch).
Both systems are running JetPack 6.2.1.
The goal is to achieve low-latency and low-CPU-overhead real-time camera video transfer between GPUs.
➀ About cudaHostAlloc()
According to the official NVIDIA documentation (GPUDirect RDMA Guide):

> On Tegra platforms, applications must replace cudaMalloc() with cudaHostAlloc() when using GPUDirect RDMA.
Since cudaHostAlloc() (with the cudaHostAllocMapped flag) allocates pinned host memory (system RAM), I understand this is not true GPU device memory.
In this case, is this method considered the official and correct way to achieve GPUDirect RDMA on Jetson AGX Orin?
Or should it be regarded as a different mechanism (like zero-copy transfer using pinned host memory) rather than true GPUDirect RDMA?
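For reference, this is roughly the pattern I am using on the cudaHostAlloc() side. It is only a minimal sketch: the device index, buffer size, and access flags are arbitrary placeholders, and most error handling is trimmed.

```c
/* Minimal sketch: allocate pinned, mapped host memory with cudaHostAlloc()
 * and register it with ibv_reg_mr(). Device index 0 and the 4 MiB buffer
 * size are arbitrary choices for illustration. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

int main(void)
{
    const size_t buf_size = 4 * 1024 * 1024;

    /* Open the first RDMA device and create a protection domain. */
    int num_devices = 0;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list || num_devices == 0) {
        fprintf(stderr, "no RDMA device found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Pinned, mapped host memory (system RAM, not GPU device memory). */
    void *host_buf = NULL;
    cudaHostAlloc(&host_buf, buf_size, cudaHostAllocMapped);

    /* Device-side alias of the same allocation for kernels to read/write. */
    void *dev_ptr = NULL;
    cudaHostGetDevicePointer(&dev_ptr, host_buf, 0);

    /* Registration with verbs succeeds on this host pointer. */
    struct ibv_mr *mr = ibv_reg_mr(pd, host_buf, buf_size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    printf("ibv_reg_mr(%p): %s\n", host_buf, mr ? "ok" : "failed");

    if (mr) ibv_dereg_mr(mr);
    cudaFreeHost(host_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}
```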
➁ About cudaMalloc() and nvidia-peermem
When I use ibv_reg_mr() to register memory allocated by cudaMalloc(), the call fails with “Bad address”, while the same registration works fine when the memory is allocated with cudaHostAlloc().
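For context, the failing path looks roughly like the continuation below of the sketch in ➀ (same protection domain and buffer size; the function name is just a placeholder for illustration).

```c
/* Continuation of the sketch in ➀: same verbs setup, only the allocation
 * differs. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

static struct ibv_mr *try_register_cudamalloc(struct ibv_pd *pd, size_t buf_size)
{
    void *gpu_buf = NULL;
    cudaMalloc(&gpu_buf, buf_size);          /* true GPU device memory */

    /* On my setup this returns NULL with errno set to EFAULT ("Bad address"),
     * whereas the same call on a cudaHostAlloc() pointer succeeds. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, buf_size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr)
        perror("ibv_reg_mr(cudaMalloc pointer)");
    return mr;
}
```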
So I’d like to confirm:
- On Jetson, is the intended/official approach to use cudaHostAlloc() (pinned host memory) for GPUDirect RDMA, without relying on drivers like nvidia-peermem?
- In other words, for Jetson AGX Orin, should we avoid trying to use cudaMalloc() device memory with verbs and instead follow the cudaHostAlloc() path as the correct model?
Any clarification about the officially recommended approach on Jetson AGX Orin for GPUDirect RDMA would be greatly appreciated.
Thanks!