GPU-to-GPU direct data transfer with ConnectX and RDMA

Hello, we are trying to connect two GPUs located on two servers via RDMA over InfiniBand. The GPUs are NVIDIA RTX 6000 Ada and the InfiniBand NICs are NVIDIA ConnectX-6.


Our server has the configuration shown in the image: the GPU is connected in slot 2 (although it physically occupies slots 1 and 2) and the ConnectX is in slot 3.

Looking at the connection between the InfiniBand NIC and the GPU (terminal command nvidia-smi topo -m), you can see that the connection type is NODE.

Terminal output:

nvidia-smi topo -m
        GPU0    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    0,2,4,6,8,10    0               N/A
NIC0    NODE     X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:
NIC0: mlx5_0


According to the web page NVIDIA Configuration | Juniper Networks, this causes poor performance, but due to the layout of our server it is not possible to move the GPU or the ConnectX.

We have programmed two scripts in Python, one for the sending server and one for the receiving server.

The code for the server that sends data is the following
Sender code.txt (3.2 KB)
and the receiver code:
receiver code.txt (2.6 KB)

The receiver’s code follows the same structure as the sender’s, but the buffer on the receiver’s side never shows the data that was sent. Is it possible to make the connection between them work despite having a NODE connection type?

On the other hand, we are not sure whether the nvidia-peermem kernel module is loaded correctly, and whether this may be affecting the transfer.

Thank you very much

GPU0 NIC0: NODE

  • They’re connected via PCIe within the same NUMA node, but not directly on the same bridge.
  • This is a reasonably fast connection, but not the lowest latency path like PIX or NV#.

CPU Affinity for GPU0: 0,2,4,6,8,10

  • These logical CPU cores are closest to the GPU. You should pin CPU-bound tasks that interact with this GPU (e.g., data loading, preprocessing) to these cores.
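On Linux this pinning can be done directly from Python with os.sched_setaffinity, without needing taskset. A minimal sketch using the affinity list reported above (intersected with the cores actually present, so it also runs on smaller machines):

```python
import os

# Sketch: pin the current process to the cores nvidia-smi reported as
# closest to GPU0 (0,2,4,6,8,10 on the system above). Linux-only API.
gpu_cores = {0, 2, 4, 6, 8, 10}
available = os.sched_getaffinity(0)          # cores we may actually use
target = (gpu_cores & available) or available  # fall back if none overlap
os.sched_setaffinity(0, target)              # pin this process (pid 0 = self)
print("pinned to:", sorted(os.sched_getaffinity(0)))
```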

NUMA Affinity: 0

  • The GPU and NIC are associated with NUMA node 0.
  • You should prefer memory allocation and CPU threads from NUMA 0 for best performance.
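The NIC's NUMA node can also be read straight from sysfs, which is useful for double-checking what nvidia-smi reports. A minimal sketch, assuming a Linux host and the mlx5_0 device name from the legend above:

```python
# Sketch: read the NUMA node a given InfiniBand device is attached to
# from sysfs (standard PCI attribute; -1 means "no NUMA info exposed").
def nic_numa_node(ibdev="mlx5_0"):
    """Return the NUMA node of `ibdev` as an int, or None if not present."""
    path = f"/sys/class/infiniband/{ibdev}/device/numa_node"
    try:
        with open(path) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return None

print("mlx5_0 NUMA node:", nic_numa_node())
```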

⚠️ GPU NUMA ID = N/A

This may appear for some drivers or systems where the GPU is not exposed with a NUMA ID in the topology mapping. It doesn’t necessarily mean there’s a problem — but NUMA-aware applications may not automatically optimize for this GPU.

For best performance:

  • Pin processes (e.g., with numactl or taskset) to CPUs in the GPU’s affinity list.
  • Allocate memory from the NUMA node closest to the GPU/NIC.
  • If optimizing GPU-NIC communication (e.g., RDMA, GPUDirect), confirm that the GPU and NIC are as close as possible (same NUMA, PIX or NV# preferred over NODE or SYS).
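Putting the first two points together, one way to launch the scripts is under numactl so both threads and host memory stay on NUMA node 0. A sketch that just builds the command line ("sender.py" is a placeholder for your actual sender script):

```python
import shlex

# Sketch: a numactl launch line binding CPU threads and memory to NUMA
# node 0, where both the GPU and the NIC live. "sender.py" is a
# placeholder name, not the actual attached script.
cmd = [
    "numactl",
    "--cpunodebind=0",   # run threads only on NUMA node 0
    "--membind=0",       # allocate host memory only from NUMA node 0
    "python3", "sender.py",
]
print(shlex.join(cmd))
```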