Hello, we are trying to connect two GPUs located on two different servers via RDMA over InfiniBand. The GPUs are NVIDIA RTX 6000 Ada and the InfiniBand adapters are NVIDIA ConnectX-6.
Our server has the configuration shown in the attached image: the GPU is connected in slot 2 (although it physically occupies slots 1 and 2) and the ConnectX-6 is in slot 3.
Looking at the connection between the InfiniBand adapter and the GPU (terminal command nvidia-smi topo -m), you can see that the connection type is NODE.
Terminal output:
nvidia-smi topo -m
        GPU0    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    0,2,4,6,8,10    0               N/A
NIC0    NODE     X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:
  NIC0: mlx5_0
According to the web page NVIDIA Configuration | Juniper Networks, this causes poor performance, but due to the layout of our server it is not possible to move the GPU or the ConnectX.
We have written two Python scripts, one for the sending server and one for the receiving server.
The code for the server that sends data is the following:
Sender code.txt (3.2 KB)
and the receiver code:
receiver code.txt (2.6 KB)
The receiver’s code follows the same structure, but the changes the sender makes to the message are never reflected on the receiver’s side. Is it possible to establish the connection between them despite having a NODE connection type?
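In case it is useful, below is a minimal sketch of an alternative sanity check for GPU-to-GPU transfer between the two servers, using PyTorch's NCCL backend over the same InfiniBand link. This is not the attached sender/receiver code, and it assumes PyTorch with CUDA support is installed on both machines:

# Minimal sketch (not the attached scripts): sanity-check GPU-to-GPU transfer
# between the two servers with PyTorch's NCCL backend, which can use the
# ConnectX-6 / InfiniBand link underneath.
import os

import torch
import torch.distributed as dist


def main():
    # RANK=0 on the sending server, RANK=1 on the receiving server.
    # MASTER_ADDR / MASTER_PORT must point to one of the two machines.
    rank = int(os.environ["RANK"])
    dist.init_process_group(backend="nccl", rank=rank, world_size=2)
    torch.cuda.set_device(0)  # single RTX 6000 Ada per server

    tensor = torch.zeros(1024, device="cuda")
    if rank == 0:
        tensor.fill_(42.0)
        dist.send(tensor, dst=1)   # point-to-point send over the link
    else:
        dist.recv(tensor, src=0)
        # If the transfer works, this prints 42.0 instead of 0.0
        print("received, first element =", tensor[0].item())

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Launching it with RANK=0 on the sender and RANK=1 on the receiver, and with NCCL_IB_HCA=mlx5_0 and NCCL_DEBUG=INFO set, the log shows whether NCCL selects the NET/IB transport and whether GPUDirect RDMA (GDRDMA) is used; NCCL_NET_GDR_LEVEL controls up to which topology distance GPUDirect RDMA is attempted.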
On the other hand, we are not sure whether the nvidia-peermem kernel module is loaded correctly, and whether this may be affecting the transfer.
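A small sketch of how one could check this, by looking at /proc/modules (assuming the module name nvidia_peermem used by current drivers, or nv_peer_mem from the older out-of-tree nv_peer_memory package):

# Check whether a GPUDirect RDMA peer-memory kernel module is loaded.
from pathlib import Path


def peermem_loaded() -> bool:
    # /proc/modules lists one loaded module per line, module name first.
    loaded = {line.split()[0] for line in Path("/proc/modules").read_text().splitlines()}
    return bool(loaded & {"nvidia_peermem", "nv_peer_mem"})


if __name__ == "__main__":
    print("GPUDirect RDMA peer-memory module loaded:", peermem_loaded())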
Thank you very much