I have a Dell PowerEdge R750 server with an NVIDIA A100, a Quadro P1000, and a Mellanox ConnectX-5 dual-port 100GbE NIC installed, all connected to the same NUMA node (the second CPU). The two CPUs are Ice Lake Xeon Gold 6354s. I have an application that uses libibverbs from MLNX_OFED to read raw Ethernet packets from the Mellanox NIC directly into a userspace ring buffer in host memory for processing. In my testing I'm using CUDA 11.4 Update 1 and driver version 470.63.01.
I would like to move this processing to the A100 to take advantage of its much greater capabilities. I could obviously keep my current approach and just copy the data from host memory to the device, but at the data rates that I’m targeting (near 100-GbE line rate), I would prefer to copy the data directly from the NIC to device memory using GPUDirect RDMA. While this is not the typical MPI-type application where I’m using actual RDMA to read/write memory on another host or GPU, I believe that libibverbs should support writing received frames directly to GPU memory.
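For reference, the existing host-memory receive path looks roughly like the sketch below. This is only an illustration: the function names, the protection domain pd, and the slot layout are my own placeholders, and the raw-packet QP creation and flow-steering setup are omitted.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch (assumed names): allocate and register a host-memory ring buffer,
 * then post one slot of it to the raw-packet QP for receive. */
struct ibv_mr *register_host_ring(struct ibv_pd *pd, size_t ring_bytes, void **ring_out)
{
    void *ring = NULL;
    if (posix_memalign(&ring, 4096, ring_bytes))   /* page-aligned userspace ring */
        return NULL;

    /* Register the ring so the ConnectX-5 can DMA received frames into it */
    struct ibv_mr *mr = ibv_reg_mr(pd, ring, ring_bytes, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) {
        free(ring);
        return NULL;
    }
    *ring_out = ring;
    return mr;
}

int post_ring_slot(struct ibv_qp *qp, struct ibv_mr *mr,
                   void *ring, uint64_t slot, uint32_t slot_bytes)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)ring + slot * slot_bytes,
        .length = slot_bytes,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr  = { .wr_id = slot, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad = NULL;
    return ibv_post_recv(qp, &wr, &bad);
}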
I made the appropriate modifications to my application:
- I made sure to load the nvidia_peermem kernel module.
- I changed my call to ibv_reg_mr() (which registers a memory region for use with libibverbs) to pass a device memory pointer allocated using cudaMalloc(). This API call succeeds with no error. (A sketch of this change follows the list.)
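To make that concrete, the modified registration looks roughly like this (again just a sketch; pd, buf_bytes, and the function name are placeholders, and it assumes nvidia_peermem is already loaded):

#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <stdio.h>

/* Sketch: register a cudaMalloc()'d buffer with libibverbs so the NIC can,
 * in principle, DMA received frames straight into GPU memory (GPUDirect RDMA).
 * Requires the nvidia_peermem kernel module. */
struct ibv_mr *register_gpu_ring(struct ibv_pd *pd, size_t buf_bytes, void **dev_ptr_out)
{
    void *dev_ptr = NULL;
    cudaError_t cerr = cudaMalloc(&dev_ptr, buf_bytes);
    if (cerr != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(cerr));
        return NULL;
    }

    /* The device pointer is passed to ibv_reg_mr() exactly as host memory
     * would be; with nvidia_peermem loaded, the GPU pages are pinned through
     * the peer-memory interface and this call returns a valid MR. */
    struct ibv_mr *mr = ibv_reg_mr(pd, dev_ptr, buf_bytes, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        cudaFree(dev_ptr);
        return NULL;
    }
    *dev_ptr_out = dev_ptr;
    return mr;
}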
In fact, I never observe any API errors from libibverbs at all, which suggests that it believes the packets are being written to the device memory buffer as I expect. However, when I inspect the memory (both by invoking kernels that look at the values and by copying it back to the host using cudaMemcpy()), it does not appear that anything is being written to the target memory region at all. If I fill the device memory with a pattern before registering it with libibverbs, that same pattern remains intact throughout the run of my application, even though packets are ostensibly being written there.
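The check I'm describing amounts to something like the following (the 0xA5 fill value and the 4 KiB inspection window are arbitrary choices for illustration):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the device buffer no longer matches the fill pattern,
 * 0 if it is still untouched, -1 on error. */
int buffer_was_written(void *dev_ptr, size_t check_bytes)
{
    unsigned char *host = (unsigned char *)malloc(check_bytes);
    unsigned char *ref  = (unsigned char *)malloc(check_bytes);
    if (!host || !ref) { free(host); free(ref); return -1; }
    memset(ref, 0xA5, check_bytes);

    if (cudaMemcpy(host, dev_ptr, check_bytes, cudaMemcpyDeviceToHost) != cudaSuccess) {
        free(host); free(ref);
        return -1;
    }
    int changed = (memcmp(host, ref, check_bytes) != 0);
    free(host);
    free(ref);
    return changed;
}

/* Usage: cudaMemset(dev_ptr, 0xA5, buf_bytes) before registering and starting
 * traffic; afterwards, buffer_was_written(dev_ptr, 4096) still reports 0 in my
 * case, i.e. the fill pattern is intact. */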
I am left to assume that the current setup is not suitable for GPUDirect RDMA operation. Surprisingly, I've had a very hard time getting a clear story on what the requirements are. I'm aware that "being on the same NUMA node" is a necessary, but not sufficient, condition for it to work. When I run nvidia-smi topo -m, I get the following output:
        GPU0    GPU1    mlx5_0  mlx5_1  CPU Affinity    NUMA Affinity
GPU0     X      NODE    NODE    NODE    1,3,5,7,9,11    1
GPU1    NODE     X      NODE    NODE    1,3,5,7,9,11    1
mlx5_0  NODE    NODE     X      PIX
mlx5_1  NODE    NODE    PIX      X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
The connections between the two GPUs, and between the NIC and either of the GPUs, are all annotated as NODE. I haven't been able to find any clear indication of whether NODE connectivity is suitable for GPUDirect RDMA operation, or whether a more direct type of connection is required. It seems clear that PHB, PXB, or PIX should work, and that SYS should not, but I'm not sure whether NODE should.
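As an aside, the GPU-to-GPU relationship can also be queried programmatically through NVML, which I believe should report the corresponding NVML_TOPOLOGY_NODE level; a small sketch (GPU indices 0 and 1 assumed, linked with -lnvidia-ml):

#include <nvml.h>
#include <stdio.h>

/* Sketch: query the common-ancestor topology level between GPU 0 and GPU 1.
 * NVML_TOPOLOGY_NODE is the NVML counterpart of the NODE entry above. */
int main(void)
{
    if (nvmlInit_v2() != NVML_SUCCESS)
        return 1;

    nvmlDevice_t gpu0, gpu1;
    nvmlGpuTopologyLevel_t level;
    if (nvmlDeviceGetHandleByIndex_v2(0, &gpu0) == NVML_SUCCESS &&
        nvmlDeviceGetHandleByIndex_v2(1, &gpu1) == NVML_SUCCESS &&
        nvmlDeviceGetTopologyCommonAncestor(gpu0, gpu1, &level) == NVML_SUCCESS) {
        printf("GPU0 <-> GPU1 topology level: %d (NVML_TOPOLOGY_NODE = %d)\n",
               (int)level, (int)NVML_TOPOLOGY_NODE);
    }

    nvmlShutdown();
    return 0;
}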
I should note that the CUDA API indicates that the two GPUs, which have NODE connectivity between them, cannot access each other as peer devices, as indicated by the p2pBandwidthLatencyTest example:
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA A100-PCIE-40GB, pciBusID: ca, pciDeviceID: 0, pciDomainID:0
Device: 1, Quadro P1000, pciBusID: b2, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0
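The same result can be reproduced directly with the runtime API, independent of the sample:

#include <cuda_runtime.h>
#include <stdio.h>

/* Check GPU-to-GPU peer access in both directions, mirroring what
 * p2pBandwidthLatencyTest reports above. */
int main(void)
{
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);   /* device 0 (A100)  -> device 1 (P1000) */
    cudaDeviceCanAccessPeer(&can10, 1, 0);   /* device 1 (P1000) -> device 0 (A100)  */
    printf("Device 0 -> 1 peer access: %s\n", can01 ? "YES" : "NO");
    printf("Device 1 -> 0 peer access: %s\n", can10 ? "YES" : "NO");
    return 0;
}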
Is it possible to get more specific guidance on exactly what is required in the nvidia-smi topo -m output in order to support GPUDirect RDMA? I'm left wondering whether GPUDirect RDMA is even possible on this CPU platform, since it seems like the multiple PCIe host bridges are on the CPU package itself.