Clarification on requirements for GPUDirect RDMA

I have a Dell PowerEdge R750 server with an NVIDIA A100, a Quadro P1000, and a Mellanox ConnectX-5 dual-100GbE NIC installed, all connected to the same NUMA node (the second CPU). The two CPUs are Ice Lake Xeon Gold 6354s. I have an application that uses libibverbs from MLNX_OFED to read raw Ethernet packets from the Mellanox NIC directly into a userspace ring buffer in host memory for processing. I’m using CUDA 11.4 Update 1 and driver version 470.63.01 in my testing.

I would like to move this processing to the A100 to take advantage of its much greater capabilities. I could obviously keep my current approach and just copy the data from host memory to the device, but at the data rates that I’m targeting (near 100-GbE line rate), I would prefer to copy the data directly from the NIC to device memory using GPUDirect RDMA. While this is not the typical MPI-type application where I’m using actual RDMA to read/write memory on another host or GPU, I believe that libibverbs should support writing received frames directly to GPU memory.

I made the appropriate modifications to my application:

  • I made sure to load the nvidia_peermem kernel module.
  • I changed my call to ibv_reg_mr() (which registers a memory region for use with libibverbs) to pass a device memory pointer allocated using cudaMalloc(). This API call succeeds with no error.

In fact, I never observe any API errors from libibverbs at all, suggesting to me that it believes the packets are being written to the device memory buffer as I expect. However, by inspecting the memory (both by invoking kernels that look at the values and by copying it back to the host using cudaMemcpy()), it does not appear that anything is being written to the target memory region at all. If I fill the device memory with a pattern before registering it with libibverbs, that same pattern remains throughout the run of my application, even though packets are ostensibly being written there.

I am left to assume that the current setup is not suitable for GPUDirect RDMA operation. Surprisingly, I’ve had a very hard time finding a clear statement of what the requirements are. I’m aware that being on the same NUMA node is a necessary but not sufficient condition for it to work. When I run nvidia-smi topo -m, I get the following output:

	    GPU0	GPU1	mlx5_0	mlx5_1	CPU Affinity	NUMA Affinity
GPU0	 X 	NODE	NODE	NODE	1,3,5,7,9,11	1
GPU1	NODE	 X 	NODE	NODE	1,3,5,7,9,11	1
mlx5_0	NODE	NODE	 X 	PIX		
mlx5_1	NODE	NODE	PIX	 X 		

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

The connections between the two GPUs, and between the NIC and either GPU, are all annotated as NODE. I haven’t found any indication anywhere of whether NODE connectivity is suitable for GPUDirect operation, or whether a more direct type of connection is required. It seems clear that PHB, PXB, or PIX should work and that SYS should not, but I’m not sure about NODE.

I should note that the CUDA API indicates that the two GPUs, which have NODE connectivity between them, cannot access each other as peer devices, as indicated by the p2pBandwidthLatencyTest example:

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA A100-PCIE-40GB, pciBusID: ca, pciDeviceID: 0, pciDomainID:0
Device: 1, Quadro P1000, pciBusID: b2, pciDeviceID: 0, pciDomainID:0
Device=0 CANNOT Access Peer Device=1
Device=1 CANNOT Access Peer Device=0

Is it possible to get more specific guidance on exactly what is required in nvidia-smi topo -m output in order to support GPUDirect RDMA? I’m left wondering if it’s even possible to use on this CPU platform, as it seems like the multiple host bridges are on the CPU package itself.

In the past, I used a Dell PowerEdge R940xa server with an NVIDIA V100 and a Mellanox ConnectX-3 Pro for GPUDirect RDMA. Getting the two devices onto the same NUMA node was an effort in itself. I don’t remember whether nvidia-smi topo -m reported NODE or better (I definitely made sure it wasn’t SYS, though), but I think NODE should work. As a sanity check, you might try RDMA to system memory, restricting the process to a CPU with the proper affinity.

Thanks for the data point. My application does work properly with system memory; I can read the packets into userspace buffers as expected using libibverbs. In principle, it should “just work” once I replace the userspace buffers that I register with libibverbs with CUDA device memory.

As an aside, I also tried the common recommendation of running the test programs from the perftest repository. After building it with CUDA support, I ran the raw_ethernet_bw test across the two ports of my Mellanox NIC (client on one port, server on the other). The server instance was run with --use_cuda to tell it to write the received packets to CUDA device memory. This all seems to run without any errors, and the reported throughput is near line rate:

Listing all CUDA devices in system:
CUDA device 0: PCIe address is 17:00
CUDA device 1: PCIe address is B2:00

Picking device No. 0
[pid = 9006, dev = 0] device name = [NVIDIA A100-PCIE-40GB]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 18048 bytes GPU buffer
allocated GPU buffer address at 00007fb2c3200000 pointer=0x7fb2c3200000
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 9000       6929083          0.00               11891.50		   1.385460
 9000       6929159          0.00               11892.49		   1.385576

However, if I check the throughput on the GPU’s PCIe link while the raw_ethernet_bw test is in progress, I get:

[user@r750-dev perftest]$ nvidia-smi -q -i 0 | grep Through
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s

This suggests to me that the data is not being transferred to the GPU over PCIe. I see the same behavior with my own application; the libibverbs API reports no errors, and I see packet completions at the rates I would expect, but the data never seems to arrive in GPU memory.

Your setup (R750, A100 and CX-5 on same CPU socket) should generally be capable of GPUDirect RDMA.

Regarding your negative P2P results, GPUDirect P2P is generally not supported between GPUs of different architectures. This doesn’t have anything to do with the topology report from nvidia-smi.

@Robert_Crovella That’s good to know regarding the P2P transfers not being supported. Do you have any other tests that you would recommend to isolate the issue? As I described above, I am seeing similar behavior both in my application and the tests from OFED’s perftest repo.

I think the OFED perftest probably works correctly and delivers data to the GPU. My suggestion would be to study the code in that repository.

Should the throughput numbers reported by nvidia-smi include any data transferred via RDMA to the GPU?

I don’t know; I can’t explain that data point at the moment. This is an involved topic and not a trivial undertaking, and I may not have the time available in the near future to sort everything out. But I’ll state it again: I think it’s unlikely that the perftest you ran is broken, so I think the source code there is a roadmap to success.

@Robert_Crovella Actually, it looks like raw_ethernet_bw from perftest is broken in my case.

First, an aside: as a troubleshooting step, I had previously moved cards around in my system. Here is the current topology reported by nvidia-smi:

Wed Sep  1 13:14:15 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 470.63.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:17:00.0 Off |                    0 |
| N/A   32C    P0    36W / 250W |      4MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P1000        Off  | 00000000:B2:00.0  On |                  N/A |
| 34%   40C    P8    N/A /  N/A |    132MiB /  4040MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2379      G   /usr/libexec/Xorg                   4MiB |
|    1   N/A  N/A      2379      G   /usr/libexec/Xorg                  93MiB |
|    1   N/A  N/A      2545      G   /usr/bin/gnome-shell               36MiB |
+-----------------------------------------------------------------------------+

	GPU0	GPU1	mlx5_0	mlx5_1	mlx5_2	mlx5_3	CPU Affinity	NUMA Affinity
GPU0	 X 	SYS	NODE	NODE	SYS	SYS	0,2,4,6,8,10	0
GPU1	SYS	 X 	SYS	SYS	PHB	PHB	1,3,5,7,9,11	1
mlx5_0	NODE	SYS	 X 	PIX	SYS	SYS		
mlx5_1	NODE	SYS	PIX	 X 	SYS	SYS		
mlx5_2	SYS	PHB	SYS	SYS	 X 	PIX		
mlx5_3	SYS	PHB	SYS	SYS	PIX	 X 		

(Legend omitted; identical to the one shown above.)

I made the following patch to the current master of that repository:

diff --git a/src/perftest_resources.c b/src/perftest_resources.c
index 4125c61..34a3aa7 100755
--- a/src/perftest_resources.c
+++ b/src/perftest_resources.c
@@ -1315,16 +1315,24 @@ int create_single_mr(struct pingpong_context *ctx, struct perftest_parameters *u
                ctx->is_contig_supported = FAILURE;
 
                printf("cuMemAlloc() of a %zd bytes GPU buffer\n",
                       ctx->buff_size);
                error = cuMemAlloc(&d_A, size);
                if (error != CUDA_SUCCESS) {
                        printf("cuMemAlloc error=%d\n", error);
                        return FAILURE;
                }
 
+        cuMemsetD32(d_A, 0xdeadbeef, size / 4);
+
                printf("allocated GPU buffer address at %016llx pointer=%p\n",
                       d_A, (void *)d_A);
                ctx->buf[qp_index] = (void *)d_A;
+
+        uint8_t host_buf[size];
+        cuMemcpyDtoH(host_buf, d_A, size);
+        printf("filled GPU buffer with: ");
+        for (int i = 0; i < 32; ++i) printf("%02x ", host_buf[i]);
+        printf("\n");
        } else
        #endif
 
@@ -3513,6 +3521,15 @@ int run_iter_bw_infinitely_server(struct pingpong_context *ctx, struct perftest_
 
                if (ne > 0) {
 
+#ifdef HAVE_CUDA
+                    size_t size = wc[0].byte_len;
+                    uint8_t host_buf[size];
+                    cuMemcpyDtoH(host_buf, (CUdeviceptr) ctx->buf, size);
+                    printf("received packet with: ");
+                    for (int i = 0; i < 32; ++i) printf("%02x ", host_buf[i]);
+                    printf("\n");
+#endif
+
                        for (i = 0; i < ne; i++) {
 
                                if (wc[i].status != IBV_WC_SUCCESS) {

This is a quick hack just intended to show the issue at hand; the above changes are not high-quality code. All I’m doing is filling the CUDA memory that was allocated for receiving packets with the repeating 32-bit pattern 0xdeadbeef before raw_ethernet_bw tries to receive packets into it. If data were being written there properly via RDMA, I would expect that repeating pattern to be overwritten.

Likewise, the second block of code is inserted in the completion queue polling loop: when a completion is received, I copy the contents of the first packet in the buffer back to the host and print the first 32 bytes. I’m running the client/server as follows:

./raw_ethernet_bw --ib-dev=mlx5_3 --source_mac b8:ce:f6:16:96:17 --dest_mac b8:ce:f6:16:a9:4f --dest_ip 10.1.1.1 --source_ip 10.1.2.1 --dest_port 10000 --source_port 10000 --client --run_infinitely --mtu 9000
./raw_ethernet_bw --ib-dev=mlx5_1 --source_mac b8:ce:f6:16:a9:4f --dest_mac b8:ce:f6:16:96:17 --source_ip 10.1.2.1 --dest_ip 10.1.1.1 --dest_port 10000 --source_port 10000 --server --run_infinitely --mtu 9000 --use_cuda=0

Note that the receiving process (the server) is using mlx5_1, which is reported as having NODE connectivity to GPU 0, which is the A100. The output of that process is:

 Max msg size in RawEth is MTU 9000
 Changing msg size to this MTU
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF		Device         : mlx5_3
 Number of qps   : 1		Transport type : IB
 Connection type : RawEth		Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : OFF
 RX depth        : 512
 CQ Moderation   : 1
 Mtu             : 9000[B]
 Link type       : Ethernet
 GID index       : 0
 Max inline data : 0[B]
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
MAC attached  : B8:CE:F6:16:96:17
spec_info - dst_ip   : 10.1.2.1
spec_info - src_ip   : 10.1.1.1
spec_info - dst_port : 10000
spec_info - src_port : 10000
initializing CUDA
 Max msg size in RawEth is MTU 9000
 Changing msg size to this MTU
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 17:00
CUDA device 1: PCIe address is B2:00

Picking device No. 0
[pid = 8573, dev = 0] device name = [NVIDIA A100-PCIE-40GB]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 18048 bytes GPU buffer
allocated GPU buffer address at 00007fe9d1200000 pointer=0x7fe9d1200000
filled GPU buffer with: ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de 
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
received packet with: ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de 
received packet with: ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de 
received packet with: ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de 
received packet with: ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de 
received packet with: ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de 
received packet with: ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de ef be ad de 

As you can see, the contents of the device memory do not change, suggesting to me that the RDMA memory access is not working on this system. This is consistent with the PCIe throughput being shown as zero for that GPU, and consistent with the behavior of my application, which I am 95% sure is using the API correctly.

One more detail: I see the same behavior as described above (the RDMA writes seemingly having no effect) if I try receiving packets on mlx5_3 and RDMA-ing them to GPU 1, despite their connectivity being shown as PHB. However, the Quadro P1000 is not explicitly listed anywhere as supporting GPUDirect RDMA, so it’s possible that this is just a limitation of that device.