Having issues getting host gpu to host gpu RDMA to work

Andrew.Lucas · July 15, 2019, 6:45pm

I’m having an issue getting a host to host GPU to GPU RDMA working correctly.
I’m set up at the ibverbs/cmverbs level.
My RDMA transaction starts with the client posting an IBV_WR_SEND that carries its buffer information; the server then uses that information to do an IBV_WR_RDMA_WRITE_WITH_IMM of a much larger buffer (4kx4kx2) back to the client.
The GPUs are just Quadros (K600) and the HCAs are VPI ConnectX-5s running Eth. As nearly as I can tell, the hardware should support this idea.
I can get a transfer to occur from the server host memory to the client gpu memory, but I cannot get either a gpu-gpu transfer to occur, or a transfer from server gpu memory to a client’s host memory to occur.
When transferring from the server GPU memory, the IBV_WR_RDMA_WRITE_WITH_IMM completes on the server side with a wc failure: local protection fault.
So basically I get that error any time I try to do a transfer from the server GPU to a client, but not for server host memory to a client’s GPU memory. The layout is the same for the server host memory as it is for the server GPU memory (one is just allocated with cudaMalloc).
Software versions should all be up to date: RHEL 7.6, nv_peer_memory_1.0-8, cuda 10.1, OFED 4.6-1.09.1.1
Is there some configuration/setup item I’m missing when sourcing from a GPU memory vs the host memory? I’m just using cudaMalloc and an ibv_reg_mr call for the GPU version and posix_memalign and ibv_reg_mr for the host memory version.
Will this configuration work GPU-GPU? And if not, why would host-GPU work?
Any suggestions?

host->host ok
host->gpu ok
gpu->host fails
gpu->gpu fails
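
For reference, the handshake described above (the client's IBV_WR_SEND carrying buffer information that the server then targets with IBV_WR_RDMA_WRITE_WITH_IMM) could be sketched as a small fixed-layout message. This is only an illustration: the struct name, field names, 16-byte layout, and big-endian encoding are assumptions and are not taken from the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "here is my buffer" message sent by the client.
 * Names and layout are illustrative, not the poster's actual format. */
struct buf_info {
    uint64_t addr;    /* virtual address of the client's registered buffer */
    uint32_t rkey;    /* rkey the client got back from ibv_reg_mr() */
    uint32_t length;  /* buffer length in bytes */
};

enum { BUF_INFO_WIRE_SIZE = 16 };

/* Store the low nbytes of v at p in big-endian order. */
static void put_be(uint8_t *p, uint64_t v, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        p[i] = (uint8_t)(v >> (8 * (nbytes - 1 - i)));
}

/* Read nbytes from p as a big-endian unsigned integer. */
static uint64_t get_be(const uint8_t *p, size_t nbytes)
{
    uint64_t v = 0;
    for (size_t i = 0; i < nbytes; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Client side: serialize the buffer description into the SEND payload. */
void buf_info_pack(const struct buf_info *in, uint8_t out[BUF_INFO_WIRE_SIZE])
{
    assert(in != NULL && out != NULL);
    put_be(out + 0, in->addr, 8);
    put_be(out + 8, in->rkey, 4);
    put_be(out + 12, in->length, 4);
}

/* Server side: recover the description from the received payload. */
void buf_info_unpack(const uint8_t in[BUF_INFO_WIRE_SIZE], struct buf_info *out)
{
    assert(in != NULL && out != NULL);
    out->addr = get_be(in + 0, 8);
    out->rkey = (uint32_t)get_be(in + 8, 4);
    out->length = (uint32_t)get_be(in + 12, 4);
}
```

On the server, the unpacked addr and rkey would go into wr.wr.rdma.remote_addr and wr.wr.rdma.rkey of the ibv_send_wr. The source of the write is described separately by the sge (addr, length, lkey), and that local side is what a local protection fault refers to.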

Thanks
Andy

Andrew.Lucas · July 17, 2019, 4:09pm

I cross-posted this question to the Mellanox site and got a suggestion to look at the perftest suite (GitHub: linux-rdma/perftest, Infiniband Verbs Performance Tests), which has RDMA code with CUDA support.

Although that was a very useful suggestion (see perftest_resources.c in that package, function pp_init_gpu at around line 62), performing the memory allocations in that manner doesn’t seem to correct my problem (RDMA local protection fault). I’ve checked the WR pointers and sizes and they appear to be correct. Does GPUDirect just not support an IBV_WR_RDMA_WRITE_WITH_IMM transaction? The perftest only appears to test IBV_WR_RDMA_WRITE and IBV_WR_RDMA_READ.

I also question if I should be using cudaMalloc or the cuMemAlloc that the perftest is using? I thought we weren’t supposed to mix those libraries?

Andrew.Lucas · July 17, 2019, 10:27pm

With the latest perftest tool, I’m seeing a similar error (failed status 4, which is the local protection fault) from ib_write_bw when I use the --use_cuda option (with or without -R).

server:
./ib_write_bw -d mlx5_0 -i 1 -F --report_gbits -R --use_cuda
client:
./ib_write_bw -d mlx5_0 -i 1 -F --report_gbits 15.15.15.5 -R --use_cuda

It works without --use_cuda.
The ib_read_bw fails in the same manner.
Any hints/ideas?
The output from cudaDeviceCanAccessPeer is 0.
If going from host to host (different computers), does that matter?
The GPUs are just K600s, compute level 3.0.
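
One quick sanity check on the buffers involved is their size and alignment relative to the 64 KiB GPU page granularity that NVIDIA's GPUDirect RDMA documentation describes for pinning GPU memory. The sketch below is only arithmetic. It assumes "4kx4kx2" from the original question means 4096 x 4096 elements of 2 bytes each, and the macro and function names are local to this example. It does not by itself explain the protection fault.

```c
#include <assert.h>
#include <stdint.h>

/* 64 KiB GPU page granularity (per NVIDIA's GPUDirect RDMA docs).
 * These macros are defined locally for the sketch. */
#define GPU_PAGE_SHIFT 16
#define GPU_PAGE_SIZE  ((uint64_t)1 << GPU_PAGE_SHIFT)

/* Returns 1 if addr sits on a 64 KiB GPU page boundary, else 0. */
int gpu_page_aligned(uint64_t addr)
{
    return (addr & (GPU_PAGE_SIZE - 1)) == 0;
}

/* Number of 64 KiB GPU pages touched by the range [addr, addr + len). */
uint64_t gpu_pages_spanned(uint64_t addr, uint64_t len)
{
    assert(len > 0);
    uint64_t first = addr >> GPU_PAGE_SHIFT;
    uint64_t last = (addr + len - 1) >> GPU_PAGE_SHIFT;
    return last - first + 1;
}

/* Assumption: "4kx4kx2" is 4096 x 4096 elements, 2 bytes per element. */
uint64_t frame_bytes(void)
{
    return 4096ULL * 4096ULL * 2ULL;
}
```

Under that reading the large buffer is 33,554,432 bytes (32 MiB), or 512 GPU pages when it starts on a page boundary. The perftest buffer in this post (131072 bytes at 0xb00a00000) is page aligned and spans 2 GPU pages.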

perftest output:

mlx5: D2701: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 10000137 000097d2
Completion with error at client
Failed status 11: wr_id 0 syndrom 0x89
scnt=128, ccnt=0
Failed to complete run_iter_bw function successfully
initializing CUDA
There is 1 device supporting CUDA
[pid = 5441, dev = 0] device name = [Quadro K600]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 0000000b00a00000 pointer=0xb00a00000

RDMA_Read BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : Ethernet
GID index : 2
Outstand reads : 16
rdma_cm QPs : ON
Data ex. method : rdma_cm

local address: LID 0000 QPN 0x0137 PSN 0x876fba
GID: 00:00:00:00:00:00:00:00:00:00:255:255:15:15:15:07
remote address: LID 0000 QPN 0x00b2 PSN 0xacda4
GID: 00:00:00:00:00:00:00:00:00:00:255:255:15:15:15:05

#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
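
The numeric statuses in this thread (status 4 mentioned in the post, status 11 in the log) can be mapped to names with a small lookup. The table below mirrors the order of enum ibv_wc_status in <infiniband/verbs.h>, copied by hand so the sketch builds without libibverbs; in real verbs code, ibv_wc_status_str() from libibverbs is the better choice. The function name wc_status_name is made up for this example.

```c
#include <stddef.h>

/* Names in the order of enum ibv_wc_status (values 0..21). */
static const char *const wc_status_names[] = {
    "IBV_WC_SUCCESS",            /*  0 */
    "IBV_WC_LOC_LEN_ERR",        /*  1 */
    "IBV_WC_LOC_QP_OP_ERR",      /*  2 */
    "IBV_WC_LOC_EEC_OP_ERR",     /*  3 */
    "IBV_WC_LOC_PROT_ERR",       /*  4 */
    "IBV_WC_WR_FLUSH_ERR",       /*  5 */
    "IBV_WC_MW_BIND_ERR",        /*  6 */
    "IBV_WC_BAD_RESP_ERR",       /*  7 */
    "IBV_WC_LOC_ACCESS_ERR",     /*  8 */
    "IBV_WC_REM_INV_REQ_ERR",    /*  9 */
    "IBV_WC_REM_ACCESS_ERR",     /* 10 */
    "IBV_WC_REM_OP_ERR",         /* 11 */
    "IBV_WC_RETRY_EXC_ERR",      /* 12 */
    "IBV_WC_RNR_RETRY_EXC_ERR",  /* 13 */
    "IBV_WC_LOC_RDD_VIOL_ERR",   /* 14 */
    "IBV_WC_REM_INV_RD_REQ_ERR", /* 15 */
    "IBV_WC_REM_ABORT_ERR",      /* 16 */
    "IBV_WC_INV_EECN_ERR",       /* 17 */
    "IBV_WC_INV_EEC_STATE_ERR",  /* 18 */
    "IBV_WC_FATAL_ERR",          /* 19 */
    "IBV_WC_RESP_TIMEOUT_ERR",   /* 20 */
    "IBV_WC_GENERAL_ERR",        /* 21 */
};

/* Map a numeric completion status to its enum name, or "unknown". */
const char *wc_status_name(int status)
{
    size_t n = sizeof(wc_status_names) / sizeof(wc_status_names[0]);
    if (status < 0 || (size_t)status >= n)
        return "unknown";
    return wc_status_names[status];
}
```

Status 4 is IBV_WC_LOC_PROT_ERR, the local protection fault described in the posts. Status 11 in the client log is IBV_WC_REM_OP_ERR (remote operation error), which may be consistent with the peer hitting a fault while servicing the request, though the log alone does not confirm that.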
