Having issues getting host gpu to host gpu RDMA to work

I’m having trouble getting host-to-host, GPU-to-GPU RDMA working correctly.
I’m set up at the ibverbs/cmverbs level.
My RDMA transaction starts with the client sending an IBV_WR_SEND request containing the buffer information the server needs to do an IBV_WR_RDMA_WRITE_WITH_IMM back of a much larger buffer (4kx4kx2).
The GPUs are just Quadros (K600) and the HCAs are VPI ConnectX-5s running Eth. As nearly as I can tell, the hardware should support this idea.
I can get a transfer to occur from the server host memory to the client gpu memory, but I cannot get either a gpu-gpu transfer to occur, or a transfer from server gpu memory to a client’s host memory to occur.
When transferring from the server gpu memory, the response I see to the IBV_WR_RDMA_WRITE_WITH_IMM is a wc failure on the server side: a local protection fault.
So basically I get that error anytime I try to do a transfer from the server gpu to a client, but not for server host memory to a client’s gpu memory. The layout is the same for the server host memory as it is for the server gpu memory (one just uses the cudaMalloc).
Software versions should all be up to date: RHEL 7.6, nv_peer_memory_1.0-8, cuda 10.1, OFED 4.6-1.09.1.1
Is there some configuration/setup item I’m missing when sourcing from a GPU memory vs the host memory? I’m just using cudaMalloc and an ibv_reg_mr call for the GPU version and posix_memalign and ibv_reg_mr for the host memory version.
Will this configuration work GPU-GPU? And if not, why would host-GPU work?
Any suggestions?

host->host ok
host->gpu ok
gpu->host fails
gpu->gpu fails
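
For reference, the "buffer information" message that the client sends before the server's write can be sketched as below. This is only an illustration: the struct name, its fields, the 16-byte big-endian wire format and the helper names are all hypothetical, and FRAME_BYTES assumes that "4kx4kx2" means 4096 x 4096 x 2 bytes.

```c
#include <stdint.h>

/* Hypothetical layout of the buffer-information message. The field names
 * and the 16-byte wire format are assumptions made for illustration. */
struct buf_info {
    uint64_t addr;   /* virtual address of the registered target buffer */
    uint32_t rkey;   /* remote key of the target memory region */
    uint32_t length; /* number of bytes the server should write */
};

enum { BUF_INFO_WIRE_SIZE = 16 };

/* Assumes "4kx4kx2" means 4096 * 4096 * 2 bytes. */
enum { FRAME_BYTES = 4096 * 4096 * 2 };

/* Store a 32-bit value in big-endian (network) byte order. */
static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Load a 32-bit big-endian value. */
static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Serialize the message so both hosts agree on the byte layout. */
void buf_info_pack(const struct buf_info *in, uint8_t out[BUF_INFO_WIRE_SIZE])
{
    put_be32(out, (uint32_t)(in->addr >> 32));
    put_be32(out + 4, (uint32_t)in->addr);
    put_be32(out + 8, in->rkey);
    put_be32(out + 12, in->length);
}

/* Parse the message on the receiving side. */
void buf_info_unpack(const uint8_t in[BUF_INFO_WIRE_SIZE], struct buf_info *out)
{
    out->addr = ((uint64_t)get_be32(in) << 32) | get_be32(in + 4);
    out->rkey = get_be32(in + 8);
    out->length = get_be32(in + 12);
}
```

On the server, the unpacked addr and rkey would go into wr.wr.rdma.remote_addr and wr.wr.rdma.rkey of the ibv_send_wr used for the IBV_WR_RDMA_WRITE_WITH_IMM. The local protection fault concerns the server's own sge (addr, length, lkey), so that is the side to compare against the registered GPU region.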

Thanks
Andy

I cross-posted this question to the Mellanox site and had a suggestion to look at the perftest suite - https://github.com/linux-rdma/perftest.git - which has RDMA code with CUDA support.

Although that was a very useful suggestion (see perftest_resources.c in that package, function pp_init_gpu at ~L62), performing the memory allocations in that manner doesn’t seem to correct my problem (RDMA local protection fault). I’ve checked the WR pointers and sizes and they appear to be correct. Does GPUDirect just not support an IBV_WR_RDMA_WRITE_WITH_IMM transaction? The perftest only appears to test IBV_WR_RDMA_WRITE and IBV_WR_RDMA_READ.

I also wonder whether I should be using cudaMalloc or the cuMemAlloc that perftest uses. I thought we weren’t supposed to mix those libraries?

With the latest perftest tool, I’m seeing a similar error (failed status 4, which is the local protection fault) with ib_write_bw when I use the --use_cuda option (with or without -R).

server:
./ib_write_bw -d mlx5_0 -i 1 -F --report_gbits -R --use_cuda
client:
./ib_write_bw -d mlx5_0 -i 1 -F --report_gbits 15.15.15.5 -R --use_cuda

It works without the "--use_cuda".
The ib_read_bw fails in the same manner.
Any hints/ideas?
The output from cudaDeviceCanAccessPeer is 0.
If going from host to host (different computers), does that matter?
The GPUs are just K600s, compute capability 3.0.
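
To make the numeric status codes easier to read (the "failed status 4" above, and the "Failed status 11" in the perftest output), here is a small lookup sketch. The table mirrors the first entries of enum ibv_wc_status from <infiniband/verbs.h> as I understand them; wc_status_name is a hypothetical helper written so that the snippet builds without libibverbs.

```c
#include <stddef.h>

/* First 14 values of enum ibv_wc_status, indexed by numeric value.
 * This is a hand-copied table (an assumption that it matches the
 * installed verbs.h); later values are not listed here. */
static const char *const wc_status_names[] = {
    "IBV_WC_SUCCESS",           /* 0 */
    "IBV_WC_LOC_LEN_ERR",       /* 1 */
    "IBV_WC_LOC_QP_OP_ERR",     /* 2 */
    "IBV_WC_LOC_EEC_OP_ERR",    /* 3 */
    "IBV_WC_LOC_PROT_ERR",      /* 4: local protection fault */
    "IBV_WC_WR_FLUSH_ERR",      /* 5 */
    "IBV_WC_MW_BIND_ERR",       /* 6 */
    "IBV_WC_BAD_RESP_ERR",      /* 7 */
    "IBV_WC_LOC_ACCESS_ERR",    /* 8 */
    "IBV_WC_REM_INV_REQ_ERR",   /* 9 */
    "IBV_WC_REM_ACCESS_ERR",    /* 10 */
    "IBV_WC_REM_OP_ERR",        /* 11 */
    "IBV_WC_RETRY_EXC_ERR",     /* 12 */
    "IBV_WC_RNR_RETRY_EXC_ERR", /* 13 */
};

/* Map a numeric work-completion status to its enum name.
 * Values outside the table return "unknown". */
const char *wc_status_name(int status)
{
    size_t n = sizeof(wc_status_names) / sizeof(wc_status_names[0]);
    if (status < 0 || (size_t)status >= n)
        return "unknown";
    return wc_status_names[status];
}
```

In a program that links against libibverbs, ibv_wc_status_str(wc.status) gives a readable string directly, so a table like this is only useful for reading logs offline.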

perftest output:

mlx5: D2701: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008914 10000137 000097d2
Completion with error at client
Failed status 11: wr_id 0 syndrom 0x89
scnt=128, ccnt=0
Failed to complete run_iter_bw function successfully
initializing CUDA
There is 1 device supporting CUDA
[pid = 5441, dev = 0] device name = [Quadro K600]
creating CUDA Ctx
making it the current CUDA Ctx
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 0000000b00a00000 pointer=0xb00a00000

RDMA_Read BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : Ethernet
GID index : 2
Outstand reads : 16
rdma_cm QPs : ON
Data ex. method : rdma_cm

local address: LID 0000 QPN 0x0137 PSN 0x876fba
GID: 00:00:00:00:00:00:00:00:00:00:255:255:15:15:15:07
remote address: LID 0000 QPN 0x00b2 PSN 0xacda4
GID: 00:00:00:00:00:00:00:00:00:00:255:255:15:15:15:05

#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
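
The hex dump printed after "got completion with error" can be picked apart with a sketch like the one below. The field positions are an assumption based on my reading of struct mlx5_err_cqe in the rdma-core mlx5 provider (vendor syndrome and syndrome in the second word of the last row, opcode and QPN in the third), and the struct and function names are hypothetical. As a sanity check, the values it extracts from this dump agree with "syndrom 0x89" and with the local QPN 0x0137 reported in the same output.

```c
#include <stdint.h>

/* Fields of interest from a 64-byte mlx5 error CQE.
 * Names and layout are assumptions, see the note above. */
struct err_cqe_fields {
    uint8_t vendor_err_synd; /* vendor-specific syndrome */
    uint8_t syndrome;        /* generic error syndrome */
    uint8_t opcode;          /* opcode of the failed send WQE */
    uint32_t qpn;            /* 24-bit QP number */
};

/* w[] holds the 16 words of the dump in printed order:
 * four rows of four 32-bit values. */
struct err_cqe_fields decode_err_cqe(const uint32_t w[16])
{
    struct err_cqe_fields f;
    f.vendor_err_synd = (uint8_t)(w[13] >> 8);
    f.syndrome = (uint8_t)w[13];
    f.opcode = (uint8_t)(w[14] >> 24);
    f.qpn = w[14] & 0x00ffffffu;
    return f;
}
```

Feeding in the last row of the dump (00000000 00008914 10000137 000097d2) yields vendor syndrome 0x89, syndrome 0x14, opcode 0x10 and QPN 0x137. If I read the mlx5 headers correctly, syndrome 0x14 is the remote operation error, which would be consistent with "Failed status 11" on the client.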