GPUDirect seemingly failing along PIX routes but not SYS

RHEL 7.9
CUDA 11.3
NVIDIA driver version 465.19.01

NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB]
Mellanox Technologies MT27700 Family [ConnectX-4]

I’m testing server-to-server GPUDirect RDMA transfers. The data goes from CPU memory on one server to a cudaMalloc()'d GPU pointer on a V100 in another server. The transfer uses InfiniBand verbs through a Mellanox adapter on the receiving server.

Here’s the topology:

After our blocking RDMA read completes, I launch a kernel that computes a checksum to validate that the data has arrived. For cases where we have a “SYS” connection between the GPU and mlx_5* (mlx_52+GPU0, mlx_52+GPU1, mlx_50+GPU2, mlx_50+GPU3), this check succeeds and we know the data has arrived. For cases where we have a “PIX” connection (which should be the ideal case), no data appears to arrive, even though I’m not observing any errors at any level. This doesn’t make sense to me, since “PIX” should indicate that the NIC and GPU have a very direct path to each other.
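
For anyone who wants to reproduce the check, here is a minimal host-side sketch of this kind of validation. It is not our actual kernel: the additive 32-bit checksum, the sentinel value, and all function names are assumptions made for illustration. The idea is to pre-fill the destination with a known sentinel before the transfer, so that "nothing was written" can be told apart from "something was written but it is wrong".

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sentinel byte written to the destination before the transfer. */
#define SENTINEL_BYTE 0xA5

/* Simple additive 32-bit checksum (an assumption; a real check may differ). */
static uint32_t checksum32(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; ++i)
        sum += buf[i];
    return sum;
}

/* Returns 1 if every byte still holds the sentinel, i.e. nothing was written. */
static int buffer_untouched(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        if (buf[i] != SENTINEL_BYTE)
            return 0;
    return 1;
}

enum transfer_result { TRANSFER_OK, TRANSFER_NO_DATA, TRANSFER_CORRUPT };

/* Classifies a received buffer against the checksum computed by the sender. */
static enum transfer_result classify_transfer(const uint8_t *dst, size_t len,
                                              uint32_t expected)
{
    if (buffer_untouched(dst, len))
        return TRANSFER_NO_DATA;
    return checksum32(dst, len) == expected ? TRANSFER_OK : TRANSFER_CORRUPT;
}
```

On the GPU side, the same logic would run on a copy of the buffer brought back with cudaMemcpy(), or inside the kernel itself. Note that a payload consisting entirely of the sentinel byte would be misreported as TRANSFER_NO_DATA, so the test pattern should avoid that.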

Any insight or ideas would be much appreciated. Thanks.

Bump – still actively scratching our collective heads.