Error using GPUDirect and ConnectX-5

I am running libfabric 1.12.1 and CUDA 11.2, with the latest nv_peer_mem driver and MOFED 5.2-2.2.0.0. Under the hood, libfabric calls into libibverbs.

When trying to send using a buffer from GPU memory, I get the following error:

mlx5: host_unknown: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 04005104 0a0002fe 00028ad2

Libfabric reports this as a protection error. However, the memory is properly registered through nv_peer_mem, and I have checked that I am using the correct memory address and local key. Is there any way to interpret this dump and get to the bottom of this error?
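For reference, the dump can be read as a raw 64-byte error completion queue entry (CQE). The following is a minimal sketch, not an official decoder. It assumes the field layout of struct mlx5_err_cqe from the rdma-core mlx5 provider and that each printed word is a big-endian 32-bit value. The function name decode_err_cqe and the syndrome descriptions are illustrative, not part of any library.

```python
import struct

# The sixteen hex words from the error message above.
DUMP = """
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 04005104 0a0002fe 00028ad2
"""

# Syndrome codes as assumed from rdma-core's MLX5_CQE_SYNDROME_* values.
SYNDROMES = {
    0x01: "local length error",
    0x02: "local QP operation error",
    0x04: "local protection error",
    0x05: "work request flushed",
    0x06: "memory window bind error",
    0x10: "bad response",
    0x11: "local access error",
    0x12: "remote invalid request",
    0x13: "remote access error",
    0x14: "remote operation error",
    0x15: "transport retry counter exceeded",
    0x16: "RNR retry counter exceeded",
    0x22: "remote aborted",
}


def decode_err_cqe(dump: str) -> dict:
    """Decode a 64-byte mlx5 error CQE printed as sixteen 8-digit hex words."""
    words = dump.split()
    if len(words) != 16 or any(len(w) != 8 for w in words):
        raise ValueError("expected sixteen 8-digit hex words")
    raw = b"".join(bytes.fromhex(w) for w in words)

    # Assumed layout: 32 reserved bytes, srqn (4), 18 reserved bytes,
    # vendor_err_synd (1), syndrome (1), s_wqe_opcode_qpn (4),
    # wqe_counter (2), signature (1), op_own (1).
    (srqn, vendor_synd, syndrome, opcode_qpn,
     wqe_counter, signature, op_own) = struct.unpack(">32xI18xBBIHBB", raw)

    return {
        "srqn": srqn,
        "vendor_err_synd": vendor_synd,
        "syndrome": syndrome,
        "syndrome_text": SYNDROMES.get(syndrome, "unknown"),
        "wqe_opcode": opcode_qpn >> 24,   # opcode of the failed work request
        "qpn": opcode_qpn & 0xFFFFFF,     # queue pair number
        "wqe_counter": wqe_counter,
        "signature": signature,
        "cqe_opcode": op_own >> 4,        # upper nibble of the last byte
    }


if __name__ == "__main__":
    for key, value in decode_err_cqe(DUMP).items():
        print(f"{key}: {value if isinstance(value, str) else hex(value)}")
```

Under these assumptions the dump decodes to syndrome 0x04 (local protection error), vendor syndrome 0x51, and QP number 0x2fe, which is consistent with what libfabric reports. The vendor syndrome byte is vendor-specific, so its meaning would have to come from the support team.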

In general, this looks like a software issue rather than a hardware one. If the issue reproduces when using only Mellanox/NVIDIA components, it requires opening a ticket with the support team, provided a valid support contract exists. In the meantime, here are a few recommendations:

-Use the latest MOFED, v5.3.

-If using an AMD platform, validate that the iommu=pt parameter is set in the kernel/GRUB configuration (cat /proc/cmdline).

-If using MPI, try to reproduce the issue with HPC-X. libfabric is outside the support scope.

-Try to reproduce the issue with the perftest package (linux-rdma/perftest on GitHub, InfiniBand Verbs Performance Tests). It has to be recompiled with CUDA support; check the perftest documentation.
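The iommu=pt check from the list above can also be scripted. This is a minimal sketch, assuming a Linux host where the booted kernel parameters are exposed in /proc/cmdline; the helper name has_iommu_passthrough is hypothetical.

```python
from pathlib import Path


def has_iommu_passthrough(cmdline: str) -> bool:
    """Return True if the kernel command line contains the exact token iommu=pt."""
    # Split on whitespace so that amd_iommu=... or intel_iommu=... do not match.
    return "iommu=pt" in cmdline.split()


if __name__ == "__main__":
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError as exc:
        print(f"could not read /proc/cmdline: {exc}")
    else:
        print("iommu=pt set:", has_iommu_passthrough(cmdline))
```

The check only reflects the parameters of the currently running kernel, so a change made in the GRUB configuration shows up only after a reboot.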

Thank you