MLX Completion with error

Good morning,

First of all the error:

mlx5: node47-031.cm.cluster: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
0000001e 00000000 00000000 00000000
00000000 00008813 120101af 0000e3d2

Unfortunately, the interesting part (the vendor syndrome) is not documented, as far as I can see. Knowing what it means would really help me.
The error appears to coincide with the usage of atomic operations.

Now the scenario:
I am currently developing software for my thesis and ran into an issue related to InfiniBand/Mellanox hardware.
I am using a library (GPI2) that builds on verbs and can run over either Ethernet or InfiniBand. Running over Ethernet works fine and shows no issues whatsoever, even with larger node and thread counts.
Once I started testing on InfiniBand, things hit the fan.

The software uses a lot of threads per node, issuing requests in parallel. The traffic consists of atomic operations, a lot of small messages, and fewer but still frequent large messages (~10 KiB).

With a small number of nodes the software still runs fine over InfiniBand, but the issues escalate quickly, leading to a roughly 50% error rate on 16 nodes with 40 threads each.

The cluster I am testing on has the following specs:

ibv_devinfo reports:

ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         12.26.1040
        node_guid:                      506b:4b03:00c7:5e3c
        sys_image_guid:                 506b:4b03:00c7:5e3c
        vendor_id:                      0x02c9
        vendor_part_id:                 4115
        hw_ver:                         0x0
        board_id:                       MT_2180110032
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 15
                        port_lid:               255
                        port_lmc:               0x00
                        link_layer:             InfiniBand

If I can provide more information, feel free to ask.
Unfortunately, I do not have a minimal example that reproduces the error.

I am at my wits' end, as my supervisor's and my own experience with InfiniBand is rather limited. Any help is greatly appreciated!

Hi @uk077035 ,

Please refer to IBV_WC_REM_ACCESS_ERR (10) in https://www.rdmamojo.com/2013/02/15/ibv_poll_cq/ :

IBV_WC_REM_ACCESS_ERR (10) - Remote Access Error: a protection error occurred on a remote data buffer to be read by an RDMA Read, written by an RDMA Write or accessed by an atomic operation. This error is reported only on RDMA operations or atomic operations. Relevant for RC QPs.
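
For reference, the dump in the first post can be decoded by hand. Below is a minimal sketch in C, assuming the 64-byte error CQE layout of `struct mlx5_err_cqe` in rdma-core's `mlx5dv.h` (vendor syndrome at byte 54, syndrome at byte 55, followed by the opcode/QPN word) and assuming the dump prints the CQE as big-endian 32-bit words. The type `err_cqe_fields` and the function names are hypothetical helpers, not part of any library; please verify the offsets and the numeric codes against the rdma-core version installed on your cluster.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Fields of interest in an mlx5 error CQE (hypothetical helper type).
 * Offsets are an assumption based on struct mlx5_err_cqe in mlx5dv.h. */
struct err_cqe_fields {
    uint8_t  vendor_err_synd;  /* byte 54 */
    uint8_t  syndrome;         /* byte 55 */
    uint8_t  wqe_opcode;       /* byte 56: top byte of s_wqe_opcode_qpn */
    uint32_t qpn;              /* bytes 57..59 */
    uint16_t wqe_counter;      /* bytes 60..61 */
};

/* words: the 16 hex words of the dump, in the order they are printed. */
struct err_cqe_fields decode_err_cqe(const uint32_t words[16])
{
    struct err_cqe_fields f;

    assert(words != NULL);
    f.vendor_err_synd = (uint8_t)((words[13] >> 8) & 0xff);
    f.syndrome        = (uint8_t)(words[13] & 0xff);
    f.wqe_opcode      = (uint8_t)(words[14] >> 24);
    f.qpn             = words[14] & 0xffffff;
    f.wqe_counter     = (uint16_t)(words[15] >> 16);
    return f;
}

/* A few syndrome codes as defined in mlx5dv.h (not exhaustive). */
const char *syndrome_name(uint8_t s)
{
    switch (s) {
    case 0x04: return "LOCAL_PROT_ERR";
    case 0x05: return "WR_FLUSH_ERR";
    case 0x12: return "REMOTE_INVAL_REQ_ERR";
    case 0x13: return "REMOTE_ACCESS_ERR";
    case 0x15: return "TRANSPORT_RETRY_EXC_ERR";
    case 0x16: return "RNR_RETRY_EXC_ERR";
    default:   return "other (see mlx5dv.h)";
    }
}

/* A few send-queue opcodes as defined in mlx5dv.h (not exhaustive). */
const char *opcode_name(uint8_t op)
{
    switch (op) {
    case 0x08: return "RDMA_WRITE";
    case 0x10: return "RDMA_READ";
    case 0x11: return "ATOMIC_CS";
    case 0x12: return "ATOMIC_FA";
    default:   return "other (see mlx5dv.h)";
    }
}

void print_err_cqe(const struct err_cqe_fields *f)
{
    printf("syndrome 0x%02x (%s), vendor syndrome 0x%02x, "
           "opcode 0x%02x (%s), qpn 0x%06x, wqe_counter %u\n",
           (unsigned)f->syndrome, syndrome_name(f->syndrome),
           (unsigned)f->vendor_err_synd,
           (unsigned)f->wqe_opcode, opcode_name(f->wqe_opcode),
           (unsigned)f->qpn, (unsigned)f->wqe_counter);
}
```

Under those layout assumptions, your dump decodes to syndrome 0x13 (remote access error), vendor syndrome 0x88, opcode 0x12 (atomic fetch-and-add) and QPN 0x0101af. That matches both the status above and your observation that the error coincides with atomic operations. The meaning of the vendor syndrome value itself is not covered by this sketch.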

There are several potential causes for this issue:

  1. Wrong “remote access key” parameter
  2. Wrong buffer size or address - caused the RDMA operation to exceed the remote buffer
  3. Wrong access flag during ibv_reg_mr - for example, access_flags=0
    IBV_ACCESS_REMOTE_WRITE was not set for RDMA Write operation
    IBV_ACCESS_REMOTE_READ was not set for RDMA Read operation
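
To narrow down which of these causes applies, one option is to validate every request on the initiator side just before it is posted and log the ones that fail. The following is a minimal sketch in C under stated assumptions: `remote_region`, `rdma_op` and `remote_access_ok` are hypothetical names, the `ACC_*` constants are stand-ins that mirror the bit values of `enum ibv_access_flags` so the sketch compiles without libibverbs, and it presumes the initiator knows the address, length and access flags of the remote registration. It also checks that atomics are 8 bytes long and 8-byte aligned, as InfiniBand atomics require.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins mirroring the bit values of enum ibv_access_flags in
 * <infiniband/verbs.h>; defined locally so no verbs header is needed. */
enum {
    ACC_LOCAL_WRITE   = 1 << 0,
    ACC_REMOTE_WRITE  = 1 << 1,
    ACC_REMOTE_READ   = 1 << 2,
    ACC_REMOTE_ATOMIC = 1 << 3
};

enum rdma_op { OP_RDMA_WRITE, OP_RDMA_READ, OP_ATOMIC };

/* What the initiator believes about the remote registration
 * (hypothetical bookkeeping, not a verbs structure). */
struct remote_region {
    uint64_t addr;    /* start address of the remote memory region */
    uint64_t length;  /* length of the region in bytes */
    int      access;  /* access flags it was registered with */
};

/* Returns true if the request stays inside the remote region and the
 * region was registered with the access right the operation needs. */
bool remote_access_ok(const struct remote_region *mr, enum rdma_op op,
                      uint64_t remote_addr, uint64_t len)
{
    uint64_t off;

    assert(mr != NULL);

    /* Range check, written so that it cannot overflow. */
    if (remote_addr < mr->addr)
        return false;
    off = remote_addr - mr->addr;
    if (off > mr->length || len > mr->length - off)
        return false;

    switch (op) {
    case OP_RDMA_WRITE:
        return (mr->access & ACC_REMOTE_WRITE) != 0;
    case OP_RDMA_READ:
        return (mr->access & ACC_REMOTE_READ) != 0;
    case OP_ATOMIC:
        /* Atomics operate on exactly 8 bytes, naturally aligned. */
        return (mr->access & ACC_REMOTE_ATOMIC) != 0 &&
               len == 8 && remote_addr % 8 == 0;
    }
    return false;
}
```

Calling such a check before each post, with the remote address, length and operation type of the request, would show whether any request leaves the registered range. If every request passes, the range and flags are less likely to be the problem, and the rkey (or the lifetime of the remote registration) becomes the more likely suspect.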

Regards,
Chen

Thank you for your response @chenh1 ,

I have re-checked my buffers and offsets and found no issue there. Unfortunately, I do not have much control over the actual buffer creation etc., as I am using the aforementioned library.
I also double-checked the alignment of the addresses targeted by atomic operations and found nothing wrong.

The issue tends to occur when a lot of requests are sent over the network at once, while the same code works fine over Ethernet, or over InfiniBand with a low process count. Could burst traffic trigger the issue in some way, e.g. through network congestion?

If not, am I right to conclude from your previous answer that, if my buffer offsets are correct, the problem must lie with the remote access keys or the access flags?

Thanks in advance!

Edit: I have checked the library's code:
The buffers are registered with
IBV_ACCESS_REMOTE_WRITE
| IBV_ACCESS_LOCAL_WRITE
| IBV_ACCESS_REMOTE_READ
| IBV_ACCESS_REMOTE_ATOMIC

The rkey is at least set on each operation, though I cannot check its validity. I assume it is correct, since the program runs fine for a few seconds before things go awry.