First of all, the error:
mlx5: node47-031.cm.cluster: got completion with error: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 0000001e 00000000 00000000 00000000 00000000 00008813 120101af 0000e3d2
Unfortunately, the interesting part (the vendor syndrome) is not documented, as far as I could see. It would really help me to know what it means.
The error appears to coincide with the usage of atomic operations.
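In case it helps with dissecting the dump: as far as I can tell, the 16 dwords are the raw 64-byte error CQE, and the field offsets in struct mlx5_err_cqe from the Linux kernel's mlx5 driver headers (include/linux/mlx5/device.h) suggest the decoding below. This is my own sketch based on reading those headers, not anything official, and the meaning of the vendor syndrome value itself remains undocumented.

```python
# Sketch: decode the 64-byte mlx5 error CQE dump from the log above.
# Field offsets are taken from struct mlx5_err_cqe in the Linux mlx5
# driver (include/linux/mlx5/device.h); this is an assumption on my
# part, not vendor documentation.
dump = ("00000000 00000000 00000000 00000000 "
        "00000000 00000000 00000000 00000000 "
        "0000001e 00000000 00000000 00000000 "
        "00000000 00008813 120101af 0000e3d2")

cqe = bytes.fromhex(dump.replace(" ", ""))   # 64-byte CQE, big-endian dwords

vendor_err_synd = cqe[54]                    # vendor-specific syndrome (undocumented)
syndrome        = cqe[55]                    # standard mlx5 CQE error syndrome
wqe_opcode      = cqe[56]                    # opcode of the failing WQE
qpn             = int.from_bytes(cqe[57:60], "big")  # QP number
wqe_counter     = int.from_bytes(cqe[60:62], "big")  # WQE counter

print(f"syndrome=0x{syndrome:02x} vendor=0x{vendor_err_synd:02x} "
      f"wqe_opcode=0x{wqe_opcode:02x} qpn=0x{qpn:06x}")
# prints: syndrome=0x13 vendor=0x88 wqe_opcode=0x12 qpn=0x0101af
```

If the kernel's constants apply here, syndrome 0x13 would be MLX5_CQE_SYNDROME_REMOTE_ACCESS_ERR and WQE opcode 0x12 would be MLX5_OPCODE_ATOMIC_FA (atomic fetch-and-add), which would at least be consistent with the correlation with atomics described below, but I cannot verify that reading.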
Now the scenario:
I am currently developing software for my thesis and ran into an issue involving InfiniBand/Mellanox hardware.
I am using a library (GPI2) that builds on verbs and can run over either Ethernet or InfiniBand. Running over Ethernet works fine, with no issues whatsoever even at larger node and thread counts.
Once I started testing on InfiniBand, things hit the fan.
The software uses a lot of threads per node, issuing requests in parallel. The traffic consists of atomic operations, many small messages, and fewer, but still frequent, big messages (~10 KiB).
With a small number of nodes, the software still runs fine over InfiniBand, but the problem escalates quickly, reaching an error rate of roughly 50% on 16 nodes with 40 threads each.
The cluster I am testing on has the following specs:
ibv_devinfo
hca_id: mlx5_0
        transport:              InfiniBand (0)
        fw_ver:                 12.26.1040
        node_guid:              506b:4b03:00c7:5e3c
        sys_image_guid:         506b:4b03:00c7:5e3c
        vendor_id:              0x02c9
        vendor_part_id:         4115
        hw_ver:                 0x0
        board_id:               MT_2180110032
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         15
                        port_lid:       255
                        port_lmc:       0x00
                        link_layer:     InfiniBand
If I can provide more information, feel free to ask.
I unfortunately do not have a minimal example to reproduce the error.
I am unfortunately at my wits' end, as my supervisors and I have rather limited experience with InfiniBand. Any help is greatly appreciated!