Good morning,
First of all, the error:
mlx5: node47-031.cm.cluster: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
0000001e 00000000 00000000 00000000
00000000 00008813 120101af 0000e3d2
Unfortunately, the interesting part (the vendor syndrome) does not seem to be documented, as far as I could see. It would really help me to know what it means.
The error appears to coincide with the usage of atomic operations.
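For context, as far as I can tell the hex dump above is printed by the mlx5 provider library itself; on my side the failure simply shows up as a work completion with a non-success status and the vendor syndrome in wc.vendor_err, roughly like this (a minimal sketch, not the actual GPI2 completion handling):

#include <stdio.h>
#include <infiniband/verbs.h>

/* Minimal sketch of how the failure surfaces when polling the CQ:
 * the completion carries the status and the vendor syndrome. */
static void drain_cq(struct ibv_cq *cq)
{
    struct ibv_wc wc[16];
    int n = ibv_poll_cq(cq, 16, wc);

    for (int i = 0; i < n; i++) {
        if (wc[i].status != IBV_WC_SUCCESS) {
            fprintf(stderr,
                    "wr_id %llu failed: %s (status %d), vendor_err 0x%x\n",
                    (unsigned long long)wc[i].wr_id,
                    ibv_wc_status_str(wc[i].status),
                    wc[i].status, wc[i].vendor_err);
        }
    }
}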
Now the scenario:
I am currently developing software for my thesis and ran into an issue related to InfiniBand/Mellanox hardware.
I am using a library (GPI2) that builds on verbs and can run over either Ethernet or InfiniBand. Running over Ethernet works fine and shows no issues whatsoever, even with larger node and thread counts.
Once I started testing on InfiniBand, things hit the fan.
The software uses a lot of threads per node, issuing requests in parallel. The traffic consists of atomic operations, many small messages, and fewer, but still frequent, larger messages (~10 KiB).
With a small number of nodes the software still runs fine over InfiniBand, but the issues escalate quickly, reaching a roughly 50% error rate on 16 nodes with 40 threads each.
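To my understanding, the atomics in question end up as verbs fetch-and-add / compare-and-swap work requests, roughly like the following sketch (placeholder qp/mr/remote_addr/rkey; this is my mental model, not the actual GPI2 internals). Many of these are in flight at the same time from the 40 threads per node.

#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch of a fetch-and-add as I believe the library issues it
 * under the hood (placeholders, not the real GPI2 code). */
static int post_fetch_add(struct ibv_qp *qp, struct ibv_mr *mr,
                          uint64_t *local_result,
                          uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_result, /* old remote value lands here */
        .length = sizeof(uint64_t),        /* verbs atomics operate on 8 bytes */
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD,
        .send_flags = IBV_SEND_SIGNALED,
        .wr.atomic.remote_addr = remote_addr, /* must be 8-byte aligned */
        .wr.atomic.rkey        = rkey,
        .wr.atomic.compare_add = 1,           /* value to add */
    };
    struct ibv_send_wr *bad_wr = NULL;

    return ibv_post_send(qp, &wr, &bad_wr);
}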
The cluster I am testing on has the following specs; ibv_devinfo reports:
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 12.26.1040
node_guid: 506b:4b03:00c7:5e3c
sys_image_guid: 506b:4b03:00c7:5e3c
vendor_id: 0x02c9
vendor_part_id: 4115
hw_ver: 0x0
board_id: MT_2180110032
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 15
port_lid: 255
port_lmc: 0x00
link_layer: InfiniBand
If I can provide more information, feel free to ask.
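For instance, since the error seems tied to the atomics, I could report the device's atomic capability and the per-QP limits on outstanding reads/atomics, queried roughly like this (just a sketch of what I would run; ibv_devinfo -v should print the same fields):

#include <stdio.h>
#include <infiniband/verbs.h>

/* Sketch: query atomic capability and per-QP limits on outstanding
 * RDMA reads/atomics for the first device found. */
int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0)
        return 1;

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_device_attr attr;
    if (ctx && ibv_query_device(ctx, &attr) == 0) {
        printf("atomic_cap: %d (0 = none, 1 = per HCA, 2 = global)\n",
               attr.atomic_cap);
        printf("max_qp_rd_atom: %d, max_qp_init_rd_atom: %d\n",
               attr.max_qp_rd_atom, attr.max_qp_init_rd_atom);
    }

    if (ctx)
        ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}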
I unfortunately do not have a minimal example to reproduce the error.
I am unfortunately at my wits' end, as my supervisor's and my experience with InfiniBand is rather limited. Any help is greatly appreciated!