First of all, the error:
mlx5: node47-031.cm.cluster: got completion with error: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 0000001e 00000000 00000000 00000000 00000000 00008813 120101af 0000e3d2
Unfortunately, the interesting part (the vendor syndrome) is not documented, as far as I could see. It would really help me to know what it means.
The error appears to coincide with the usage of atomic operations.
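In case it helps with dissecting the dump: as far as I can tell, the 16 dwords are the raw 64-byte error CQE, and the field offsets in struct mlx5_err_cqe from the Linux kernel's mlx5 driver headers (include/linux/mlx5/device.h) suggest the decoding below. This is my own sketch based on reading those headers, not anything official, and the meaning of the vendor syndrome value itself remains undocumented.

```python
# Sketch: decode the 64-byte mlx5 error CQE dump from the log above.
# Field offsets are taken from struct mlx5_err_cqe in the Linux mlx5
# driver (include/linux/mlx5/device.h); this is an assumption on my
# part, not vendor documentation.
dump = ("00000000 00000000 00000000 00000000 "
        "00000000 00000000 00000000 00000000 "
        "0000001e 00000000 00000000 00000000 "
        "00000000 00008813 120101af 0000e3d2")

cqe = bytes.fromhex(dump.replace(" ", ""))   # 64-byte CQE, big-endian dwords

vendor_err_synd = cqe[54]                    # vendor-specific syndrome (undocumented)
syndrome        = cqe[55]                    # standard mlx5 CQE error syndrome
wqe_opcode      = cqe[56]                    # opcode of the failing WQE
qpn             = int.from_bytes(cqe[57:60], "big")  # QP number
wqe_counter     = int.from_bytes(cqe[60:62], "big")  # WQE counter

print(f"syndrome=0x{syndrome:02x} vendor=0x{vendor_err_synd:02x} "
      f"wqe_opcode=0x{wqe_opcode:02x} qpn=0x{qpn:06x}")
# prints: syndrome=0x13 vendor=0x88 wqe_opcode=0x12 qpn=0x0101af
```

If the kernel's constants apply here, syndrome 0x13 would be MLX5_CQE_SYNDROME_REMOTE_ACCESS_ERR and WQE opcode 0x12 would be MLX5_OPCODE_ATOMIC_FA (atomic fetch-and-add), which would at least be consistent with the correlation with atomics described below, but I cannot verify that reading.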
Now the scenario:
I am currently developing software for my thesis and ran into an issue involving InfiniBand/Mellanox hardware.
I am using a library (GPI2) that builds on verbs and can run over either Ethernet or InfiniBand. Running over Ethernet works fine, with no issues whatsoever even at larger node and thread counts.
Once I started testing on InfiniBand, things hit the fan.
The software uses a lot of threads per node, issuing requests in parallel. The traffic consists of atomic operations, many small messages, and fewer, but still frequent, big messages (~10 KiB).
With a small number of nodes, the software still runs fine over InfiniBand, but the problem escalates quickly, reaching an error rate of roughly 50% on 16 nodes with 40 threads each.
The cluster I am testing on has the following specs:
ibv_devinfo
hca_id: mlx5_0
        transport:              InfiniBand (0)
        fw_ver:                 12.26.1040
        node_guid:              506b:4b03:00c7:5e3c
        sys_image_guid:         506b:4b03:00c7:5e3c
        vendor_id:              0x02c9
        vendor_part_id:         4115
        hw_ver:                 0x0
        board_id:               MT_2180110032
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         15
                        port_lid:       255
                        port_lmc:       0x00
                        link_layer:     InfiniBand
If I can provide more information, feel free to ask.
I unfortunately do not have a minimal example to reproduce the error.
I am unfortunately at my wits' end, as my supervisors and I have rather limited experience with InfiniBand. Any help is greatly appreciated!