mlx5_eq_async_int:390:(pid 13335): CQ error on CQN 0x45, syndrome 0x1

Hello

A CFD solver runs Intel MPI over Mellanox InfiniBand on 2 nodes (HP Apollo 230), both running RHEL 7.9.

Sometimes the same computation crashes with this message:

mlx5_core 0000:5c:00.0: mlx5_eq_async_int:390:(pid 13335): CQ error on CQN 0x45, syndrome 0x1

Any hints?
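When the crash is intermittent, it can help to collect the CQ error fields from the kernel log on both nodes and compare them between runs. The following is a minimal sketch, assuming the log line keeps exactly the layout quoted above; the function name `parse_cq_error` and the regular expression are illustrative and not part of any Mellanox tool.

```python
import re

# Assumed layout of the mlx5_core CQ error line, taken from the message above.
CQ_ERR_RE = re.compile(
    r"mlx5_core (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"mlx5_eq_async_int:\d+:\(pid (?P<pid>\d+)\): "
    r"CQ error on CQN (?P<cqn>0x[0-9a-f]+), syndrome (?P<syndrome>0x[0-9a-f]+)"
)


def parse_cq_error(line):
    """Return the PCI address, pid, CQ number and syndrome, or None if no match."""
    m = CQ_ERR_RE.search(line)
    if m is None:
        return None
    return {
        "pci": m.group("pci"),
        "pid": int(m.group("pid")),
        "cqn": int(m.group("cqn"), 16),
        "syndrome": int(m.group("syndrome"), 16),
    }


line = ("mlx5_core 0000:5c:00.0: mlx5_eq_async_int:390:(pid 13335): "
        "CQ error on CQN 0x45, syndrome 0x1")
print(parse_cq_error(line))
```

Feeding each line of the `dmesg` output through this function shows whether the error always hits the same CQN and the same device.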

ibstat

CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.27.4000
    Hardware version: 0
    Node GUID: 0xb88303ffff79d104
    System image GUID: 0xb88303ffff79d104
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 61
        LMC: 0
        SM lid: 1
        Capability mask: 0x2659e84a
        Port GUID: 0xb88303ffff79d104
        Link layer: InfiniBand

CA 'mlx5_1'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.27.4000
    Hardware version: 0
    Node GUID: 0xb88303ffff79d105
    System image GUID: 0xb88303ffff79d104
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 65535
        LMC: 0
        SM lid: 0
        Capability mask: 0x2659e848
        Port GUID: 0xb88303ffff79d105
        Link layer: InfiniBand

[root@fnx628 plugins]#

Hi Markus,

Based on the information provided, syndrome 0x1 points to IBV_WC_LOC_LEN_ERR (1) - “Local Length Error”.

Can you please provide the vendor syndrome?

Regards,

Chen

Sorry for the late response, Chen.

What exactly is the “vendor syndrome”?

Thanks

Regards

Markus