Hi,
On some of our machines we are seeing hangs (sporadic, but still “too often”) that seem to be related to the InfiniBand adapters.
Specifically, when I run any of the following commands, it never returns to the shell and does not react to CTRL+C:
- ibstat (this seems particularly bad)
- ping $OTHER_HOSTS_IP_ASSIGNED_TO_INFINIBAND_DEVICE
- ls /nfs/mount/on/server/connected/via/infiniband
- sudo ifdown ib0 && sudo ifup ib0
Of course, after a reboot everything is fine again :-/
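Since the commands ignore CTRL+C, my guess is that they are stuck in uninterruptible sleep (process state D), but I have not confirmed this. A minimal sketch of how that could be checked from a second shell while a command is hung; the function name filter_uninterruptible is made up for illustration, and the ps columns assume the procps version of ps:

```shell
# Keep the header line plus every process whose STAT column starts with "D"
# (uninterruptible sleep). Expects `ps -eo pid,stat,wchan:32,args` on stdin.
filter_uninterruptible() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Usage, from a second terminal while a command is hung:
#   ps -eo pid,stat,wchan:32,args | filter_uninterruptible
```

If ibstat or ping show up here, the WCHAN column should name the kernel function they are waiting in, which may help narrow down where the hang occurs.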
I am very new to this, so my troubleshooting skills are weak. I have listed some basic information below and would be grateful for further guidance on how to debug this issue.
Thanks,
Jonas
$ grep ib0 /var/log/syslog # this is around the time when the problem happened
Dec 5 10:13:29 heinzel60 kernel: [70988.535625] ib0: ipoib_cm_tx_destroy_rss: 7 not completed for QP: 0x257 force cleanup.
Dec 5 11:55:50 heinzel60 kernel: [77129.295892] ib0: timing out; 7 sends not completed
Dec 5 11:55:55 heinzel60 kernel: [77134.300207] ib0: timing out; 7 sends not completed
Dec 5 11:56:00 heinzel60 kernel: [77139.304520] ib0: timing out; 7 sends not completed
Dec 5 11:56:05 heinzel60 kernel: [77144.308840] ib0: timing out; 7 sends not completed
Dec 5 11:56:10 heinzel60 kernel: [77149.313159] ib0: timing out; 7 sends not completed
Dec 5 11:56:10 heinzel60 kernel: [77149.313795] ib0: ipoib_cm_tx_destroy_rss: 7 not completed for QP: 0x265 force cleanup.
Dec 5 12:08:40 heinzel60 kernel: [77899.392921] ib0: timing out; 7 sends not completed
Dec 5 12:08:45 heinzel60 kernel: [77904.397237] ib0: timing out; 7 sends not completed
Dec 5 12:08:50 heinzel60 kernel: [77909.401558] ib0: timing out; 7 sends not completed
Dec 5 12:08:55 heinzel60 kernel: [77914.405874] ib0: timing out; 7 sends not completed
Dec 5 12:09:00 heinzel60 kernel: [77919.410197] ib0: timing out; 7 sends not completed
[…]
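To get a feel for how often the timeouts occur, the log lines above could be condensed into per-minute counts. A rough sketch, assuming the syslog timestamp layout shown above (month, day, HH:MM:SS); summarize_ipoib_timeouts is a hypothetical helper name, not an existing tool:

```shell
# Count "ib0: timing out" messages per minute, in order of first appearance.
# $1: path to a syslog-style file (defaults to /var/log/syslog).
summarize_ipoib_timeouts() {
    awk '/ib0: timing out;/ {
        key = $1 " " $2 " " substr($3, 1, 5)   # e.g. "Dec 5 11:55"
        if (!(key in count)) order[++n] = key
        count[key]++
    }
    END {
        for (i = 1; i <= n; i++) print count[order[i]], order[i]
    }' "${1:-/var/log/syslog}"
}
```

Calling `summarize_ipoib_timeouts /var/log/syslog` prints one line per minute, with the number of timeout messages followed by the timestamp prefix.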
$ lspci | grep Mellanox
01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
$ ibstat # as noted above, this only worked after a reboot
CA 'mlx4_0'
    CA type: MT4099
    Number of ports: 1
    Firmware version: 2.42.5000
    Hardware version: 1
    Node GUID: 0xec0d9a0300062a80
    System image GUID: 0xec0d9a0300062a83
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 56
        Base lid: 52
        LMC: 0
        SM lid: 25
        Capability mask: 0x02514868
        Port GUID: 0xec0d9a0300062a81
        Link layer: InfiniBand
$ lsmod | egrep 'ib|mlx'
ib_ucm 20480 0
ib_uverbs 106496 2 ib_ucm,rdma_ucm
mlx5_fpga_tools 16384 0
mlx5_ib 266240 0
mlx5_core 782336 2 mlx5_ib,mlx5_fpga_tools
mlxfw 20480 1 mlx5_core
ib_iser 49152 0
rdma_cm 61440 2 ib_iser,rdma_ucm
libiscsi_tcp 24576 1 iscsi_tcp
libiscsi 53248 3 libiscsi_tcp,iscsi_tcp,ib_iser
scsi_transport_iscsi 98304 4 iscsi_tcp,ib_iser,libiscsi
ib_ipoib 163840 0
ib_cm 53248 3 rdma_cm,ib_ucm,ib_ipoib
ib_umad 24576 0
mlx4_ib 208896 0
ib_core 282624 11 rdma_cm,ib_cm,iw_cm,mlx4_ib,mlx5_ib,ib_ucm,ib_iser,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
libcrc32c 16384 1 raid456
mlx4_en 135168 0
vxlan 49152 2 mlx4_en,mlx5_core
ptp 20480 3 igb,mlx4_en,mlx5_core
libahci 32768 1 ahci
mlx4_core 348160 2 mlx4_en,mlx4_ib
mlx_compat 24576 16 rdma_cm,ib_cm,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,mlx5_fpga_tools,ib_ucm,ib_core,ib_iser,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib