Infiniband/NFSoRDMA causing crash/reboot of server?

scar1 · May 18, 2017, 10:49pm

Within the last 8 months or so, I recently upgraded our cluster with ConnectX-3 MCX354A-FCBT cards and 36-port SwitchX MSX6025T-1SFS unmanaged switch–all brand new in box. We have been experiencing crashes/reboots of two of the nodes and I’m wondering if it’s related to the infiniband? Currently the cluster only uses the fabric for NFSoRDMA.

There are 9 nodes in total connected at 40Gb FDR10. The OpenSM manager is running on two of the nodes. The nodes are running CentOS 6.9 with 2.6.32-696.1.1.el6.x86_64 kernel and CentOS nfs-rdma package/kernel modules. The HCA’s all have recent firmware 2.40.7000.

One compute node has been returned to the vendor twice and they are now replacing all the hardware except the HCA since I purchased/installed that separately (but we tried swapping HCA’s with the other nodes and the same node crashed/rebooted).

Now our NFS server node has crashed twice in 24 hrs and currently the infiniband is not connected until this can be resolved, and we are using the 1Gb port instead. Both machines are less than 6 months old and the NFS server has been connected to the IB fabric less than a month.

The NFS server uses an Asus Z10PE-D16 board with single Intel Xeon E5-2620v4 CPU.

I attached the latest boot.log from the NFS server.

I’m wondering if heavy load on the NFS can cause this? At the time of the last crash the load avg of the server was about 70% with 8 nfsd processes busy. But we have run similar jobs in the past month that didn’t cause reboot…

Could it be lack of RAM? The NFS server has 64GB but I would think the OS would manage it accordingly…

I just noticed the cables the vendor sold us are MC2210130 40Gb ethernet, could that be it? Should it be IB FDR10 cable, MC2206130 for example?

Would appreciate your insights so I can figure out why our server keeps crashing. Let me know if you’d like more details about anything.

Thanks

boot.log.zip (16.5 KB)

spruitt · May 30, 2017, 7:24pm

Hi Chandler,

Are you using Mellanox OFED Driver and if so what version? (#ofed_info | head -1) or are you using the Inbox driver embedded within the OS?

The boot log file attached does not contain the Kernel Trace upon crashing.

If this is an Infiniband Network, cable “MC2206130” is a Mellanox® passive copper cable, IB QDR/FDR10, 40Gb/s, QSFP and cable “MC2210130” is a Mellanox Passive Copper cable, 40Gb/s, QSFP only (No VPI).

##Suggestions:

Confirm the IB fabric health: (run this utility and options while the cluster is active)

#ibdiagnet -r --pc -P all=1 --pm_pause_time 1200 -o /tmp/$(date +%Y%m%d)

Review the ibdiagnet.2 log file from the folder created.

The messages file should have reported a Kernel trace, review kernel trace and events prior to the system crashing.
Review the traffic pattern at the time of the issue.
Review if there were any other particular changes within your environment.

Regards,

Sophie.

Topic		Replies	Views
Windows VMs hang out NFSoRDMA on CentOS 6.5 Mellanox OFED	2	314	September 24, 2014
Mixing OFED 1.5.3 and 2.2 in the same network? Mellanox OFED	4	442	June 18, 2017
Hello. We have problems on our old HPC cluster. Adapters and Cables	7	541	September 20, 2019
XenServer support	13	269	January 14, 2014
A newbie problem with infiniband.	0	432	March 19, 2014
New to infiniband, can't get a working connection.	22	2077	September 9, 2013
Installation and fireware update problems InfiniBand/VPI Adapter Cards ibv_devinfo	6	1155	March 22, 2013
Why is NFSoRDMA in CentOS 7.6.1810 limited to 10 Gbps? Software And Drivers	2	330	September 8, 2019
how to debug "hangs" related to infiniband adapter Ethernet Adapter Cards	4	828	April 13, 2020
IP over Infiniband @ FreeBSD 11.2: fatal Kernel trap 12 after packet length >2044 bytes in connected mode Software And Drivers infiniband , ip	3	701	September 9, 2019

Infiniband/NFSoRDMA causing crash/reboot of server?

Related topics