InfiniHost III Ex - Suspend/Resume not working on Debian Linux

Hello,

I’m using an InfiniHost III Ex (MT25208) on Debian Jessie. After running ‘pm-suspend’ and then resuming, the network stops responding. If I then run ibstat or ibstatus, the command hangs and eventually an error message from ib_mthca appears:

ib_mthca 0000:02:00.0: HW2SW_MPT failed (-16)

Here is a list of modules loaded on startup:

mlx4_ib

ib_umad

ib_ipoib

I’ve also tried unloading the modules before suspending like this:

/etc/init.d/opensm stop

modprobe -r ib_ipoib

modprobe -r ib_umad

modprobe -r mlx4_ib

modprobe -r ib_mthca

But when I reload the modules my ib1 interface does not appear. This happens even if I don’t suspend.

Btw, I’ve attempted to update the firmware but I can’t get anything to work. Examples:

lspci -d 15:b3 = nothing

ibv_devinfo | grep hca_id = Failed to get IB devices list: Function not implemented.

mstflint -d 02:00.0 q = -E- Cannot open Device: 02:00.0. File exists MFE_OLD_DEVICE_TYPE

plain lspci =

02:00.0 InfiniBand: Mellanox Technologies MT25208 [InfiniHost III Ex] (rev a0)
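A possible reason the filtered lspci call above prints nothing: `lspci -d` takes `[vendor]:[device]` in hex, and the Mellanox PCI vendor ID is 15b3, so `15:b3` asks for vendor 0x15 with device 0xb3. A minimal sketch of the intended filter, assuming pciutils is installed; the helper name `mlx_slots` is made up for illustration:

```shell
#!/bin/sh
# Print the PCI slot of every Mellanox device (vendor ID 15b3).
# Reads `lspci -n` style text on stdin so it can be checked offline.
mlx_slots() {
    # Field 3 of `lspci -n` output is vendor:device in hex.
    awk '$3 ~ /^15b3:/ { print $1 }'
}

# On a live system (assumes lspci is available):
#   lspci -n | mlx_slots
#   lspci -d 15b3:        # the same filter done by lspci itself
```

If the card is visible to plain lspci, as it is here, both forms should list slot 02:00.0.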

Here’s a more complete log output:

Nov 18 14:28:29 alin kernel: [ 9.977168] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)

Nov 18 14:28:29 alin kernel: [ 9.977170] ib_mthca: Initializing 0000:02:00.0

Nov 18 14:28:29 alin kernel: [ 11.374057] ib_mthca 0000:02:00.0: HCA FW version 5.1.000 is old (5.3.000 is current).

Nov 18 14:28:29 alin kernel: [ 11.374059] ib_mthca 0000:02:00.0: If you have problems, try updating your HCA FW.

Nov 18 14:29:10 alin kernel: [ 59.296536] ib1: ib_dealloc_pd failed

Nov 18 14:31:22 alin kernel: [ 167.880313] ib_mthca 0000:02:00.0: SW2HW_MPT failed (-16)

Nov 18 14:33:16 alin kernel: [ 281.265414] ib_mthca 0000:02:00.0: HW2SW_MPT failed (-16)

Nov 18 14:33:22 alin kernel: [ 287.885556] ib_mthca 0000:02:00.0: SW2HW_MPT failed (-16)

Nov 18 14:34:16 alin kernel: [ 341.266202] ib_mthca 0000:02:00.0: HW2SW_MPT failed (-16)

Nov 18 14:34:22 alin kernel: [ 347.886276] mthca0: ib_query_port 1 failed

The driver suggests a firmware update, and more errors follow it.
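The -16 in these messages looks like a negated errno; on Linux, errno 16 is EBUSY, which would be consistent with firmware commands that never complete. To see which commands fail and how often, the log can be summarized. A sketch, assuming the kernel messages have the format quoted above; `mthca_failures` is a hypothetical helper name:

```shell
#!/bin/sh
# Count failed ib_mthca firmware commands per command name.
# Reads kernel log text on stdin; prints "<count> <command>" lines.
mthca_failures() {
    # Match e.g. "ib_mthca 0000:02:00.0: HW2SW_MPT failed (-16)" and
    # keep only the command name that precedes " failed".
    sed -n 's/.*ib_mthca [^ ]*: \([A-Z0-9_]*\) failed (-[0-9]*).*/\1/p' |
        sort | uniq -c | awk '{ print $1, $2 }'
}

# On a live system:
#   dmesg | mthca_failures
```

Run against the excerpt above, this would report the SW2HW_MPT and HW2SW_MPT failures separately, which makes it easier to tell whether the two always come in pairs.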

I don’t have the ‘mst’ command. I installed the debian package mstflint:

mstflint - Mellanox firmware burning application

Which comes with: mstconfig mstflint mstmcra mstmread mstmtserver mstmwrite mstregdump mstvpd

Rebooting does solve the problem.

I should mention, if I don’t put an IP address on the card and connect to the network, I can unload the modules in this order (unlike my example above):

modprobe -r ib_ipoib

modprobe -r ib_umad

modprobe -r mlx4_ib

Nevertheless, if I load the modules again in the correct order, I don’t get an ib0 or ib1 interface, and ibstatus shows:

Fatal error: device '': sys files not found (/sys/class/infiniband//ports)

/usr/sbin/ibstatus: 21: exit: Illegal number: -1

Note: this is all without suspend/resume being involved. So basically, I can load the modules only once and have connectivity; any subsequent reload renders the card unresponsive, and nothing shows up in the log files or dmesg. If I can solve that problem, then I could probably get suspend/resume to work.
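If the reload problem can be solved, the manual unload/reload sequence could be wired into a pm-utils sleep hook, which pm-suspend calls with `suspend` or `hibernate` before sleeping and `resume` or `thaw` afterwards. A rough sketch, assuming the module list from this post is right for the card and that the script is installed as an executable file under /etc/pm/sleep.d/ (the file name 50-infiniband is made up). DRY_RUN is a hypothetical switch that prints the commands instead of running them:

```shell
#!/bin/sh
# Sketch of /etc/pm/sleep.d/50-infiniband (file name is illustrative).
# Module names and order are copied from the post, not verified.
MODULES="ib_ipoib ib_umad mlx4_ib ib_mthca"   # unload order

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

ib_sleep_hook() {
    case "$1" in
        suspend|hibernate)
            run /etc/init.d/opensm stop
            for m in $MODULES; do run modprobe -r "$m"; done
            ;;
        resume|thaw)
            # Reload in the reverse of the unload order.
            rev=""
            for m in $MODULES; do rev="$m $rev"; done
            for m in $rev; do run modprobe "$m"; done
            run /etc/init.d/opensm start
            ;;
    esac
}

# pm-utils passes the action as the first argument.
ib_sleep_hook "${1:-}"
```

Calling the script with DRY_RUN=1 and the argument `suspend` or `resume` shows the planned commands without touching the driver, which is a safer first check given that reloading currently leaves the card unresponsive.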

Hi,

It’s a bit hard to understand what actually happened without looking at the full kernel log, but the first issue looks like a memory issue with QP registrations, which was most likely caused by an earlier problem. The most common causes would be the firmware getting stuck, a PCI issue, etc. I would swap the HCA with another one to see whether the issue follows the card.

As for upgrading, this is a really old HCA, so newer MFT versions will most likely not work with it. Are you still in that state even after the server is rebooted? What does “mst status” show?

Hi,

Thanks for the explanation.

I’m not totally sure how this old HCA FW handles a state where the modules are shut down by pm-suspend.

I would start by rebooting the server and going back to step 1: making sure you have the latest OFED for your Debian OS and the latest FW before attempting these kinds of tests.

If you can list exactly what you have, we may be able to locate the necessary drivers (although they’re antiques).