mlx4_0 Initializing and... nothing, fails? on CentOS on Dell servers, MT25408

hi all,

I have a very basic setup: two boxes connected directly via two MHEH28-XTC cards, and I cannot get the ports to go active.

One peculiar thing is that I get the following (randomly, and not often):

[85947.090496] AMD-Vi: Event logged [

[85947.090539] IO_PAGE_FAULT device=09:00.7 domain=0x0000 address=0x00000000f6ffb000 flags=0x0050]

[85947.298509] AMD-Vi: Event logged [

[85947.298550] IO_PAGE_FAULT device=09:00.7 domain=0x0000 address=0x00000000f6ffb000 flags=0x0050]

which is the card itself, judging by the device id
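
To double-check that, the faulting device address can be pulled out of the kernel log and compared with the card's PCI address (09:00.0 in the mstflint command below). This is only a rough sketch: fault_devices is a made-up helper name, and it assumes the IO_PAGE_FAULT lines look like the ones quoted above.

```shell
#!/bin/sh
# fault_devices: read kernel log text on stdin and print each distinct
# PCI address (bus:device.function) named in an IO_PAGE_FAULT line.
# Hypothetical helper for illustration; it only parses text.
fault_devices() {
    sed -n 's/.*IO_PAGE_FAULT device=\([0-9a-fA-F:.]*\).*/\1/p' | sort -u
}

# Assumed usage (not run here):
#   dmesg | fault_devices
#   lspci -s 09:00        # list every function on bus 09, device 00
```

Note that the fault names function 7 (09:00.7) while the card is queried at function 0 (09:00.0), so only the bus and device parts match.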

Would you have some thoughts to share, please?

$ ./flint/mstflint -d 09:00.0 q # for both cards

-W- Running quick query - Skipping full image integrity checks.

Image type: FS2

FW Version: 2.9.1000

Device ID: 25408

Description: Node Port1 Port2 Sys image

GUIDs: 0008f104039a62a0 0008f104039a62a1 0008f104039a62a2 0008f104039a62a3

MACs: 000000000000 000000000001

VSD:

PSID: MT_04A0110001

$ ibstat

CA 'mlx4_0'

CA type: MT25408

Number of ports: 2

Firmware version: 2.9.1000

Hardware version: a0

Node GUID: 0x0008f104039a08dc

System image GUID: 0x0008f104039a08df

Port 1:

State: Initializing

Physical state: LinkUp

Rate: 10

Base lid: 1

LMC: 0

SM lid: 1

Capability mask: 0x0259086a

Port GUID: 0x0008f104039a08dd

Link layer: InfiniBand

Port 2:

State: Down

Physical state: Polling

Rate: 10

Base lid: 0

LMC: 0

SM lid: 0

Capability mask: 0x0259086a

Port GUID: 0x0008f104039a08de

Link layer: InfiniBand

in opensm log:

Jan 06 17:00:28 817185 [F6D5A700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1cd1

Jan 06 17:00:28 817200 [F6D5A700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3120 Timeout while getting attribute 0x11 (NodeInfo); Possible mis-set mkey?

many thanks

I would suggest a few things:

  1. Install the latest stable MOFED (3.4) on both sides. It is not clear why you need ‘./’ on your command line, since the ‘flint’ command is part of MOFED.

  2. Start the opensmd service and verify that the port reaches the ACTIVE state.

  3. Try replacing the cable.

  4. Connect the ports back-to-back on the host where opensmd is running. What is the status of the port?

  5. You might also check whether MOFED-2.4 solves the issue.
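
For step 2, here is a minimal sketch of checking the port state from a script rather than by eye. It assumes the CA is named mlx4_0, as in the ibstat output above, and that the opensmd service is installed. port_state is a made-up helper name, not part of MOFED or infiniband-diags.

```shell
#!/bin/sh
# port_state PORT: read ibstat-style text on stdin and print the "State:"
# value of the given port number. Hypothetical helper; it only parses text.
port_state() {
    awk -v port="$1" '
        $1 == "Port" && $2 == (port ":") { in_port = 1; next }  # e.g. "Port 1:"
        $1 == "Port" && $2 ~ /^[0-9]+:$/ { in_port = 0 }        # another port starts
        in_port && $1 == "State:"        { print $2; exit }
    '
}

# Assumed usage on the host running the subnet manager (as root, not run here):
#   service opensmd start
#   ibstat mlx4_0 | port_state 1
```

With the output posted in the question this would print Initializing for port 1. Once the subnet manager has brought the link up, it should print Active.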