Mellanox OFED 4.4 and later has a compatibility problem with Lustre, at least with Lustre 2.10.x. The last version of Mellanox OFED that works is 4.3-1.0.1.0. I'm not sure how to raise a bug, but maybe an engineer will see this.
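
For anyone checking whether a client falls in the affected range, here is a minimal shell sketch. It assumes that ofed_info -s prints a banner of the form MLNX_OFED_LINUX-<version>: (an assumption about the output format) and that GNU sort with -V is available. The function names are hypothetical, and the 4.4 threshold is only what was observed in this report, not a confirmed boundary.

```shell
# Hypothetical helpers; not part of MLNX_OFED or Lustre.

# Strip the assumed "MLNX_OFED_LINUX-" prefix and trailing ":" from a banner.
ofed_version_from_banner() {
    printf '%s\n' "$1" | sed -e 's/^MLNX_OFED_LINUX-//' -e 's/:$//'
}

# Succeed (exit 0) if the version sorts at or after 4.4, the first release
# where the mount failure was observed in this report.
ofed_is_affected() {
    lowest="$(printf '%s\n%s\n' "4.4" "$1" | sort -V | head -n 1)"
    [ "$lowest" = "4.4" ]
}

# Example:
#   ver="$(ofed_version_from_banner "$(ofed_info -s)")"
#   if ofed_is_affected "$ver"; then echo "OFED $ver is in the affected range"; fi
```

With the versions mentioned above, 4.3-1.0.1.0 is reported as not affected, and any 4.4 or 4.5 build is reported as affected.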

  • Mounting /scratch (first attempt)

[abc@dtn1 ~ ]$ sudo mount -t lustre csmds1.ib@o2ib:csmds2.ib@o2ib:/scratch /scratch

[sudo] password for abc:

mount.lustre: mount csmds1.ib@o2ib:csmds2.ib@o2ib:/scratch at /scratch failed: Input/output error

Is the MGS running?

Apr 4 10:22:30 dtn1 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 16, npartitions: 2

Apr 4 10:22:30 dtn1 kernel: alg: No test for adler32 (adler32-zlib)

Apr 4 10:22:30 dtn1 kernel: alg: No test for crc32 (crc32-table)

Apr 4 10:22:30 dtn1 kernel: alg: No test for crc32 (crc32-pclmul)

Apr 4 10:16:14 dtn1 kernel: Lustre: Lustre: Build Version: 2.10.7

Apr 4 10:16:14 dtn1 kernel: LNet: Added LNI 172.16.3.19@o2ib [8/256/0/180]

Apr 4 10:16:16 dtn1 kernel: LNet: 4565:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.16.3.219@o2ib: 4294746 seconds

Apr 4 10:16:16 dtn1 kernel: Lustre: 4575:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1554398174/real 1554398176] req@ffff88086df21c80 x1629904619700240/t0(0) o250->MGC172.16.3.219@o2ib@172.16.3.219@o2ib:26/25 lens 520/544 e 0 to 1 dl 1554398179 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1

Apr 4 10:16:20 dtn1 kernel: LustreError: 4534:0:(mgc_request.c:251:do_config_log_add()) MGC172.16.3.219@o2ib: failed processing log, type 1: rc = -5

Apr 4 10:16:29 dtn1 kernel: LustreError: 4604:0:(mgc_request.c:603:do_requeue()) failed processing log: -5

Apr 4 10:16:42 dtn1 kernel: LNet: 4565:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.16.3.220@o2ib: 4294772 seconds

Apr 4 10:16:42 dtn1 kernel: Lustre: 4575:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1554398199/real 1554398202] req@ffff88105dba8cc0 x1629904619700304/t0(0) o250->MGC172.16.3.219@o2ib@172.16.3.220@o2ib:26/25 lens 520/544 e 0 to 1 dl 1554398204 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1

Apr 4 10:16:51 dtn1 kernel: LustreError: 15c-8: MGC172.16.3.219@o2ib: The configuration from log 'scratch-client' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.

Apr 4 10:16:51 dtn1 kernel: Lustre: Unmounted scratch-client

Apr 4 10:16:51 dtn1 kernel: LustreError: 4534:0:(obd_mount.c:1582:lustre_fill_super()) Unable to mount (-5)
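
To see at a glance which peers the client gave up on, the tx timeout lines in a log like the one above can be reduced to a list of NIDs. This is a rough sketch: the function name is hypothetical, and the pattern is based only on the "Timed out tx for" message format shown in this log.

```shell
# Hypothetical helper: read kernel log text on stdin and print each distinct
# peer NID that o2iblnd reported a tx timeout for, one per line.
timed_out_nids() {
    sed -n 's/.*Timed out tx for \([^:]*\): .*/\1/p' | sort -u
}

# Example:
#   dmesg | timed_out_nids
```

For the log above this lists 172.16.3.219@o2ib and 172.16.3.220@o2ib, the same two NIDs that the failed MGC requests were sent to.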

Then, after retrying the mount twice:

Apr 4 10:31:03 dtn1 kernel: LustreError: 4660:0:(obd_config.c:1361:class_process_proc_param()) scratch-clilov-ffff88086e728800: unknown config parameter 'lov.qos_threshold_rr=100'

Apr 4 10:31:03 dtn1 kernel: Lustre: Mounted scratch-client

Apr 4 10:31:23 dtn1 kernel: Lustre: 4618:0:(client.c:2116:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1554399063/real 0] req@ffff880864b0a680 x1629905458561424/t0(0) o8->scratch-OST0001-osc-ffff88086e728800@172.16.3.222@o2ib:28/4 lens 520/544 e 0 to 1 dl 1554399083 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1

Apr 4 10:31:33 dtn1 kernel: LustreError: 4664:0:(llite_lib.c:1773:ll_statfs_internal()) obd_statfs fails: rc = -5

Apr 4 10:32:35 dtn1 kernel: LustreError: 4665:0:(llite_lib.c:1773:ll_statfs_internal()) obd_statfs fails: rc = -5

Apr 4 10:32:36 dtn1 kernel: LustreError: 4666:0:(llite_lib.c:1773:ll_statfs_internal()) obd_statfs fails: rc = -5

Apr 4 10:32:50 dtn1 kernel: Lustre: Unmounted scratch-client

Apr 4 10:32:56 dtn1 kernel: LustreError: 4687:0:(obd_config.c:1361:class_process_proc_param()) scratch-clilov-ffff880865264c00: unknown config parameter 'lov.qos_threshold_rr=100'

Apr 4 10:32:56 dtn1 kernel: Lustre: Mounted scratch-client

  • Mount works as normal now

Extra info, before attempting the mount on a fresh boot:

[root@dtn1 ~]# mount -t lustre csmds1.ib@o2ib:csmds2.ib@o2ib:/scratch /scratch

mount.lustre: mount csmds1.ib@o2ib:csmds2.ib@o2ib:/scratch at /scratch failed: Input/output error

Is the MGS running?

[root@dtn1 ~]# lctl ping 172.16.3.219@o2ib

12345-0@lo

12345-172.16.3.219@o2ib

[root@dtn1 ~]# lctl ping 172.16.3.220@o2ib

failed to ping 172.16.3.220@o2ib: Input/output error

[root@dtn1 ~]# lctl ping 172.16.3.220@o2ib

12345-0@lo

12345-172.16.3.220@o2ib
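
Since the first lctl ping to a peer fails and the second succeeds, a possible stopgap is to retry the ping (or the mount) a few times. The sketch below is only a workaround under that assumption and does not address the underlying OFED/Lustre problem; the retry function is a hypothetical helper, not part of Lustre.

```shell
# Hypothetical helper: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS tries have been made.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    attempts="$1"
    delay="$2"
    shift 2
    n=1
    while true; do
        if "$@"; then
            return 0
        fi
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example (as root):
#   retry 3 2 lctl ping 172.16.3.219@o2ib
#   retry 3 2 lctl ping 172.16.3.220@o2ib
#   retry 3 5 mount -t lustre csmds1.ib@o2ib:csmds2.ib@o2ib:/scratch /scratch
```

The attempt counts and delays in the example are arbitrary; three mount attempts matches what it took by hand here.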

FWIW, this is on a client with an EDR adapter. But we’ve also seen this behavior on a host with FDR.

Chris

Hi Christopher,

  1. Are the systems identical when running OFED 4.3 and 4.4 and above?
  2. Please provide a sysinfo snapshot from the affected host with OFED 4.5 and from a working host with OFED 4.3.

The sysinfo tool takes a snapshot of your server with all the relevant information about the Mellanox HCA.

To use the tool, please follow the instructions below:

  1. Download Sysinfo-Snapshot to the server and click on “download” at the bottom left.

(Tool is attached below)

  2. Untar the file by invoking: tar -zxvf sysinfo-snapshot-.tgz

  3. Run the script: ./sysinfo-snapshot.py [flags options below]

3.1) -d | --dir sets the destination directory (default is /tmp).

3.2) -v | --version prints the tool's version and exits

3.3) -fw | --firmware adds firmware commands/functions to the output

3.4) –no_ib does not run server InfiniBand command

3.5) –json adds JSON file to the output

  4. You will get an output file named sysinfo-snapshot-v-HOSTNAME-DATE.tgz located under the /tmp directory, where HOSTNAME is the name of the host and DATE is the date in the format YYYYMMDD-HHMM. The output directory can be changed by using the '-d' parameter.

  5. Please send us the output file.
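
As a rough illustration of the last two steps, the sketch below locates the newest output archive after the script has run. The latest_snapshot function is a hypothetical helper, not part of the tool, and it relies only on the sysinfo-snapshot-*.tgz naming described above. If the downloaded tool tarball sits in the same directory it will match the pattern as well, so keep it elsewhere.

```shell
# Hypothetical helper: print the most recently modified sysinfo-snapshot
# archive in DIR (default /tmp), or nothing if none is found.
latest_snapshot() {
    dir="${1:-/tmp}"
    ls -t "$dir"/sysinfo-snapshot-*.tgz 2>/dev/null | head -n 1
}

# Example, run from the directory where the tool was untarred:
#   ./sysinfo-snapshot.py -d /tmp
#   latest_snapshot /tmp
```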

Thanks & Regards,

Namrata Motihar.