mlx5 drivers crashing on CentOS 8 when running fio test

I am running mlx5 cards (ConnectX6) on CentOS 8.1 and am seeing thousands of the errors shown below when running an fio test from an NFS client against the NFS server. The mount uses NFS v4 over RDMA (NFSoRDMA), and the cards have the latest firmware installed. The errors are taken from the dmesg output on both the client and the server; the kernel version, mst status output and MLNX_OFED version are listed first:

Kernel Version: 4.18.0-147.el8.x86_64

=============

mst status -v

MST modules:


MST PCI module is not loaded

MST PCI configuration module loaded

PCI devices:


DEVICE_TYPE       MST                       PCI      RDMA    NET      NUMA

ConnectX6(rev:0)  /dev/mst/mt4123_pciconf1  af:00.0  mlx5_1  net-ib1  1

ConnectX6(rev:0)  /dev/mst/mt4123_pciconf0  3b:00.0  mlx5_0  net-ib0  0

===============

MLNX_OFED_LINUX-5.0-2.1.8.0

=================

[Sun Jul 26 20:40:45 2020] infiniband mlx5_0: dump_cqe:286:(pid 28589): dump error cqe

[Sun Jul 26 20:40:45 2020] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[Sun Jul 26 20:40:45 2020] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[Sun Jul 26 20:40:45 2020] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[Sun Jul 26 20:40:45 2020] 00000030: 00 00 00 00 00 00 d7 01 00 02 ae 2e 00 02 bf e3

[Sun Jul 26 20:40:45 2020] infiniband mlx5_0: dump_cqe:286:(pid 28589): dump error cqe

[Sun Jul 26 20:40:45 2020] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[Sun Jul 26 20:40:45 2020] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[Sun Jul 26 20:40:45 2020] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[Sun Jul 26 20:40:45 2020] 00000030: 00 00 00 00 00 00 d7 01 00 02 ae 2f 00 02 2b e3

[Sun Jul 26 20:40:45 2020] infiniband mlx5_0: dump_cqe:286:(pid 28589): dump error cqe

[Sun Jul 26 20:40:45 2020] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[Sun Jul 26 20:40:45 2020] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[Sun Jul 26 20:40:45 2020] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

[Sun Jul 26 20:40:45 2020] 00000030: 00 00 00 00 00 00 d7 01 00 02 ae 30 00 02 8d e2
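For anyone triaging similar output, here is a rough Python sketch that groups the dmesg rows into 64-byte CQEs and pulls out a few fields. The byte offsets and the syndrome names are assumptions based on `struct mlx5_err_cqe` and the `MLX5_CQE_SYNDROME_*` values in the upstream kernel header `include/linux/mlx5/device.h`; they have not been checked against the MLNX_OFED build used here, and the vendor syndrome byte still needs vendor tooling to interpret. The function names are made up for illustration.

```python
import re

# One hex-dump row, e.g. "[Sun Jul 26 20:40:45 2020] 00000030: 00 00 ... e3"
ROW = re.compile(r"\]\s*([0-9a-fA-F]{8}):((?:\s+[0-9a-fA-F]{2}){16})\s*$")
HEADER = "dump error cqe"

# ASSUMPTION: subset of syndrome codes as named in the upstream mlx5 header.
SYNDROMES = {
    0x01: "local length error",
    0x02: "local QP operation error",
    0x04: "local protection error",
    0x05: "work request flushed",
    0x13: "remote access error",
    0x15: "transport retry counter exceeded",
    0x16: "RNR retry counter exceeded",
}


def parse_error_cqes(dmesg_text):
    """Collect the 64-byte dump that follows each 'dump error cqe' header."""
    cqes, current = [], None
    for line in dmesg_text.splitlines():
        if HEADER in line:
            current = bytearray()  # start a new dump
            continue
        match = ROW.search(line)
        if current is None or not match:
            continue  # blank line or unrelated kernel message
        current += bytes.fromhex("".join(match.group(2).split()))
        if len(current) == 64:
            cqes.append(bytes(current))
            current = None
    return cqes


def decode_error_cqe(cqe):
    """Extract fields at ASSUMED offsets (layout of struct mlx5_err_cqe)."""
    syndrome = cqe[0x37]
    return {
        "vendor_err_synd": cqe[0x36],
        "syndrome": syndrome,
        "syndrome_name": SYNDROMES.get(syndrome, "unknown"),
        # low 24 bits of the big-endian word at 0x38 are taken as the QP number
        "qpn": int.from_bytes(cqe[0x38:0x3C], "big") & 0xFFFFFF,
        "wqe_counter": int.from_bytes(cqe[0x3C:0x3E], "big"),
    }
```

Feeding the output of `dmesg -T` to `parse_error_cqes` and counting the decoded `syndrome` values should show whether the thousands of errors share one cause. Under the assumed layout, the three dumps above all carry syndrome 0x01 and vendor syndrome 0xd7; treat that as a hint rather than a diagnosis.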

It looks like this may be an issue with NFS v4. After switching to v3, I was able to run the tests without any issues.
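When comparing the v3 and v4 runs, it helps to confirm which version and transport each mount actually negotiated. Here is a minimal Python sketch that reads `/proc/mounts`-style text and reports the `vers=` and `proto=` options of every NFS mount. The function name is made up for illustration, and it assumes both options appear in the mount option string, as they normally do for Linux NFS client mounts.

```python
def nfs_mounts(mounts_text):
    """Return (mount_point, vers, proto) for each NFS entry in the given text."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = dict(o.split("=", 1) for o in fields[3].split(",") if "=" in o)
        result.append((fields[1], opts.get("vers"), opts.get("proto")))
    return result
```

Calling `nfs_mounts(open("/proc/mounts").read())` on the client should list `proto` as `rdma` for an NFSoRDMA mount, with `vers` showing whether the run used v3 or v4.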

Hello,

When a CQE error occurs, Mellanox support parses the dump internally in order to understand why it happened. Unfortunately, this is not something that can be done through the community.

If you have a valid support contract, please send an email to support@mellanox.com and Mellanox support will be happy to assist.

In general, we are not aware of any issue tied to a specific NFS version.

Best Regards.