ib0 stops working on CentOS 7.6

Hi, we are running CentOS 7.6 with kernel 3.10.0-957.10.1.el7.x86_64 and the Mellanox drivers MLNX_OFED_LINUX-4.5-1.0.1.0-rhel7.6-ext, installed with kernel support.

Sometimes ib0 stops working, and dmesg shows output like this:

[Mon Jun 24 20:07:01 2019] Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.0c 12/30/2014

[Mon Jun 24 20:07:01 2019] Call Trace:

[Mon Jun 24 20:07:01 2019] [] dump_stack+0x19/0x1b

[Mon Jun 24 20:07:01 2019] [] __warn+0xd8/0x100

[Mon Jun 24 20:07:01 2019] [] warn_slowpath_fmt+0x5f/0x80

[Mon Jun 24 20:07:01 2019] [] dev_watchdog+0x248/0x260

[Mon Jun 24 20:07:01 2019] [] ? dev_deactivate_queue.constprop.26+0x60/0x60

[Mon Jun 24 20:07:01 2019] [] call_timer_fn+0x38/0x110

[Mon Jun 24 20:07:01 2019] [] ? dev_deactivate_queue.constprop.26+0x60/0x60

[Mon Jun 24 20:07:01 2019] [] run_timer_softirq+0x24d/0x300

[Mon Jun 24 20:07:01 2019] [] __do_softirq+0xf5/0x280

[Mon Jun 24 20:07:01 2019] [] call_softirq+0x1c/0x30

[Mon Jun 24 20:07:01 2019] [] do_softirq+0x65/0xa0

[Mon Jun 24 20:07:01 2019] [] irq_exit+0x105/0x110

[Mon Jun 24 20:07:01 2019] [] smp_apic_timer_interrupt+0x48/0x60

[Mon Jun 24 20:07:01 2019] [] apic_timer_interrupt+0x162/0x170

[Mon Jun 24 20:07:01 2019] [] ? hrtimer_start_range_ns+0x1ed/0x3c0

[Mon Jun 24 20:07:01 2019] [] ? cpuidle_enter_state+0x57/0xd0

[Mon Jun 24 20:07:01 2019] [] ? cpuidle_enter_state+0x4d/0xd0

[Mon Jun 24 20:07:01 2019] [] cpuidle_idle_call+0xde/0x230

[Mon Jun 24 20:07:01 2019] [] arch_cpu_idle+0xe/0xc0

[Mon Jun 24 20:07:01 2019] [] cpu_startup_entry+0x14a/0x1e0

[Mon Jun 24 20:07:01 2019] [] rest_init+0x77/0x80

[Mon Jun 24 20:07:01 2019] [] start_kernel+0x44b/0x46c

[Mon Jun 24 20:07:01 2019] [] ? repair_env_string+0x5c/0x5c

[Mon Jun 24 20:07:01 2019] [] ? early_idt_handler_array+0x120/0x120

[Mon Jun 24 20:07:01 2019] [] x86_64_start_reservations+0x24/0x26

[Mon Jun 24 20:07:01 2019] [] x86_64_start_kernel+0x154/0x177

[Mon Jun 24 20:07:01 2019] [] start_cpu+0x5/0x14

[Mon Jun 24 20:07:01 2019] ---[ end trace d2a01428c663f75b ]---

[Mon Jun 24 20:07:01 2019] ib0: transmit timeout: latency 26 msecs

[Mon Jun 24 20:07:01 2019] ib0: queue (5) stopped, tx_head 179121235, tx_tail 179121170

[Mon Jun 24 20:07:11 2019] ib0: transmit timeout: latency 7 msecs
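When comparing several nodes, it can help to pull just the IPoIB timeout lines out of dmesg. The following is a minimal sketch, not a diagnostic tool: the regular expressions are written against the two message formats shown in the log above, the function name `summarize` is made up for illustration, and reading `tx_head - tx_tail` as the number of sends still outstanding on the stopped queue is an assumption based on the field names (as is treating both as 32-bit counters).

```python
import re

# Patterns matching the two IPoIB messages as they appear in the log above.
TIMEOUT_RE = re.compile(
    r"(?P<dev>\w+): transmit timeout: latency (?P<ms>\d+) msecs"
)
QUEUE_RE = re.compile(
    r"(?P<dev>\w+): queue \((?P<queue>\d+)\) stopped, "
    r"tx_head (?P<head>\d+), tx_tail (?P<tail>\d+)"
)


def summarize(lines):
    """Return (timeouts, stalls) parsed from an iterable of dmesg lines.

    timeouts: list of (device, latency_ms)
    stalls:   list of (device, queue_index, head_minus_tail)
    """
    timeouts, stalls = [], []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts.append((m["dev"], int(m["ms"])))
            continue
        m = QUEUE_RE.search(line)
        if m:
            head, tail = int(m["head"]), int(m["tail"])
            # Assumption: both counters are unsigned 32-bit and may wrap.
            stalls.append((m["dev"], int(m["queue"]), (head - tail) % 2**32))
    return timeouts, stalls


if __name__ == "__main__":
    sample = [
        "[Mon Jun 24 20:07:01 2019] ib0: transmit timeout: latency 26 msecs",
        "[Mon Jun 24 20:07:01 2019] ib0: queue (5) stopped, "
        "tx_head 179121235, tx_tail 179121170",
    ]
    timeouts, stalls = summarize(sample)
    for dev, ms in timeouts:
        print(f"{dev}: transmit timeout, latency {ms} ms")
    for dev, queue, gap in stalls:
        print(f"{dev}: queue {queue} stopped, tx_head - tx_tail = {gap}")
```

For the lines quoted above, the difference between tx_head and tx_tail is 65. To run it on a live node, the output of `dmesg` can be passed to `summarize` line by line.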

If we take ib0 down and try to bring it back up, the whole server freezes and has to be rebooted. Any ideas what could be causing this?

At the moment we can't use the built-in CentOS drivers because they don't deliver the performance we need.

Best regards,

Volker

Hi,

we are experiencing the same behaviour on many nodes, all with the following setup:

  • RHEL 7.6
  • kernel 3.10.0-957.27.2.el7.x86_64
  • OFED 4.5.0.1.0
  • ConnectX-3 Pro card
  • IPoIB in connected mode
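To compare setups like the one listed above across nodes, the same details can be collected programmatically. This is a rough sketch under a few assumptions: the interface is named `ib0`, the IPoIB mode is readable from `/sys/class/net/ib0/mode`, and the `ofed_info` tool from MLNX_OFED is on the PATH. The function names are made up for illustration, and any value that cannot be read is reported as `None`.

```python
import platform
import subprocess
from pathlib import Path


def ipoib_mode(iface="ib0", sysfs_root="/sys/class/net"):
    """Return the IPoIB mode string (e.g. 'connected'), or None if unreadable."""
    try:
        return (Path(sysfs_root) / iface / "mode").read_text().strip()
    except OSError:
        return None


def ofed_version():
    """Return the short OFED version string from `ofed_info -s`, or None."""
    try:
        result = subprocess.run(
            ["ofed_info", "-s"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        # Tool not installed or it exited with an error.
        return None
    return result.stdout.strip().rstrip(":")


def collect(iface="ib0"):
    """Gather kernel release, OFED version and IPoIB mode for one node."""
    return {
        "kernel": platform.release(),
        "ofed": ofed_version(),
        "ipoib_mode": ipoib_mode(iface),
    }


if __name__ == "__main__":
    for key, value in collect().items():
        print(f"{key}: {value}")
```

The card model is not covered here; it would have to come from a tool such as `lspci`.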

Is this related to issue #1538559? I saw it reported for OFED 4.6, but we are on 4.5.

Thanks for any info.

Regards,

Ivano