Why Mellanox fuses blow and the driver causes a PANIC

We have lost a LOT of Mellanox 40Gb NICs, and only when using Cisco Bi-Di 3.5w optics. When we change to 1.5w MPO SFPs, that issue goes away (at a heavy cost to mgmt).

I have to ask WHY Mellanox isn't resolving the PANIC, as that causes NODE loss. That's the real problem; the NIC loss isn't as painful as a NODE PANIC. Surely you know about this, and I am still dealing with it. When will Mellanox address the PANIC? That's the thorn in our side.

2019-01-08T10:28:04-08:00 <0.6> syslogd: kernel boot file is /boot/kernel.amd64/kernel

2019-01-08T10:28:04-08:00 <0.7> /boot/kernel.amd64/kernel: panic @ time 1546971674.663, thread 0xfffff8052123a780: vm_fault: fault on nofault entry, addr: fffffe002c04d000

2019-01-08T10:28:04-08:00 <0.7> /boot/kernel.amd64/kernel: cpuid = 12

2019-01-08T10:28:04-08:00 <0.7> /boot/kernel.amd64/kernel: Panic occurred in module kernel loaded at 0xffffffff80200000:

2019-01-08T10:28:04-08:00 <0.7> /boot/kernel.amd64/kernel:

2019-01-08T10:28:04-08:00 <0.7> /boot/kernel.amd64/kernel: Stack: --------------------------------------------------

2019-01-08T10:28:04-08:00 <0.7> /boot/kernel.amd64/kernel: kernel:vm_fault_hold+0x17fc

2019-01-08T10:28:04-08:00 <0.7> /boot/kernel.amd64/kernel: kernel:vm_fault+0x76

2019-01-08T10:28:04-08:00 <0.7> /boot/kernel.amd64/kernel: kernel:trap_pfault+0x2a1

2019-01-08T10:28:04-08:00 <0.7> /boot/kernel.amd64/kernel: kernel:trap+0x64c

2019-01-08T10:28:04-08:00 <0.7> /boot/kernel.amd64/kernel: kernel:show_diag_rprt+0x1b

2019-01-08T10:28:04-08:00 <0.7> /boot/kernel.amd64/kernel: kernel:sysctl_root+0x246

2019-01-08T10:28:04-08:00 <0.7> /boot/kernel.amd64/kernel: kernel:userland_sysctl+0x1d1

2019-01-08T10:28:04-08:00 <0.7> /boot/kernel.amd64/kernel: kernel:sys___sysctl+0x73

2019-01-08T10:28:04-08:00 <0.7> /boot/kernel.amd64/kernel: kernel:amd64_syscall+0x396

thx

Mark Licata

Hi Mark,

Please provide the following information:

  1. What is the part number of the Cisco optics?
  2. Output of # ibv_devinfo
  3. Output of # ofed_info
  4. OS and kernel version
  5. Does the issue reproduce as soon as you replace the 1.5w MPO SFPs with Cisco Bi-Di 3.5w optics? If not, please elaborate on steps to reproduce the issue.
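
If it helps, items 2 to 4 can be gathered in one pass. The following is only a rough sketch, assuming Python 3.7 or later is available on the node and that `ibv_devinfo` and `ofed_info` are on the PATH where they are installed. The helper name `collect_diagnostics` is made up for illustration, and no extra command-line flags are assumed. A command that is missing is reported as such instead of raising an error.

```python
import platform
import shutil
import subprocess

# The two commands named in the list above, run with no extra flags.
DEFAULT_COMMANDS = {
    "ibv_devinfo": ["ibv_devinfo"],
    "ofed_info": ["ofed_info"],
}


def collect_diagnostics(commands=None, timeout=30):
    """Hypothetical helper: return OS/kernel info plus the output of each command."""
    if commands is None:
        commands = DEFAULT_COMMANDS
    # platform.system()/release() report the OS name and kernel release string.
    report = {"os": platform.system(), "kernel": platform.release()}
    for name, argv in commands.items():
        # Skip tools that are not installed rather than failing the whole run.
        if shutil.which(argv[0]) is None:
            report[name] = "not found on PATH"
            continue
        try:
            result = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
            report[name] = result.stdout + result.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            report[name] = f"failed: {exc}"
    return report


if __name__ == "__main__":
    for key, value in collect_diagnostics().items():
        print(f"=== {key} ===")
        print(value.strip())
```

Redirecting the printed output to a file and attaching it to the reply would cover those three items; the optics part number and the reproduction steps still need to be answered by hand.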

Thanks,

Namrata.