ESX 5.1 IPoIB driver crash

Hello,

after two weeks of testing and firmware patching I think we have found a major bug in the ESX 5.1 OFED 1.8.1.0 IPoIB driver. We are currently running a Fujitsu RX300 S6 (dual Xeon X5670) with a Mellanox ConnectX-2 MHRH2A (firmware 2.9.1200). The storage server runs Ubuntu 12.04 LTS with an older ConnectX (PCIe Gen2) card and Linux kernel 3.5. In between sits a 24-port DDR Flextronics IB CX4 switch, so our maximum MTU is limited to 2K, but that is no problem for us.

On the ESX host the InfiniBand card serves as a VMkernel interface and as a VM port group at the same time. A running VM has its “local” disks mounted over the VMkernel interface via IPoIB. Inside the VM we have mounted an NFS filesystem from the NFS server. So it looks like:

vm:~ # df
Filesystem                     1K-blocks       Used   Available Use% Mounted on
/dev/sda1                       61927388    3577888    55203784   7% /        (mounted by ESX)
10.10.30.253:/var/nas/backup 11007961088 6360753152  4647207936  58% /backup  (mounted inside VM)

To reproduce the error we copy data into the VM using SCP, with /backup as the target. After copying a few gigabytes of data the InfiniBand card stops working and the ESX kernel logs the following error message. This situation cannot be resolved without rebooting the ESX host.

WARNING: LinDMA: Linux_DMACheckContraints:149:Cannot map machine address = 0x15ffff37b0, length = 65160 for device 0000:02:00.0; reason = buffer straddles device dma boundary (0xffffffff)
<3>vmnic_ib1:ipoib_send:504: found skb where it does not belong
tx_head = 323830, tx_tail = 323830
<3>vmnic_ib1:ipoib_send:505: netif_queue_stopped = 0
Backtrace for current CPU #20, worldID=8212, ebp=0x41220051b028
ipoib_send@#+0x5d4 stack: 0x41800c4524aa, 0x4f0f5000000d
ipoib_send@#+0x5d4 stack: 0x41800c44bca8, 0x41000fe5d6c0
ipoib_start_xmit@#+0x53 stack: 0x41220051b238, 0x41800c4
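As a side note (my own interpretation, not taken from the driver source): the numbers in the warning are consistent with a DMA segment crossing a 4 GiB window. The buffer starts at 0x15ffff37b0 and its last byte sits at 0x15ffff37b0 + 65160 - 1 = 0x1600003637, so start and end fall on different sides of the 0xffffffff boundary. A minimal standalone sketch of that boundary check, using the values from the log:

#include <stdint.h>
#include <stdio.h>

/* Illustrative boundary check with the values from the warning above.
 * The mask logic mirrors the usual "segment must not cross the dma
 * boundary" rule; the real VMkernel/LinDMA code is not shown here. */
int main(void)
{
    uint64_t addr     = 0x15ffff37b0ULL;  /* machine address from the log   */
    uint64_t len      = 65160;            /* length from the log            */
    uint64_t boundary = 0xffffffffULL;    /* 32-bit boundary (4 GiB window) */

    uint64_t start_window = addr & ~boundary;
    uint64_t end_window   = (addr + len - 1) & ~boundary;

    printf("start window 0x%llx, end window 0x%llx -> %s\n",
           (unsigned long long)start_window,
           (unsigned long long)end_window,
           start_window == end_window ? "fits" : "straddles dma boundary");
    return 0;
}

Compiled with any C compiler, this reports that the buffer straddles the boundary, which matches the LinDMA complaint.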

In the process of eliminating the error we tried the following (without success):

  1. Updated the server firmware to the latest version

  2. Switched from the ConnectX to the ConnectX-2 card

  3. Switched from firmware 2.9.1000 to 2.9.1200

Everything works fine if we use the InfiniBand card only as a VMkernel interface. More details are in my first post: Infrastructure & Networking - NVIDIA Developer Forums

Any help is appreciated.

Wonderful!

I will check with the folks if they have an ETA for a permanent fix.

Hi Markus,

Thank you for taking the time to post. I poked around with some smart engineers and was able to get some insight, in addition to the data you provided.

The issue here is the SCSI mid-layer modifying the DMA device's dma_boundary attribute underneath IPoIB (from 64-bit to 32-bit).

This happens because SRP adds a new SCSI host while leaving the dma_boundary attribute of its scsi_host template at the default.

In that case the SCSI mid-layer overrides the DMA device's dma_boundary with the default (a 32-bit boundary), causing IPoIB allocations that cross the 32-bit boundary to fail and possibly crash.
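For readers wondering where the 32-bit default comes from: in the Linux SCSI mid-layer (which the ESX OFED stack is derived from), scsi_host_alloc() falls back to a 0xffffffff dma_boundary whenever the host template leaves the field unset. The standalone sketch below only models that selection logic, using the mainline field name dma_boundary; the actual ESX 5.1 compat code and the SRP template are not reproduced here:

#include <stdint.h>
#include <stdio.h>

/* Standalone model of the dma_boundary selection done when a SCSI host is
 * added (patterned after mainline Linux scsi_host_alloc()). A template that
 * leaves dma_boundary at 0 gets the 32-bit default, which then constrains
 * DMA mappings for the whole device, including IPoIB traffic on the same
 * HCA. Simplified structures; not the real kernel definitions. */

struct host_template_model { uint64_t dma_boundary; };
struct scsi_host_model     { uint64_t dma_boundary; };

static void model_scsi_host_alloc(struct scsi_host_model *shost,
                                  const struct host_template_model *sht)
{
    if (sht->dma_boundary)
        shost->dma_boundary = sht->dma_boundary;   /* driver-chosen boundary */
    else
        shost->dma_boundary = 0xffffffffULL;       /* default: 4 GiB windows */
}

int main(void)
{
    /* An SRP-like template that never sets dma_boundary (the case described
     * above) versus a hypothetical template that asks for a 64-bit boundary. */
    struct host_template_model srp_default = { .dma_boundary = 0 };
    struct host_template_model srp_64bit   = { .dma_boundary = ~0ULL };
    struct scsi_host_model host;

    model_scsi_host_alloc(&host, &srp_default);
    printf("template with default  -> dma_boundary = 0x%llx\n",
           (unsigned long long)host.dma_boundary);

    model_scsi_host_alloc(&host, &srp_64bit);
    printf("template set explicitly -> dma_boundary = 0x%llx\n",
           (unsigned long long)host.dma_boundary);
    return 0;
}

Removing the SRP driver simply means no extra SCSI host is registered, so the device keeps its 64-bit boundary and IPoIB mappings are no longer clamped.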

To avoid this problem, it is recommended to uninstall SRP (if you do not need it) using:

$ esxcli software vib remove -n scsi-ib-srp

$ reboot

I hope that it will help…

Cheers!

Fine,

after 200 GB of transferred data without problems I can confirm that your workaround fixes our problem. We do not need SRP, so no more headaches. Maybe you could let other interested users know whether this will be fixed in a future driver version.

Thanks.

The guys are saying a fix is coming soon (a matter of 2-3 weeks). Hold on…

Do you mean that a new driver will be released for vSphere 5.x?