ib_srp disconnect on ESXi 5.0 with H:0x5 D:0x0 P:0x0 error

Hello everyone,

I’m using the 1.8.2.0 ib_srp drivers for ESXi 5.x on ESXi 5.0.0 build 1311175 servers, and every couple of days one of my initiators gets disconnected from the storage with an error similar to: “2013-11-29T14:09:51.001Z cpu36:8451)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x2a (0x4125c10b3c00, 8256) to dev “eui.3731346538376162” on path “vmhba_mlx4_0.1.1:C0:T2:L4” Failed: H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL”. I’ve seen it ever since version 1.8.1.0 came out, across various ESXi 5.0 builds (from U1 623860), ConnectX-2 QDR (MT26428) and ConnectX-3 FDR10 (MT27500) HCAs, HP and Dell blade servers, and 8- and 16-node clusters, and after a lot of digging I still have no clue what the cause could be. Please see the attached log.

Modules are set as follows:

~ # esxcli system module parameters list -m ib_srp

Name Type Value Description


dead_state_time int 3 Number of minutes a target can be in DEAD state before moving to REMOVED state

debug_level int Set debug level (1)

heap_initial int Initial heap size allocated for the driver.

heap_max int Maximum attainable heap size for the driver.

max_srp_targets int 128 Max number of srp targets per scsi host (ie. HCA)

max_vmhbas int Maximum number of vmhba(s) per physical port (0<x<8)

mellanox_workarounds int 1 Enable workarounds for Mellanox SRP target bugs if != 0

srp_can_queue int 256 Max number of commands can queue per scsi_host ie. HCA

srp_cmd_per_lun int 64 Max number of commands can queue per lun

srp_sg_tablesize int 128 Max number of scatter lists supportted per IO - default is 32

topspin_workarounds int Enable workarounds for Topspin/Cisco SRP target bugs if != 0

use_fmr int 1 Enable/disable FMR support (1)

srp-301.txt.zip (10.7 KB)

Hi xgrv,

Please set the following module parameter:

#esxcli system module parameters set -m=ib_srp -p "debug_level=2"

Once the issue reproduces, kindly attach vmkernel.log.
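To confirm the new value and catch the error live, something like this should do (a minimal sketch; /var/log/vmkernel.log is the standard location on ESXi 5.x, and the parameter usually only takes effect after the driver is reloaded or the host is rebooted):

#esxcli system module parameters list -m ib_srp | grep debug_level

#tail -f /var/log/vmkernel.log | grep -iE "ib_srp|nmp_ThrottleLogForDevice"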

Hi,

Can you please increase the verbosity level and upload the logs?

Thank you rian https://community.mellanox.com/s/profile/0051T000008EaqdQAC and vlad https://community.mellanox.com/s/profile/0051T000008EZckQAG . I’ve enabled debugging on my clusters and will get back to you if I find anything interesting.

Hi!

This is my vSphere 5.5 configuration.


esxcli system module parameters set -m=ib_ipoib -p="ipoib_recvq_size=1024 ipoib_sendq_size=1024 ipoib_mac_type=0"

esxcli system module parameters set -m=mlx4_core -p="mtu_4k=1 msi_x=1"

esxcli system module parameters set -m=ib_srp -p="dead_state_time=5 max_vmhbas=1 srp_can_queue=1024 srp_cmd_per_lun=64 srp_sg_tablesize=32"


Can you try it?
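You can verify the values afterwards (a quick check only; module parameter changes on ESXi generally take effect when the driver is reloaded, so reboot the host after setting them):

esxcli system module parameters list -m ib_ipoib

esxcli system module parameters list -m mlx4_core

esxcli system module parameters list -m ib_srp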

Please see the attached logs. The issue begins at 2013-12-13T20:28:25.785Z, where I have the first occurrence of the “WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:4:7 (driver name: ib_srp)” error. From this moment ESXi has queue problems (“Reduced the queue depth for device eui.3966613235306564 to 25, due to queue full/busy conditions”) and the only solution is to vMotion all the VMs off the host and reboot it. Relocated VMs don’t cause this on their new hosts, and no other ESXi server in the cluster has this problem at the same time.
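Before rebooting I also try to capture the device queue state, in case it helps someone (a rough sketch; eui.3966613235306564 is the device from the log above, and the exact field names may differ between ESXi builds):

~ # esxcli storage core device list -d eui.3966613235306564 | grep -i queue

In esxtop, pressing ‘u’ switches to the disk device view, where DQLEN/ACTV/QUED can be watched while the errors are being logged.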

Hello inbusiness https://community.mellanox.com/s/profile/0051T000008EZgrQAG

Your ib_ipoib and mlx4_core parameters are identical to mine; I’ll try the different max_vmhbas and the default srp_sg_tablesize as you suggested. Thanks!

Hello everyone,

Thank you very much for the time you’ve spent trying to help. I’m not 100% sure, but I’ve probably found the solution for the “H:0x0 D:0x8 P:0x0” errors and the weird HCA behaviour, where errors occurred under no significant storage or target load. Stress testing with fio never showed any problems, which is why I suspected the hangs in the BUSY states were caused by something else. I’m using the SCST target stack and played with ibdump recently to find out that the cause was indeed limited to the initiator side. What I did wrong is that I had set up the blades to use OS Control for power management, and even with the policy set to High Performance in ESXi the servers were throttling in C1/C1E states and probably interfering with PCI-Express power management too. This is supposed to affect only latency, but explicitly disabling these features in the BIOS made my logs clean. Only 7 days have passed since the change, so it’s still too early to be certain, but I have a good feeling.
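For reference, this is roughly how I check and set the host-side policy (a sketch only; /Power/CpuPolicy is the standard advanced option on ESXi 5.x, but in my case setting High Performance in ESXi was not enough and the C-state/C1E options had to be disabled in the blade BIOS itself):

~ # esxcli system settings advanced list -o /Power/CpuPolicy

~ # esxcli system settings advanced set -o /Power/CpuPolicy -s "High Performance"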

Thanks again!

I’ve found another log line that looks interesting.

2014-05-28T16:25:43.094Z cpu3:4727885)WARNING: LinuxSocket: 1854: UNKNOWN/UNSUPPORTED socketcall op (whichCall=0x12, args@0xffff8d3c)

2014-05-28T16:32:47.418Z cpu5:4261167)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 137263 times

2014-05-28T16:48:05.196Z cpu15:1214355)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 88409 times

2014-05-28T17:03:05.197Z cpu7:32812)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 127160 times

2014-05-28T17:07:47.885Z cpu11:4734778)WARNING: UserEpoll: 542: UNSUPPORTED events 0x40

2014-05-28T17:07:48.729Z cpu0:4734778)WARNING: LinuxSocket: 1854: UNKNOWN/UNSUPPORTED socketcall op (whichCall=0x12, args@0xffdd3c9c)

2014-05-28T17:18:05.205Z cpu8:36687)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 133406 times

2014-05-28T17:33:05.221Z cpu5:45877)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 130912 times

2014-05-28T17:48:05.235Z cpu2:47863)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 19074 times

2014-05-28T18:03:23.899Z cpu11:4738029)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 119658 times

^^

Bye

Thomas

max_vmhbas=1

  • This is the default value. It is the number of vHBAs per IB port.

srp_can_queue=1024

  • This is the physical HCA’s total QD (queue depth).

  • Almost all recent FC HBAs can support 4096 logins.

  • But I don’t know what the maximum value is, nor the default one.

  • 1024 is just half of an old 4Gb FC HBA’s queue depth.

srp_cmd_per_lun=64

  • This is the maximum QD PER LUN.

  • The default value is 8, but I don’t know what the maximum value is.

  • From googling, the default value on a general Linux system is 63.

srp_sg_tablesize=32

  • This is the default value on vSphere and on general Linux systems.

These are very important factors in a vSphere environment with shared block storage. Every block storage array has a QD limit, and the number of hosts and each HBA’s QD value are also very important.
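As a rough, purely illustrative example (numbers assumed, not taken from this thread): if a target array can service 2048 outstanding commands and it is shared by 8 ESXi hosts, each seeing 4 LUNs with srp_cmd_per_lun=64, the worst-case demand is 8 × 4 × 64 = 2048 commands, i.e. the target limit is already reached, and any burst beyond that comes back as QUEUE FULL/BUSY. With srp_can_queue=1024 per host, the HCAs alone could push up to 8 × 1024 = 8192 commands if enough LUNs are mapped.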

I think Mellanox should document the default and maximum value of every parameter in the manuals.

The ESXi host with the queue problems has been in Maintenance Mode for 4 hours now, and after I issue an “esxcli storage core adapter rescan --all” command, this pops up in the vmkernel.log file:

2014-01-03T14:10:11.389Z cpu56:14437)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:1:6 (driver name: ib_srp) - Message repeated 561 times

2014-01-03T14:11:31.334Z cpu42:8234)ScsiDeviceIO: 2311: Cmd(0x4125c0c0b300) 0x28, CmdSN 0x2b from world 1067687 to dev “eui.3731346538376162” failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

2014-01-03T14:11:31.366Z cpu42:8234)ScsiDeviceIO: 2311: Cmd(0x4125c084bf40) 0x28, CmdSN 0x2f from world 1067687 to dev “eui.3731346538376162” failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x5 0x25 0x0.

2014-01-03T14:11:31.388Z cpu42:8234)ScsiDeviceIO: 2311: Cmd(0x4125c00e4fc0) 0x28, CmdSN 0x30 from world 1067687 to dev “eui.3731346538376162” failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

2014-01-03T14:11:31.755Z cpu42:8234)ScsiDeviceIO: 2311: Cmd(0x4125c0d69000) 0x28, CmdSN 0x32 from world 1067687 to dev “eui.3632656331666463” failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x5 0x25 0x0.

It looks like the ib_srp driver is stuck on some SCSI commands, which are still being retried even after the VM world is gone.

Okay!

  1. What’s your HCA model and firmware?

  2. What’s your IB switch?

  3. What’s your IB SM? Does it reside in the switch or on a dedicated Linux host? Can you show me your SM configuration and ESXi configuration?

  4. What’s your IB cable vendor and model?

  5. What’s your SRP target and its OS? Can you also show me the LUN count per target?

  6. How many ESXi hosts do you have?

I think there is a QUEUE FULL problem in your environment.

I also have a question about the maximum value of each SRPT parameter.

I’m using an OmniOS ZFS SRPT now.

If you give me some information about your setup, I’ll help you solve your problems…

Hi Konrad,

I discussed your issue with some smart fellows from Mellanox; this is what they had to say:

  • There is a similar internal record for such an issue, also discussed in the communities. The issue had to do with the SRP host receiving SCSI error status (0x18) for read/write commands and 0x2 (check condition) for REPORT LUNS commands.

The target also slows down on some of the LUNs, resulting in SRP host aborts, which vmkernel reports with the error “H:0x5 D:0x0 P:0x0” (H:0x5 is SG_ERR_DID_ABORT).

That specific user also noticed the following errors in the syslog:

2013-12-15T05:46:50.282Z stratus105.api.oktawave.corp vmkwarning: cpu11:4107)WARNING: LinScsi: SCSILinuxQueueCommand:1175:queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:4:8 (driver name: ib_srp) - Message repeated 5956 times

Regarding this particular error, the VMware community also reports the same issue with a different transport (SATA) due to LUN/device BUSY on ESX 5.1: https://communities.vmware.com/thread/445270

However, the resolution provided there is to downgrade to ESX 5.0 U2, which does not help.

Please also look at this VMware KB article, which explains the “H:0x5 D:0x0 P:0x0” error:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=289902

You didn’t get a full solution from me, and maybe you already had the pointers I provided. Sorry I couldn’t help further. Let the group know if you find a solution or a workaround for this issue.

Thanks

Hello again thowie https://community.mellanox.com/s/profile/0051T000008EavxQAC . I’ve managed to find a way to quickly trigger the “H:0x0 D:0x8 P:0x0” errors by doing some stress testing with fio. Try to deploy 2-4 VMs and run ‘fio --verify=md5 --rw=write --size=8000m --bs=4k --loops=60 --runtime=60m --group_reporting --sync=1 --direct=1 --directory=/mnt/sdb1 --ioengine=libaio --numjobs=32 --thread --name=srp’ on them. The communication will work fine for 10-30 minutes, but after that you should see the 0x8 flood in the ESXi logs and the queues jumping. At first I was convinced that the “DEVICE BUSY” came from the target, which has logic to return such SCSI responses when the req_lim for a LUN is exceeded. But after playing with the ibdump tool to capture the traffic, it seems that no such responses are sent between the target and initiator LIDs. You can look for them yourself with the “infiniband.bth.opcode == 4 && data.data[0] == c1 && data.data[19] != 0” filter in Wireshark on the pcap dump files.
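In case someone wants to repeat the capture, this is roughly what I do on the Linux target side while fio runs (a sketch with assumed device/port names, mlx4_0 port 1 here, and the display filter from above; older tshark builds take -R instead of -Y for display filters):

# ibdump -d mlx4_0 -i 1 -w /tmp/srp.pcap

# tshark -r /tmp/srp.pcap -Y 'infiniband.bth.opcode == 4 && data.data[0] == c1 && data.data[19] != 0'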

Changing the ib_srp module parameters on ESXi doesn’t help, yet using a Linux initiator instead of ESXi shows no such errors at all.

The “H:0x5 D:0x0 P:0x0” error was corrected by optimizing the latency, but “H:0x0 D:0x8 P:0x0” looks like a bug in how the ESXi module tracks SRP credits, according to a friend who helped me track the issue down. I hope it will be corrected in the upcoming ib_iser module.

regards

Hello,

did the workaround help? I have the same error on ESXi 5.5.

the vmkernel.log shows:

2014-05-09T09:39:15.860Z cpu14:32819)<3>vmnic_ib0:vmipoib_start_rx:410: unsupported protocol (0x888)

2014-05-09T09:39:25.950Z cpu12:32799)<3>vmnic_ib0:vmipoib_start_rx:410: unsupported protocol (0x888)

2014-05-09T09:39:32.602Z cpu12:32817)ScsiDeviceIO: 2324: Cmd(0x413680b08cc0) 0x28, CmdSN 0x10aa from world 35932 to dev “naa.600144f098a757880dff532425080001” failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

2014-05-09T09:39:36.050Z cpu12:33445)<3>vmnic_ib0:vmipoib_start_rx:410: unsupported protocol (0x888)

2014-05-09T09:39:46.141Z cpu14:32799)<3>vmnic_ib0:vmipoib_start_rx:410: unsupported protocol (0x888)

2014-05-09T09:39:48.304Z cpu18:32823)ScsiDeviceIO: 2324: Cmd(0x413680b08cc0) 0x28, CmdSN 0x10b1 from world 35932 to dev “naa.600144f098a757880dff532425080001” failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

2014-05-09T09:39:51.006Z cpu22:32827)ScsiDeviceIO: 2324: Cmd(0x413680b08cc0) 0x28, CmdSN 0x10b2 from world 35932 to dev “naa.600144f098a757880dff532425080001” failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

2014-05-09T09:39:56.260Z cpu14:32799)<3>vmnic_ib0:vmipoib_start_rx:410: unsupported protocol (0x888)

2014-05-09T09:40:06.351Z cpu18:32799)<3>vmnic_ib0:vmipoib_start_rx:410: unsupported protocol (0x888)

2014-05-09T09:40:16.441Z cpu12:32799)<3>vmnic_ib0:vmipoib_start_rx:410: unsupported protocol (0x888)

2014-05-09T09:40:26.532Z cpu17:32822)<3>vmnic_ib0:vmipoib_start_rx:410: unsupported protocol (0x888)

2014-05-09T09:40:30.157Z cpu16:32821)ScsiDeviceIO: 2324: Cmd(0x4136856b1080) 0x28, CmdSN 0x10c0 from world 35932 to dev “naa.600144f098a757880dff532425080001” failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.

and in the vmkwarning.log i see:

2014-05-09T08:48:28.140Z cpu12:39950)WARNING: VSCSIFilter: 1428: Failed to issue ioctl to get unmap readback type: Inappropriate ioctl for device

2014-05-09T08:53:38.629Z cpu0:4537932)WARNING: Hbr: 863: Failed to receive from 192.168.70.97 (groupID=GID-c4550a37-b6f0-41f6-b853-42ba3e028895): Broken pipe

2014-05-09T09:00:18.639Z cpu15:34122)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 77318 times

2014-05-09T09:15:18.651Z cpu16:4168540)WARNING: LinScsi: SCSILinuxQueueCommand:1207: queuecommand failed with status = 0x1056 Unknown status vmhba_mlx4_0.1.1:0:2:0 (driver name: ib_srp) - Message repeated 14515 times

The ESXi server controls the power management, and in the BIOS the ACPI C-states and ACPI P-states are enabled…

Hope you have some Information.

THX

Sounds good my friend! Happy to see you were able to work around the issue.

What SCSI target are you using, and what are the ib_srp module parameters on the ESXi hosts?

Hello xgrv,

thanks for your reply. You are crazy ^^ I will test it with fio in the next days… I use 3 Solaris ZFS targets with IB for 4 ESXi 5.5 servers. My parameters look like this:

~ # esxcli system module parameters list -m ib_srp

Name Type Value Description


dead_state_time int 5 Number of minutes a target can be in DEAD state before moving to REMOVED state

debug_level int Set debug level (1)

heap_initial int Initial heap size allocated for the driver.

heap_max int Maximum attainable heap size for the driver.

max_srp_targets int Max number of srp targets per scsi host (ie. HCA)

max_vmhbas int 1 Maximum number of vmhba(s) per physical port (0<x<8)

mellanox_workarounds int Enable workarounds for Mellanox SRP target bugs if != 0

srp_can_queue int 1024 Max number of commands can queue per scsi_host ie. HCA

srp_cmd_per_lun int 64 Max number of commands can queue per lun

srp_sg_tablesize int 32 Max number of scatter lists supportted per IO - default is 32

topspin_workarounds int Enable workarounds for Topspin/Cisco SRP target bugs if != 0

use_fmr int Enable/disable FMR support (1)

What did you mean by ‘the “H:0x5 D:0x0 P:0x0” error was corrected by optimizing the latency’???

I think I had a similar problem with FC cards in ESXi… That was the reason why we changed to InfiniBand, and it is great. Hopefully we’ll find a solution for this bad problem…

Thank you very much xgrv… If I can help you, I will do what I can…

Thanks

Thomas

Original message from: @Konrad Grzybowski