I’m using the 126.96.36.199 ib_srp driver for ESXi 5.X with ESXi 5.0.0 build 1311175 servers, and every couple of days one of my initiators is disconnected from the storage with an error similar to:

2013-11-29T14:09:51.001Z cpu36:8451)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x2a (0x4125c10b3c00, 8256) to dev "eui.3731346538376162" on path "vmhba_mlx4_0.1.1:C0:T2:L4" Failed: H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL

I have seen this ever since the 188.8.131.52 version came out, across various ESXi 5.0 builds (from U1 build 623860 onward), on both ConnectX-2 QDR (MT26428) and ConnectX-3 FDR10 (MT27500) HCAs, on HP and Dell blade servers, and in 8- and 16-node clusters. After a lot of digging I still have no clue what the cause could be. Please see the attached log.
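If I read the H/D/P triplet correctly, H:0x5 is a host (transport-level) status indicating the command was aborted, rather than the target rejecting it. To see whether one device or path dominates, I group the failures like this (a minimal sketch; the sample file path is illustrative, and the embedded log line is the one from above):

```shell
# Sketch: extract device, path, and host status from NMP throttle messages.
# /tmp/vmkernel-sample.log is a stand-in for the real /var/log/vmkernel.log.
cat <<'EOF' > /tmp/vmkernel-sample.log
2013-11-29T14:09:51.001Z cpu36:8451)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x2a (0x4125c10b3c00, 8256) to dev "eui.3731346538376162" on path "vmhba_mlx4_0.1.1:C0:T2:L4" Failed: H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
EOF

# Count failures grouped by (device, path, host status).
grep 'nmp_ThrottleLogForDevice' /tmp/vmkernel-sample.log \
  | sed -n 's/.*dev "\([^"]*\)" on path "\([^"]*\)" Failed: \(H:0x[0-9a-fA-F]*\).*/\1 \2 \3/p' \
  | sort | uniq -c
```

Running this against the full vmkernel.log on each host shows at a glance whether the aborts cluster on a single HCA port, target, or LUN.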
Modules are set as follows:
~ # esxcli system module parameters list -m ib_srp
Name                  Type  Value  Description
dead_state_time       int   3      Number of minutes a target can be in DEAD state before moving to REMOVED state
debug_level           int          Set debug level (1)
heap_initial          int          Initial heap size allocated for the driver.
heap_max              int          Maximum attainable heap size for the driver.
max_srp_targets       int   128    Max number of srp targets per scsi host (ie. HCA)
max_vmhbas            int          Maximum number of vmhba(s) per physical port (0<x<8)
mellanox_workarounds  int   1      Enable workarounds for Mellanox SRP target bugs if != 0
srp_can_queue         int   256    Max number of commands can queue per scsi_host ie. HCA
srp_cmd_per_lun       int   64     Max number of commands can queue per lun
srp_sg_tablesize      int   128    Max number of scatter lists supported per IO - default is 32
topspin_workarounds   int          Enable workarounds for Topspin/Cisco SRP target bugs if != 0
use_fmr               int   1      Enable/disable FMR support (1)
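In case it helps anyone reproduce the setup: module parameters on ESXi are applied with `esxcli system module parameters set` and take effect after the driver reloads (in practice, a host reboot). A sketch, using the values from the listing above; adjust to your environment:

```shell
# Sketch: set ib_srp module parameters on an ESXi host.
# Values mirror the listing above; this must be run on the ESXi host itself.
esxcli system module parameters set -m ib_srp \
  -p "srp_can_queue=256 srp_cmd_per_lun=64 srp_sg_tablesize=128 max_srp_targets=128"

# Verify the parameters, then reboot so the driver reloads with them.
esxcli system module parameters list -m ib_srp
```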
srp-301.txt.zip (10.7 KB)