Storage Spaces Direct Windows Server 2016 (1607) BSOD - Mellanox ConnectX-3 Pro (Dell)

Good afternoon,

There is very little documentation specific to Windows Server 2016; much of the RDMA/RoCE documentation refers to Windows Server 2012 (R2) Storage Spaces. So I figured I’d start a conversation here to help others who are also looking at Microsoft Storage Spaces Direct (S2D) in Windows Server 2016.

I currently have an open case with Dell ProSupport regarding a BSOD that my two-node cluster encounters. Either node will just halt and restart after 60 seconds when stress testing the environment. Each server is configured as follows…

  • Dell 13th Gen R730XD
  • 2x 120GB Intel SSDs SSDSC2BB120G6R (OS Mirror)
  • 6x 1.6TB SSDs SSDSC2BX016T4R
  • 6x 8TB HDDs ST8000NM0055-1RM112
  • 2x Intel DC P3700 800GB (Journal / Cache)
  • 256GB 2400MHz Memory
  • HBA330 Mini Controller
  • 1x Mellanox ConnectX-3 Pro (MT04103) Dual Port SFP+ 10GbE (Firmware Version: 2.26.50.80 / Driver Version: 2.25.12665.0)
  • Running Windows Server 2016 DataCenter 1607 Build 14393.693

Each server has two links to a Dell N4032F Switch.

To rule out a possible fault with my switch config, Dell advised I directly connect the two nodes together. I know RDMA is engaged because I can see the traffic in Performance Monitor.

Here’s the order in which I’ve set up my environment…

  1. Install the OS and fully update/patch
  2. Set Windows Power Mode to Performance
  3. Install Windows Features - Hyper-V / File-Services / Failover-Clustering / Data-Center-Bridging
  4. Install Dell drivers for all hardware, including the Mellanox NICs. (I’ve tried both the Mellanox drivers and Dell’s; they appear to be the same: MLNX_VPI_WinOF-5_25_All_Win2016_x64 / Driver Version: 2.25.12665.0)
  5. Perform the network configuration: create a Hyper-V SET switch joined to both ports of the Mellanox NIC, then create two vNICs connected to the new switch with a VLAN tag. (See attached file)
  6. Create the failover cluster and enable Storage Spaces Direct. (See attached file)

Everything appears to be okay, then it’ll randomly crash. Below is a memory dump; this is what I receive on either host. I want to upgrade the firmware, but the card has a Dell product code, so I’m stuck. It’s been three weeks and we still don’t have a working environment. I also have another debug output further below…


* Bugcheck Analysis *


DRIVER_POWER_STATE_FAILURE (9f)

A driver has failed to complete a power IRP within a specific time.

Arguments:

Arg1: 0000000000000003, A device object has been blocking an Irp for too long a time

Arg2: ffffa48778febe20, Physical Device Object of the stack

Arg3: ffffc080258f4960, nt!TRIAGE_9F_POWER on Win7 and higher, otherwise the Functional Device Object of the stack

Arg4: ffff9c8fe2328010, The blocked IRP

StorageSpacesSetup - Network.txt.zip (1.16 KB)

StorageSpacesSetup - Cluster.txt.zip (889 Bytes)

Debugging Details:


Implicit thread is now ffff9c8f`e23a8080

DUMP_CLASS: 1

DUMP_QUALIFIER: 401

BUILD_VERSION_STRING: 14393.693.amd64fre.rs1_release.161220-1747

SYSTEM_MANUFACTURER: Dell Inc.

SYSTEM_PRODUCT_NAME: PowerEdge R730xd

SYSTEM_SKU: SKU=NotProvided;ModelName=PowerEdge R730xd

BIOS_VENDOR: Dell Inc.

BIOS_VERSION: 2.3.4

BIOS_DATE: 11/08/2016

BASEBOARD_MANUFACTURER: Dell Inc.

BASEBOARD_PRODUCT: 0WCJNT

BASEBOARD_VERSION: A04

DUMP_TYPE: 1

BUGCHECK_P1: 3

BUGCHECK_P2: ffffa48778febe20

BUGCHECK_P3: ffffc080258f4960

BUGCHECK_P4: ffff9c8fe2328010

DRVPOWERSTATE_SUBCODE: 3

FAULTING_THREAD: e23a8080

CPU_COUNT: 38

CPU_MHZ: 960

CPU_VENDOR: GenuineIntel

CPU_FAMILY: 6

CPU_MODEL: 4f

CPU_STEPPING: 1

CPU_MICROCODE: 6,4f,1,0 (F,M,S,R) SIG: B00001E'00000000 (cache) B00001E'00000000 (init)

DEFAULT_BUCKET_ID: WIN8_DRIVER_FAULT

BUGCHECK_STR: 0x9F

PROCESS_NAME: System

CURRENT_IRQL: 2

ANALYSIS_SESSION_HOST: PHALFORDPC

ANALYSIS_SESSION_TIME: 01-26-2017 10:07:27.0372

ANALYSIS_VERSION: 10.0.14321.1024 amd64fre

LAST_CONTROL_TRANSFER: from fffff800d1ce5f5c to fffff800d1dcf506

STACK_TEXT:

ffffc0802afcd6a0 fffff800d1ce5f5c : 0000000000000000 0000000000000001 ffffa48779d23801 fffff800d1d47359 : nt!KiSwapContext+0x76

ffffc0802afcd7e0 fffff800d1ce59ff : ffffa48770040100 0000000000000000 0000000000000000 fffff80000000000 : nt!KiSwapThread+0x17c

ffffc0802afcd890 fffff800d1ce77c7 : ffffc08000000000 fffff80d41a33a01 ffffa48770040130 0000000000000000 : nt!KiCommitThreadWait+0x14f

ffffc0802afcd930 fffff80d41a0aaba : ffffa487790a6c90 ffffa48700000000 fffff80d41a44000 ffffa48700000000 : nt!KeWaitForSingleObject+0x377

ffffc0802afcd9e0 fffff80d3b05debf : 0000000000000000 0000000000000006 ffffa48778fd3980 fffff80d3b428bf9 : mlx4eth63+0x4aaba

ffffc0802afcda30 fffff80d3b0f6f80 : ffffa48771c971a0 0000000000000000 ffff9c8fe2328010 0000000000000000 : NDIS!ndisMInvokeShutdown+0x53

ffffc0802afcda60 fffff80d3b0b910a : ffffa48771c971a0 0000000000000000 0000007ffffffff8 ffff9c8ec5249bb0 : NDIS!ndisMShutdownMiniport+0xb4

ffffc0802afcda90 fffff80d3b09d342 : 0000000000000000 0000000000000000 ffff9c8fe2328010 ffffa48771c971a0 : NDIS!ndisSetSystemPower+0x1bdc6

ffffc0802afcdb10 fffff80d3b01fc28 : ffff9c8fe2328010 ffffa48778febe20 ffff9c8fe2328200 ffffa48771c97050 : NDIS!ndisSetPower+0x96

ffffc0802afcdb40 fffff800d1d9a1c2 : ffff9c8fe23a8080 ffffc0802afcdbf0 fffff800d1f80600 ffffa48771c97050 : NDIS!ndisPowerDispatch+0xa8

ffffc0802afcdb70 fffff800d1c82729 : fffffffffa0a1f00 fffff800d1d99fe4 ffff9c8ec9cb8120 00000000000001d1 : nt!PopIrpWorker+0x1de

ffffc0802afcdc10 fffff800d1dcfbb6 : ffffc08025955180 ffff9c8fe23a8080 fffff800d1c826e8 0000000000000000 : nt!PspSystemThreadStartup+0x41

ffffc0802afcdc60 0000000000000000 : ffffc0802afce000 ffffc0802afc8000 0000000000000000 0000000000000000 : nt!KiStartSystemThread+0x16

STACK_COMMAND: .thread 0xffff9c8fe23a8080 ; kb

THREAD_SHA1_HASH_MOD_FUNC: b7cf6cc0234897f6fd93ad4ead1f75c9e7fd9df1

THREAD_SHA1_HASH_MOD_FUNC_OFFSET: 263f1d39481efd9f34c4df5786cc37534825cc6e

THREAD_SHA1_HASH_MOD: 1de60aba82b9f9b6af56a445a099815cd801e5d9

FOLLOWUP_IP:

mlx4eth63+4aaba

fffff80d41a0aaba 488d152f050300 lea rdx,[mlx4eth63+0x7aff0 (fffff80d41a3aff0)]

FAULT_INSTR_CODE: 2f158d48

SYMBOL_STACK_INDEX: 4

SYMBOL_NAME: mlx4eth63+4aaba

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: mlx4eth63

IMAGE_NAME: mlx4eth63.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 57c2dc3b

BUCKET_ID_FUNC_OFFSET: 4aaba

FAILURE_BUCKET_ID: 0x9F_3_POWER_DOWN_mlx4eth63!unknown_function

BUCKET_ID: 0x9F_3_POWER_DOWN_mlx4eth63!unknown_function

PRIMARY_PROBLEM_CLASS: 0x9F_3_POWER_DOWN_mlx4eth63!unknown_function

TARGET_TIME: 2017-01-26T09:54:25.000Z

OSBUILD: 14393

OSSERVICEPACK: 0

SERVICEPACK_NUMBER: 0

OS_REVISION: 0

SUITE_MASK: 400

PRODUCT_TYPE: 3

OSPLATFORM_TYPE: x64

OSNAME: Windows 10

OSEDITION: Windows 10 Server TerminalServer DataCenter SingleUserTS

OS_LOCALE:

USER_LCID: 0

OSBUILD_TIMESTAMP: 2016-12-21 06:50:57

BUILDDATESTAMP_STR: 161220-1747

BUILDLAB_STR: rs1_release

BUILDOSVER_STR: 10.0.14393.693.amd64fre.rs1_release.161220-1747

ANALYSIS_SESSION_ELAPSED_TIME: 6ba

ANALYSIS_SOURCE: KM

FAILURE_ID_HASH_STRING: km:0x9f_3_power_down_mlx4eth63!unknown_function

FAILURE_ID_HASH: {476104f0-13a3-bd96-8e08-ff1f10ccd888}

Followup: MachineOwner
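
For anyone comparing several dumps like the one above, here is a minimal Python sketch that pulls the `KEY: value` fields out of pasted `!analyze -v` text and keeps the few that usually matter for triage. The function names and the choice of "interesting" fields are my own assumptions for illustration; this is not a debugger API.

```python
import re

# Matches upper-case field lines such as "MODULE_NAME: mlx4eth63".
# Lines like "Arg1: ..." are skipped because the key is not all upper case.
FIELD_RE = re.compile(r"^([A-Z][A-Z0-9_]*):\s*(.*)$")


def parse_analyze_fields(text):
    """Return a dict of KEY -> value for the upper-case fields in the text."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line.strip())
        if match:
            key, value = match.groups()
            # Keep the first occurrence if a key is repeated.
            fields.setdefault(key, value.strip())
    return fields


def summarize(fields):
    """Pick out a (hypothetical) shortlist of fields for quick comparison."""
    wanted = ("BUGCHECK_STR", "MODULE_NAME", "IMAGE_NAME", "FAILURE_BUCKET_ID")
    return {key: fields[key] for key in wanted if key in fields}


# A few lines copied from the dump in this post.
SAMPLE = """\
BUGCHECK_STR: 0x9F
PROCESS_NAME: System
MODULE_NAME: mlx4eth63
IMAGE_NAME: mlx4eth63.sys
FAILURE_BUCKET_ID: 0x9F_3_POWER_DOWN_mlx4eth63!unknown_function
"""

if __name__ == "__main__":
    for key, value in summarize(parse_analyze_fields(SAMPLE)).items():
        print(f"{key}: {value}")
```

Running the same function over each dump makes it easier to see at a glance whether two crashes land in the same failure bucket or in different ones.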

This is another one…

Microsoft (R) Windows Debugger Version 10.0.14321.1024 AMD64

Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [D:\MEMORY.DMP]

Kernel Bitmap Dump File: Kernel address space is available, User address space may not be available.

Symbol search path is: srv*

Executable search path is:

Windows 10 Kernel Version 14393 MP (56 procs) Free x64

Product: Server, suite: TerminalServer DataCenter SingleUserTS

Built by: 14393.693.amd64fre.rs1_release.161220-1747

Machine Name:

Kernel base = 0xfffff80196a11000 PsLoadedModuleList = 0xfffff80196d16060

Debug session time: Fri Jan 20 16:16:45.177 2017 (UTC + 0:00)

System Uptime: 0 days 1:40:08.946

Loading Kernel Symbols

Loading User Symbols

Loading unloaded module list


* Bugcheck Analysis *


Use !analyze -v to get detailed debugging information.

BugCheck 133, {1, 1e00, 0, 0}

Page 4200 not present in the dump file. Type ".hh dbgerr004" for details

Page 4200 not present in the dump file. Type ".hh dbgerr004" for details

Page 4200 not present in the dump file. Type ".hh dbgerr004" for details

Probably caused by : mrxsmb.sys ( mrxsmb!SmbWskSend+1f2 )

Followup: MachineOwner


53: kd> !analyze -v


* Bugcheck Analysis *


DPC_WATCHDOG_VIOLATION (133)

The DPC watchdog detected a prolonged run time at an IRQL of DISPATCH_LEVEL

or above.

Arguments:

Arg1: 0000000000000001, The system cumulatively spent an extended period of time at

DISPATCH_LEVEL or above. The offending component can usually be

identified with a stack trace.

Arg2: 0000000000001e00, The watchdog period.

Arg3: 0000000000000000

Arg4: 0000000000000000
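
As a side note on Arg2, the value is printed in hex. The small Python sketch below converts it to decimal and, assuming it is a count of clock ticks and that a tick is the common default of 15.625 ms, to seconds. Both of those are assumptions on my part and are not stated in the dump itself.

```python
# Arg2 exactly as printed in the dump above.
WATCHDOG_PERIOD_HEX = "0000000000001e00"

# Assumption: one clock tick is 15.625 ms (a common default interval).
# The real interval on a given system may differ.
ASSUMED_TICK_MS = 15.625


def watchdog_period_ticks(arg_hex):
    """Convert the hex bugcheck argument to a decimal tick count."""
    return int(arg_hex, 16)


def ticks_to_seconds(ticks, tick_ms=ASSUMED_TICK_MS):
    """Convert ticks to seconds under the assumed tick length."""
    return ticks * tick_ms / 1000.0


if __name__ == "__main__":
    ticks = watchdog_period_ticks(WATCHDOG_PERIOD_HEX)
    print(ticks, "ticks, about", ticks_to_seconds(ticks), "seconds (assumed tick length)")
```

If that interpretation holds, the watchdog fired only after a fairly long cumulative stretch at DISPATCH_LEVEL or above, which fits a system being hammered by a stress test.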

CPU_FAMILY: 6

CPU_MODEL: 4f

CPU_STEPPING: 1

CPU_MICROCODE: 6,4f,1,0 (F,M,S,R) SIG: B00001E'00000000 (cache) B00001E'00000000 (init)

DEFAULT_BUCKET_ID: WIN8_DRIVER_FAULT

BUGCHECK_STR: 0x133

PROCESS_NAME: System

CURRENT_IRQL: d

ANALYSIS_SESSION_HOST: PHALFORDPC

ANALYSIS_SESSION_TIME: 01-22-2017 02:23:17.0663

ANALYSIS_VERSION: 10.0.14321.1024 amd64fre

LAST_CONTROL_TRANSFER: from fffff80196bb1000 to fffff80196b5b6f0

STACK_TEXT:

ffffdb805a305d88 fffff80196bb1000 : 0000000000000133 0000000000000001 0000000000001e00 0000000000000000 : nt!KeBugCheckEx

ffffdb805a305d90 fffff80196adc7e8 : 00001b8b037a81b4 00001b8b037a7f29 fffff78000000320 fffff80196b57cc0 : nt! ?? ::FNODOBFM::`string'+0x46470

ffffdb805a305df0 fffff801972344e5 : ffffcb86adf28900 ffffcb86adf28900 0000000000000001 ffffcb86adf28900 : nt!KeClockInterruptNotify+0xb8

ffffdb805a305f40 fffff80196a685d6 : ffffdb8058adfd00 0000000000000000 0000000000000000 0000000000000000 : hal!HalpTimerClockIpiRoutine+0x15

ffffdb805a305f70 fffff80196b5cd6a : ffffdb80631d61d0 ffffd382b0228cf0 00000000000000b8 0000000000000008 : nt!KiCallInterruptServiceRoutine+0x106

ffffdb805a305fb0 fffff80196b5d1b7 : 0000000000000017 ffffdb80631d6278 ffffdb80631d65c0 ffffcb86b65aaa40 : nt!KiInterruptSubDispatchNoLockNoEtw+0xea

ffffdb80631d6150 fffff80196b61271 : ffffcb86bd353c4e fffff80196a7d923 0000000000000200 0000000000001000 : nt!KiInterruptDispatchNoLockNoEtw+0x37

ffffdb80631d62e0 fffff80196a7d923 : 0000000000000200 0000000000001000 ffffcb86bd353776 fffffeff00000000 : nt!ExpInterlockedPopEntrySListEnd+0x11

ffffdb80631d62f0 fffff800fc9aaf13 : ffffd383d366120c ffffcb86bbb68b40 ffffdb80631d6540 ffffcb86b09ada78 : nt!IoAllocateMdl+0x73

ffffdb80631d6340 fffff800fc9a9c61 : 000000003337ddbe ffffd383d378d260 0000000000000000 ffffd383d3661228 : tcpip!TcpSegmentTcbSend+0x223

ffffdb80631d6420 fffff800fc9abc8d : 0000000000000010 fffffff600000007 fffff800fcb54210 000000000028ed91 : tcpip!TcpBeginTcbSend+0x481

ffffdb80631d6710 fffff800fc9a95d5 : 0000000000000000 ffffcb86bbb68b40 0000000000001000 ffffdb80631d6b62 : tcpip!TcpTcbSend+0x25d

ffffdb80631d6ad0 fffff800fc9a929a : 000000000059dc59 ffffdb80631d6d60 ffffdb80631d6d01 0000000000000000 : tcpip!TcpEnqueueTcbSendOlmNotifySendComplete+0xa5

ffffdb80631d6b00 fffff800fc9a8ddb : ffffdb8000001300 ffffeb7f760fa0e8 ffffdb80631d6d01 fffff80196a9f581 : tcpip!TcpEnqueueTcbSend+0x30a

ffffdb80631d6c00 fffff80196a9f505 : ffffdb80631d6d01 ffffdb80631d6d00 ffffcb86b433f010 fffff800fc9a8db0 : tcpip!TcpTlConnectionSendCalloutRoutine+0x2b

ffffdb80631d6c80 fffff800fc9f1aa6 : ffffd383d1225010 0000000000000000 0000000000000000 ffffcb86b08e7530 : nt!KeExpandKernelStackAndCalloutInternal+0x85

ffffdb80631d6cd0 fffff800fc4e1d47 : ffffd383d1225010 ffffdb80631d6df0 0000000000000000 0000000000000000 : tcpip!TcpTlConnectionSend+0x76

ffffdb80631d6d40 fffff800fdb59b02 : ffffd383d1225010 fffff800fc4fd090 ffffdb80631d6df0 ffffcb86b433f010 : afd!AfdWskDispatchInternalDeviceControl+0xf7

ffffdb80631d6db0 fffff800fdb9c3d1 : ffffd383d0bcb720 ffffcb86b433f010 00000000c000020c fffff800fdbfad4e : mrxsmb!SmbWskSend+0x1f2

ffffdb80631d6ea0 fffff800fdb9c2b8 : ffffd383d4293eb8 fffff800fdb5a53b fffff800fdb8f000 0000000000000000 : mrxsmb!RxCeSend+0xe1

ffffdb80631d6ff0 fffff800fdb593dd : 0000000000040070 ffffd383d4293f28 ffffd383d0bcb720 ffffd383d4293eb8 : mrxsmb!VctSend+0x68

ffffdb80631d7040 fffff800fdbfbdc1 : ffffd383d4293d01 ffffd383d40e07f0 ffffd383d4293d28 0000000000000000 : mrxsmb!SmbCseSubmitBufferContext+0x33d

ffffdb80631d7110 fffff800fdb59f46 : ffffd383d4293d00 ffffdb80631d7200 ffffcb8600800000 0000000000000000 : mrxsmb20!Smb2Write_Start+0x1d1

ffffdb80631d71e0 fffff800fdc24126 : ffffdb80631d75a0 ffffd383ccdbd810 ffffcb86b436f7a0 0000000000000004 : mrxsmb!SmbCeInitiateExchange+0x376

ffffdb80631d7540 fffff800fc71755c : ffffd383d4293d28 0000000000000001 ffffd383ccdbd810 fffff80196a2e934 : mrxsmb20!MRxSmb2Write+0x126

ffffdb80631d75a0 fffff800fc72a37d : fffff800fc708000 ffffd383ccdbd810 ffffcb86bb2348b0 fffff800fc708000 : rdbss!RxLowIoSubmit+0x17c

ffffdb80631d7610 fffff800fc6e7a0c : 0000000000000003 0000000000000001 ffffcb86bb2348b0 ffffcb86bb2348b0 : rdbss!RxLowIoWriteShell+0x9d

ffffdb80631d7640 fffff800fc72a289 : 0000000000000000 ffffd383d44b8800 ffffcb86b0b1da40 0000000000000001 : rdbss!RxCommonFileWrite+0x74c

ffffdb80631d7830 fffff800fc6e299b : ffffd383ccdbd810 ffffcb86b48ed080 ffffcb86bb2348b0 0000000000000000 : rdbss!RxCommonWrite+0x59

ffffdb80631d7860 fffff800fc71e6e6 : ffffd383d44b8900 00000000000371fd 0000000000000000 0000000000000002 : rdbss!RxFsdCommonDispatch+0x55b

ffffdb80631d79e0 fffff800fdb990eb : 0000000000000000 fffff80196aa55bc 0000000000000000 ffffcb86ade77350 : rdbss!RxFsdDispatch+0x86

ffffdb80631d7a30 fffff800fb8f72e7 : ffffd383d2921600 0000000000000001 0000000000000102 ffffcb86bb2348b0 : mrxsmb!MRxSmbFsdDispatch+0xeb

ffffdb80631d7a70 fffff800fb8f65c8 : ffffd383d2d9d040 ffffdb80631d7ba0 0000000000040000 ffffd383d2921600 : clusport!ClusPortSendPassthruReadWriteRemote+0x227

ffffdb80631d7ac0 fffff800fb8f4f21 : ffffd383d2921600 ffffd383d2921600 ffffd383d2f0cbb0 ffffd383d2921701 : clusport!ClusPortExecuteIrp+0x118

ffffdb80631d7b70 fffff800fb8f4bfa : 0000000000000001 fffff800fb913a80 0000000000000000 ffffd383d2921760 : clusport!ClusPortIrpWorker+0x51

ffffdb80631d7ba0 fffff80196a13729 : 0000000000000000 ffffd383d44b8800 0000000000000080 fffff800fb8f4ae0 : clusport!CsvFsThreadPoolWorkerRoutine+0x11a

ffffdb80631d7c10 fffff80196b60bb6 : ffffdb8059fc0180 ffffd383d44b8800 fffff80196a136e8 0000000000000000 : nt!PspSystemThreadStartup+0x41

ffffdb80631d7c60 0000000000000000 : ffffdb80631d8000 ffffdb80631d2000 0000000000000000 0000000000000000 : nt!KiStartSystemThread+0x16

Debugging Details:


Page 4200 not present in the dump file. Type ".hh dbgerr004" for details

Page 4200 not present in the dump file. Type ".hh dbgerr004" for details

Page 4200 not present in the dump file. Type ".hh dbgerr004" for details

DUMP_CLASS: 1

DUMP_QUALIFIER: 401

BUILD_VERSION_STRING: 14393.693.amd64fre.rs1_release.161220-1747

SYSTEM_MANUFACTURER: Dell Inc.

SYSTEM_PRODUCT_NAME: PowerEdge R730xd

SYSTEM_SKU: SKU=NotProvided;ModelName=PowerEdge R730xd

BIOS_VENDOR: Dell Inc.

BIOS_VERSION: 2.3.4

BIOS_DATE: 11/08/2016

BASEBOARD_MANUFACTURER: Dell Inc.

BASEBOARD_PRODUCT: 0WCJNT

BASEBOARD_VERSION: A04

DUMP_TYPE: 1

BUGCHECK_P1: 1

BUGCHECK_P2: 1e00

BUGCHECK_P3: 0

BUGCHECK_P4: 0

DPC_TIMEOUT_TYPE: DPC_QUEUE_EXECUTION_TIMEOUT_EXCEEDED

CPU_COUNT: 38

CPU_MHZ: 960

CPU_VENDOR: GenuineIntel

STACK_COMMAND: kb

THREAD_SHA1_HASH_MOD_FUNC: 867e5a968da76728f7672cda902ce03b0094126c

THREAD_SHA1_HASH_MOD_FUNC_OFFSET: 0ca44a1f529d8537ce142132cfbf564122925c1a

THREAD_SHA1_HASH_MOD: 5f7fb32acfabea61ff02c84c0c3baa1fbe4b0b8d

FOLLOWUP_IP:

mrxsmb!SmbWskSend+1f2

fffff800`fdb59b02 8bd8 mov ebx,eax

FAULT_INSTR_CODE: 8b49d88b

SYMBOL_STACK_INDEX: 12

SYMBOL_NAME: mrxsmb!SmbWskSend+1f2

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: mrxsmb

IMAGE_NAME: mrxsmb.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 57cf9c38

BUCKET_ID_FUNC_OFFSET: 1f2

FAILURE_BUCKET_ID: 0x133_ISR_mrxsmb!SmbWskSend

BUCKET_ID: 0x133_ISR_mrxsmb!SmbWskSend

PRIMARY_PROBLEM_CLASS: 0x133_ISR_mrxsmb!SmbWskSend

TARGET_TIME: 2017-01-20T16:16:45.000Z

OSBUILD: 14393

OSSERVICEPACK: 0

SERVICEPACK_NUMBER: 0

OS_REVISION: 0

SUITE_MASK: 400

PRODUCT_TYPE: 3

OSPLATFORM_TYPE: x64

OSNAME: Windows 10

OSEDITION: Windows 10 Server TerminalServer DataCenter SingleUserTS

OS_LOCALE:

USER_LCID: 0

OSBUILD_TIMESTAMP: 2016-12-21 06:50:57

BUILDDATESTAMP_STR: 161220-1747

BUILDLAB_STR: rs1_release

BUILDOSVER_STR: 10.0.14393.693.amd64fre.rs1_release.161220-1747

ANALYSIS_SESSION_ELAPSED_TIME: e45

ANALYSIS_SOURCE: KM

FAILURE_ID_HASH_STRING: km:0x133_isr_mrxsmb!smbwsksend

FAILURE_ID_HASH: {f4239d18-f80c-7c1f-6289-34a57aa17a7d}

Followup: MachineOwner
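
The analysis names `mrxsmb.sys`, but the stack in this second dump passes through several drivers before it reaches the interrupt. A rough Python sketch for listing the modules in a pasted `STACK_TEXT` block, from the most recent frame downwards, is shown here. The parsing rule (take whatever follows the last " : " and cut it at `!` or `+`) is my own assumption about the layout of these lines, not an official format description.

```python
import re


def modules_in_stack(stack_text):
    """Return the distinct module names in a STACK_TEXT block, in frame order."""
    seen = []
    for line in stack_text.splitlines():
        if " : " not in line:
            continue  # not a stack frame line
        # The symbol is the last column: "module!function+0xNN" or "module+0xNN".
        symbol = line.rsplit(" : ", 1)[-1].strip()
        module = re.split(r"[!+]", symbol, maxsplit=1)[0].strip()
        if module and module not in seen:
            seen.append(module)
    return seen


# A handful of frames copied from the stack above (most recent first).
SAMPLE_STACK = """\
ffffdb805a305d88 fffff80196bb1000 : 0000000000000133 0000000000000001 0000000000001e00 0000000000000000 : nt!KeBugCheckEx
ffffdb80631d6340 fffff800fc9a9c61 : 000000003337ddbe ffffd383d378d260 0000000000000000 ffffd383d3661228 : tcpip!TcpSegmentTcbSend+0x223
ffffdb80631d6d40 fffff800fdb59b02 : ffffd383d1225010 fffff800fc4fd090 ffffdb80631d6df0 ffffcb86b433f010 : afd!AfdWskDispatchInternalDeviceControl+0xf7
ffffdb80631d6db0 fffff800fdb9c3d1 : ffffd383d0bcb720 ffffcb86b433f010 00000000c000020c fffff800fdbfad4e : mrxsmb!SmbWskSend+0x1f2
ffffdb80631d7110 fffff800fdb59f46 : ffffd383d4293d00 ffffdb80631d7200 ffffcb8600800000 0000000000000000 : mrxsmb20!Smb2Write_Start+0x1d1
ffffdb80631d75a0 fffff800fc72a37d : fffff800fc708000 ffffd383ccdbd810 ffffcb86bb2348b0 fffff800fc708000 : rdbss!RxLowIoSubmit+0x17c
ffffdb80631d7ac0 fffff800fb8f4f21 : ffffd383d2921600 ffffd383d2921600 ffffd383d2f0cbb0 ffffd383d2921701 : clusport!ClusPortExecuteIrp+0x118
"""

if __name__ == "__main__":
    print(" -> ".join(modules_in_stack(SAMPLE_STACK)))
```

Read from bottom to top, the list traces a cluster write request going out through the SMB redirector and into the TCP/IP stack, which is the path that was running when the watchdog fired.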


Can somebody at Mellanox please help? I’m tempted to buy two off-the-shelf Mellanox cards just to rule Dell’s firmware out of the equation.

Hi,

I have been having this same issue and have resolved it with the help of Mellanox and by pounding on Dell ProSupport.

The problem is that, according to Mellanox, the supported configuration for the Mellanox ConnectX-3 is driver version 5.25 with firmware 2.36.5150 or higher. Dell has the driver, but their firmware version is 2.36.5080, which is nearly a year old.
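
To make the version gap concrete, here is a minimal Python sketch that compares dotted firmware versions numerically rather than as strings. The two version numbers are the ones quoted in this post; the function names are hypothetical, and how you read the installed version off your own card is left out.

```python
def parse_version(text):
    """Turn a dotted version such as "2.36.5150" into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


# Values quoted in this post.
MIN_SUPPORTED_FW = "2.36.5150"  # minimum firmware for driver 5.25, per Mellanox
DELL_FW = "2.36.5080"           # firmware Dell was shipping at the time

if __name__ == "__main__":
    if meets_minimum(DELL_FW, MIN_SUPPORTED_FW):
        print("Firmware meets the minimum for this driver.")
    else:
        print(f"Firmware {DELL_FW} is older than the minimum {MIN_SUPPORTED_FW}.")
```

Comparing integer tuples avoids the trap of comparing the strings character by character, which gives wrong answers once the parts have different numbers of digits.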

Mellanox wants you to talk to Dell support (which is right) to fix this. If you have Dell ProSupport, contact them; if not, go to the OEM Firmware Downloads page (Dell card = MCX312A-XCB - MT_1080120023) and follow the instructions to create your own firmware image.

Kind Regards,

Thorir

Hi,

Please contact Mellanox support about this.

You can email support@mellanox.com.

Ophir.

Sorry I forgot to mention…

You said you sorted it. How did you resolve it? Was it a BIOS setting or did you manage to make your own firmware? If that’s the case, maybe my more recent crashes are still related…

Many thanks

Hi Thorir,

I managed to fix the issue. I had a support case open with Dell ProSupport for about 3 weeks. They too had trouble replicating the fault. I suggested the firmware was out of sync with the drivers they’d released. Anyway, they said to try BIOS settings. I then spent the next 3 weeks reinstalling Windows over and over, because the BSODs would occasionally corrupt the Windows install.

In the end I was able to resolve the issue. There’s a BIOS setting called IO Non Posted Prefetching, which was enabled by default when the servers were delivered. I disabled this setting and was able to run VMFleet for a few days, hammering the system with no crashes. I fed this info back to Dell, who then closed the case. They did acknowledge the firmware is a problem but said they can’t do anything about it other than raise a case for it to be updated. We just have to wait.

I think I’d buy Mellanox cards directly from Mellanox in future. I can’t see a way of upgrading the firmware, as the firmware tools don’t recognise the cards at all. There’s no way to discover them because Dell have changed the identifiers the MFT tools look for. Mellanox was very unhelpful when I tried to raise a case with them; I was told I don’t have support, and they did not want to know unless I paid more. I was pretty annoyed at the time. Dell won’t give me a date for new firmware, or even say whether it’s on the cards. Anyway, I hope this helps others.

May I add: the servers ran fine for about a month, and now we’re experiencing similar crashes again (though not as often). This time Microsoft have a case open, as I believe the Mellanox side of things is sorted. Who knows, Microsoft might turn around and say there’s a firmware + driver mismatch on the Mellanox cards. It’s been a nightmare.

Anyway, I hope that BIOS setting helps others.