We have a 3-node cluster, and all 3 nodes crashed and generated dump files. Looking at the crash errors, we found that all 3 nodes crashed with the same error code.
vGPU was enabled on all 3 nodes.
These are the crash dump details:
VIDEO_TDR_FAILURE (116)
Attempt to reset the display driver and recover from timeout failed.
Arguments:
Arg1: ffff8d03a76a5010, Optional pointer to internal TDR recovery context (TDR_RECOVERY_CONTEXT).
Arg2: fffff80d3a752678, The pointer into responsible device driver module (e.g. owner tag).
Arg3: ffffffffc000009a, Optional error code (NTSTATUS) of the last failed operation.
Arg4: 0000000000000004, Optional internal context dependent data.
Debugging Details:
TRIAGER: Could not open triage file : e:\dump_analysis\program\triage\modclass.ini, error 2
FAULTING_IP:
nvlddmkm+982678
fffff80d3a752678 48ff25a145e7ff jmp qword ptr [nvlddmkm+0x7f6c20 (fffff80d3a5c6c20)]
DEFAULT_BUCKET_ID: GRAPHICS_DRIVER_TDR_FAULT
BUGCHECK_STR: 0x116
Child-SP RetAddr Call Site
00 ffff8a00aaa17a58 fffff80644b3a298 nt!KeBugCheckEx
01 ffff8a00aaa17a60 fffff80644b1d13f dxgkrnl!TdrBugcheckOnTimeout+0xec
02 ffff8a00aaa17aa0 fffff80644b1a2ef dxgkrnl!ADAPTER_RENDER::Reset+0x153
03 ffff8a00aaa17ad0 fffff80644b39a85 dxgkrnl!DXGADAPTER::Reset+0x307
04 ffff8a00aaa17b20 fffff80644b39bc7 dxgkrnl!TdrResetFromTimeout+0x15
05 ffff8a00aaa17b50 fffff802e8ae2599 dxgkrnl!TdrResetFromTimeoutWorkItem+0x27
06 ffff8a00aaa17b80 fffff802e8b32965 nt!ExpWorkerThread+0xe9
07 ffff8a00aaa17c10 fffff802e8bd0e26 nt!PspSystemThreadStartup+0x41
08 ffff8a00aaa17c60 0000000000000000 nt!KiStartSystemThread+0x16
- All 3 crash dumps point to the same stack and register values.
FAULTING_IP:
nvlddmkm+982678
fffff80d3a752678 48ff25a145e7ff jmp qword ptr [nvlddmkm+0x7f6c20 (fffff80d3a5c6c20)]
- The WinDbg analysis points to VIDEO_TDR_FAILURE (116).
37: kd> !analyze -v
*******************************************************************************
* Bugcheck Analysis *
*******************************************************************************
VIDEO_TDR_FAILURE (116)
Attempt to reset the display driver and recover from timeout failed.
Arguments:
Arg1: ffffdd84719ea010, Optional pointer to internal TDR recovery context (TDR_RECOVERY_CONTEXT).
Arg2: fffff80fe60e2678, The pointer into responsible device driver module (e.g. owner tag).
Arg3: ffffffffc000009a, Optional error code (NTSTATUS) of the last failed operation.
Arg4: 0000000000000004, Optional internal context dependent data.
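Arg3 is identical in both dumps shown above (ffffffffc000009a). Its low 32 bits are the NTSTATUS value 0xC000009A, which corresponds to STATUS_INSUFFICIENT_RESOURCES. Below is a minimal Python sketch of that decoding. The helper name decode_ntstatus and the one-entry lookup table are illustrative assumptions, not part of WinDbg or any Windows tooling:

```python
# Minimal sketch: decode the NTSTATUS carried in Arg3 of the bug check.
# KNOWN_STATUS is a tiny hand-written table for illustration, not a full list.
KNOWN_STATUS = {
    0xC000009A: "STATUS_INSUFFICIENT_RESOURCES",
}

SEVERITY = {0: "success", 1: "informational", 2: "warning", 3: "error"}


def decode_ntstatus(arg: int) -> dict:
    """Split a (possibly sign-extended) bug check argument into NTSTATUS fields."""
    status = arg & 0xFFFFFFFF  # drop the 64-bit sign extension
    return {
        "status": f"0x{status:08X}",
        "severity": SEVERITY[status >> 30],       # bits 31-30
        "customer": bool(status & 0x20000000),    # bit 29
        "facility": (status >> 16) & 0x0FFF,      # bits 27-16
        "code": status & 0xFFFF,                  # bits 15-0
        "name": KNOWN_STATUS.get(status, "unknown"),
    }


info = decode_ntstatus(0xFFFFFFFFC000009A)
print(info)
```

An "insufficient resources" status on its own does not say which resource ran out, so it is only a hint about where to look during the failed driver reset.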
Microsoft's documentation for this bug check:
https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/bug-check-0x116---video-tdr-error
Its Resolution section lists the following possible causes:
*Over-clocked components, such as the motherboard
*Incorrect component compatibility and settings (especially memory configuration and timings)
*Defective parts (memory modules, motherboards, etc.)
*Insufficient system power
*Insufficient system cooling
We are using HP servers with the following specification:
HP ProLiant DL380 Gen9, ROM version P89 v2.30 (09/13/2016).
Moreover, when we tried to upgrade the driver to the latest version, 385.54 (release date 25.9.2017), we were unable to run virtual GPU (RemoteFX) because the GPU does not show up in the Hyper-V settings. Once we reverted to the old driver, 376.84, we could see the physical GPUs under the Hyper-V settings again.
Can anyone tell us if they have experienced the same issue with this driver version?