HP servers crashed with Tesla M60, driver 376.84 for Windows Server 2016 (GPU)

We have a 3-node cluster, and all 3 nodes crashed and generated dump files. The crash dumps show that all 3 nodes crashed with the same error code.
vGPU was enabled on all 3 nodes.

These are the crash dump details:

VIDEO_TDR_FAILURE (116)
Attempt to reset the display driver and recover from timeout failed.
Arguments:
Arg1: ffff8d03a76a5010, Optional pointer to internal TDR recovery context (TDR_RECOVERY_CONTEXT).
Arg2: fffff80d3a752678, The pointer into responsible device driver module (e.g. owner tag).
Arg3: ffffffffc000009a, Optional error code (NTSTATUS) of the last failed operation.
Arg4: 0000000000000004, Optional internal context dependent data.
Debugging Details:

TRIAGER: Could not open triage file : e:\dump_analysis\program\triage\modclass.ini, error 2
FAULTING_IP:
nvlddmkm+982678
fffff80d3a752678 48ff25a145e7ff jmp qword ptr [nvlddmkm+0x7f6c20 (fffff80d3a5c6c20)]
DEFAULT_BUCKET_ID: GRAPHICS_DRIVER_TDR_FAULT
BUGCHECK_STR: 0x116

Child-SP RetAddr Call Site
00 ffff8a00aaa17a58 fffff80644b3a298 nt!KeBugCheckEx
01 ffff8a00aaa17a60 fffff80644b1d13f dxgkrnl!TdrBugcheckOnTimeout+0xec
02 ffff8a00aaa17aa0 fffff80644b1a2ef dxgkrnl!ADAPTER_RENDER::Reset+0x153
03 ffff8a00aaa17ad0 fffff80644b39a85 dxgkrnl!DXGADAPTER::Reset+0x307
04 ffff8a00aaa17b20 fffff80644b39bc7 dxgkrnl!TdrResetFromTimeout+0x15
05 ffff8a00aaa17b50 fffff802e8ae2599 dxgkrnl!TdrResetFromTimeoutWorkItem+0x27
06 ffff8a00aaa17b80 fffff802e8b32965 nt!ExpWorkerThread+0xe9
07 ffff8a00aaa17c10 fffff802e8bd0e26 nt!PspSystemThreadStartup+0x41
08 ffff8a00aaa17c60 0000000000000000 nt!KiStartSystemThread+0x16
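
For reference, Arg3 can be decoded by hand. The following is a minimal Python sketch, assuming Arg3 is a 32-bit NTSTATUS sign-extended to 64 bits, as it is printed above. The helper name decode_ntstatus and the one-entry lookup table are illustrative only; the name for 0xC000009A is taken from ntstatus.h.

```python
# Arg3 as printed by !analyze -v (32-bit NTSTATUS, sign-extended to 64 bits).
ARG3 = 0xFFFFFFFFC000009A

# Illustrative lookup table with only the value seen in this dump (from ntstatus.h).
KNOWN_NTSTATUS = {0xC000009A: "STATUS_INSUFFICIENT_RESOURCES"}


def decode_ntstatus(arg: int) -> dict:
    """Split a (possibly sign-extended) NTSTATUS into its documented bit fields."""
    status = arg & 0xFFFFFFFF            # keep the low 32 bits
    return {
        "status": status,
        "severity": status >> 30,         # 0=success, 1=informational, 2=warning, 3=error
        "facility": (status >> 16) & 0xFFF,
        "code": status & 0xFFFF,
        "name": KNOWN_NTSTATUS.get(status, "unknown (not in this small table)"),
    }


if __name__ == "__main__":
    info = decode_ntstatus(ARG3)
    print(f"NTSTATUS 0x{info['status']:08X}: {info['name']}")
    print(f"severity={info['severity']} facility={info['facility']} code=0x{info['code']:X}")
```

If this reading is right, the last operation failed for lack of resources during the reset attempt, which on its own does not say whether the cause is the driver, the configuration, or the hardware.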


  1. All 3 crash dumps point to the same stack and register values.
    FAULTING_IP:
    nvlddmkm+982678
    fffff80d3a752678 48ff25a145e7ff jmp qword ptr [nvlddmkm+0x7f6c20 (fffff80d3a5c6c20)]
  2. The WinDbg analysis reports VIDEO_TDR_FAILURE (116).
    37: kd> !analyze -v
    *******************************************************************************
    * Bugcheck Analysis *
    *******************************************************************************
    VIDEO_TDR_FAILURE (116)
    Attempt to reset the display driver and recover from timeout failed.
    Arguments:
    Arg1: ffffdd84719ea010, Optional pointer to internal TDR recovery context (TDR_RECOVERY_CONTEXT).
    Arg2: fffff80fe60e2678, The pointer into responsible device driver module (e.g. owner tag).
    Arg3: ffffffffc000009a, Optional error code (NTSTATUS) of the last failed operation.
    Arg4: 0000000000000004, Optional internal context dependent data.
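
The "same location" observation can be cross-checked from the addresses alone. This is a rough Python sketch, assuming the offset nvlddmkm+0x982678 from FAULTING_IP applies to both Arg2 values quoted above; the variable and function names are illustrative. It subtracts the offset from each Arg2 and checks that the result is page-aligned, which is what a module load address would look like. A page-aligned result is consistent with both dumps faulting at the same driver offset, but it is not proof.

```python
# Arg2 values copied from the two !analyze -v outputs in this post.
ARG2_BY_DUMP = {
    "dump A": 0xFFFFF80D3A752678,
    "dump B": 0xFFFFF80FE60E2678,
}

# Offset into nvlddmkm taken from FAULTING_IP (nvlddmkm+982678).
FAULT_OFFSET = 0x982678

# From the first dump: jmp target address and its offset into nvlddmkm.
JMP_TARGET = 0xFFFFF80D3A5C6C20
JMP_TARGET_OFFSET = 0x7F6C20

PAGE_SIZE = 0x1000


def implied_base(address: int, offset: int) -> int:
    """Module base address implied by an address and its offset into the module."""
    return address - offset


def is_page_aligned(address: int) -> bool:
    return address % PAGE_SIZE == 0


if __name__ == "__main__":
    for name, arg2 in ARG2_BY_DUMP.items():
        base = implied_base(arg2, FAULT_OFFSET)
        print(f"{name}: implied nvlddmkm base {base:#x}, page-aligned: {is_page_aligned(base)}")

    # In the first dump, the jmp target should imply the same base as Arg2.
    same = implied_base(JMP_TARGET, JMP_TARGET_OFFSET) == implied_base(
        ARG2_BY_DUMP["dump A"], FAULT_OFFSET
    )
    print(f"dump A base consistent with jmp target: {same}")
```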

According to the Microsoft documentation, this can be caused by any of the following (refer to the Resolution section):
https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/bug-check-0x116---video-tdr-error
* Over-clocked components, such as the motherboard
* Incorrect component compatibility and settings (especially memory configuration and timings)
* Defective parts (memory modules, motherboards, etc.)
* Insufficient system power
* Insufficient system cooling

We are using HP servers with the following specification:
HP ProLiant DL380 Gen9, ROM version P89 v2.30 (09/13/2016).

Moreover, when we tried to upgrade the driver to the latest version, 385.54 (release date 25.9.2017), we were unable to run virtual GPU (RemoteFX) because the GPU does not show up in the Hyper-V settings. Once we reverted to the old driver, 376.84, we could see the physical GPUs under the Hyper-V settings again.

Can anyone tell us whether they have experienced the same issue with this driver version?

Hi Venky,

As you are running RemoteFX on Tesla M60, I assume you have the required vPC licenses, so please open a support ticket with ESP. You should run the supported driver from the GRID 5.0 package (R384 branch).

Regards

Simon

Hello Simon,

Thanks for getting back on this.
We don't use any licenses for GRID software or anything like that, as we just use the driver from nvidia.com.
We have been using RemoteFX with a previous version of the Tesla M60 driver. Lately we observed some crashes, decided to update the driver, and then ran into the issue that we could no longer use the vGPU.

Regards,
Venky

Hi Venky,

Please check our licensing/EULA, as you need to buy licenses for your deployment with RemoteFX and Tesla M60!
See here:

And by the way, it doesn't matter which driver you're using.

Regards

Simon