Windows GPU Scheduler + GeForce Driver occasional bugcheck on startup

Looking for the appropriate channel to discuss this, gain insight, and hopefully resolve it. I have opened a conversation with the DirectX team to involve their colleagues as well. The machine is atypical: dual-socket Xeon 6154 w/ 192GB ECC and a 3090. This is new behavior that started recently, in mid/late February 2021.

Would be helpful if Nvidia recommends disabling the GPU scheduler. That said, I’d prefer to surface whatever issues are present and get them fixed for all.

Regards,
Jamie

Debugging Details:

KEY_VALUES_STRING: 1

Key : Analysis.CPU.Sec
Value: 4

Key : Analysis.DebugAnalysisProvider.CPP
Value: Create: 8007007e on HULK

Key : Analysis.DebugData
Value: CreateObject

Key : Analysis.DebugModel
Value: CreateObject

Key : Analysis.Elapsed.Sec
Value: 10

Key : Analysis.Memory.CommitPeak.Mb
Value: 186

Key : Analysis.System
Value: CreateObject

BUGCHECK_CODE: 119
BUGCHECK_P1: 2
BUGCHECK_P2: ffffffffc000000d
BUGCHECK_P3: fffff08a954ff220
BUGCHECK_P4: ffff9406c791b590
BLACKBOXBSD: 1 (!blackboxbsd)
BLACKBOXNTFS: 1 (!blackboxntfs)
BLACKBOXPNP: 1 (!blackboxpnp)
BLACKBOXWINLOGON: 1

PROCESS_NAME: System

STACK_TEXT:
fffff08a954ff148 fffff8062d0b3ad0 : 0000000000000119 0000000000000002 ffffffffc000000d fffff08a954ff220 : nt!KeBugCheckEx
fffff08a954ff150 fffff8062f8bb3cb : 0000000000000000 ffff9406c7918000 ffff9406cf7db620 ffff9406c7918466 : watchdog!WdLogEvent5_WdCriticalError+0xe0
fffff08a954ff190 fffff8062f924b6d : ffff940600000000 ffff9406c791b590 ffff9406cb029000 ffff9406c6b44010 : dxgmms2!VidSchiSendToExecutionQueue+0x1306b
fffff08a954ff2c0 fffff8062f92cd1a : ffff9406c6b44010 ffff9406cb029000 0000000000000000 ffff9406c791f620 : dxgmms2!VidSchiSubmitPagingCommand+0x2ed
fffff08a954ff440 fffff8062f92cb8a : ffff9406cb029400 fffff8062f92cac0 ffff9406cb029000 ffffb2804f105100 : dxgmms2!VidSchiRun_PriorityTable+0x17a
fffff08a954ff490 fffff80610717e55 : ffff9406cf9f21c0 fffff80600000001 ffff9406cb029000 001fa47fb19bbfff : dxgmms2!VidSchiWorkerThread+0xca
fffff08a954ff4d0 fffff806107fd278 : ffffb2804f105180 ffff9406cf9f21c0 fffff80610717e00 43bbd7882a38382b : nt!PspSystemThreadStartup+0x55
fffff08a954ff520 0000000000000000 : fffff08a95500000 fffff08a954f9000 0000000000000000 0000000000000000 : nt!KiStartSystemThread+0x28

SYMBOL_NAME: dxgmms2!VidSchiSendToExecutionQueue+1306b
MODULE_NAME: dxgmms2
IMAGE_NAME: dxgmms2.sys
IMAGE_VERSION: 10.0.19041.844
STACK_COMMAND: .thread ; .cxr ; kb
BUCKET_ID_FUNC_OFFSET: 1306b
FAILURE_BUCKET_ID: 0x119_2_DRIVER_FAILED_SUBMIT_COMMAND_dxgmms2!VidSchiSendToExecutionQueue
OS_VERSION: 10.0.19041.1
BUILDLAB_STR: vb_release
OSPLATFORM_TYPE: x64
OSNAME: Windows 10
FAILURE_ID_HASH: {9a11bf9c-270e-962e-7a82-3efdab93c10e}
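
For readers skimming the dump, here is a minimal sketch of how the bugcheck fields above can be read. The lookup tables are deliberately partial and only cover the values seen in this dump; `decode_bugcheck` is a hypothetical helper, not part of any debugger tooling. The subtype text is taken from the FAILURE_BUCKET_ID above.

```python
# Partial lookup tables: only the values that appear in this dump.
BUGCHECK_NAMES = {0x119: "VIDEO_SCHEDULER_INTERNAL_ERROR"}
SUBTYPES_0X119 = {0x2: "DRIVER_FAILED_SUBMIT_COMMAND"}
NTSTATUS_NAMES = {0xC000000D: "STATUS_INVALID_PARAMETER"}


def decode_bugcheck(code_hex, p1_hex, p2_hex):
    """Translate BUGCHECK_CODE, BUGCHECK_P1 and BUGCHECK_P2 into names."""
    code = int(code_hex, 16)
    subtype = int(p1_hex, 16)
    # The debugger prints the 32-bit NTSTATUS sign-extended to 64 bits,
    # so keep only the low 32 bits before looking it up.
    status = int(p2_hex, 16) & 0xFFFFFFFF
    return {
        "bugcheck": BUGCHECK_NAMES.get(code, hex(code)),
        "subtype": SUBTYPES_0X119.get(subtype, hex(subtype)),
        "status": NTSTATUS_NAMES.get(status, hex(status)),
    }


if __name__ == "__main__":
    print(decode_bugcheck("119", "2", "ffffffffc000000d"))
```

Read this way, the scheduler (dxgmms2) reports that the display driver rejected a submitted command with an invalid-parameter status, which is why both the OS and the driver side are of interest.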

16: kd> lmvm dxgmms2
Browse full module list
start end module name
fffff8062f8a0000 fffff8062f982000 dxgmms2 (pdb symbols) C:\symbols\dxgmms2.pdb\3799A71AD52A8029174C94D9FA8429001\dxgmms2.pdb
Loaded symbol image file: dxgmms2.sys
Image path: \SystemRoot\System32\drivers\dxgmms2.sys
Image name: dxgmms2.sys
Browse all global symbols functions data
Image was built with /Brepro flag.
Timestamp: 526AED69 (This is a reproducible build file hash, not a timestamp)
CheckSum: 000E0FA7
ImageSize: 000E2000
File version: 10.0.19041.844
Product version: 10.0.19041.844
File flags: 0 (Mask 3F)
File OS: 40004 NT Win32
File type: 3.7 Driver
File date: 00000000.00000000
Translations: 0409.04b0
Information from resource tables:
CompanyName: Microsoft Corporation
ProductName: Microsoft® Windows® Operating System
InternalName: dxgmms2.sys
OriginalFilename: dxgmms2.sys
ProductVersion: 10.0.19041.844
FileVersion: 10.0.19041.844 (WinBuild.160101.0800)
FileDescription: DirectX Graphics MMS
LegalCopyright: © Microsoft Corporation. All rights reserved.

According to MS, their side was possibly fixed ‘some time ago’ and what’s needed is updated Nvidia drivers. Since the crash is occurring with the latest released Nvidia drivers (461.72 at the time of this writing), hopefully a future revision will include the fix.

Regards,
Jamie

Hi,

I see the exact same thing here. A new machine (Ryzen 5950x, RTX 3090, 128GB of RAM).
When opening Netflix, the whole computer would practically freeze, followed by that exact bugcheck, with the same call stack and the same first two arguments (BUGCHECK_P1, BUGCHECK_P2). I do have GPU scheduling enabled. My screen is configured to run at 98Hz. Windows build is 20H2 (19042.870), NVIDIA driver version is 461.92. Note that this does not happen every time.

Further, this has recently been documented here as well:

Google Translate (the post being in Turkish) yields that the OP there “opened Netflix and clicked on a movie available in 4k […]. The movie stayed on the black screen for a while and then the mouse started to stall. Finally, it reset with a blue screen”, which is the same exact thing I had. The bugcheck, first two arguments and call stack seem to coincide. The driver version is 461.92, too.
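
Since several reports here are being matched by call stack, the following is a rough sketch of how two STACK_TEXT sections could be compared while ignoring addresses and offsets, which differ between machines and driver builds. `stack_signature` and `same_crash` are hypothetical helper names, and the comparison depth of 4 frames is an arbitrary choice.

```python
import re

# Matches the trailing "module!Function+0xoffset" part of a STACK_TEXT line.
FRAME_RE = re.compile(r"(\w+)!(\w+)(?:\+0x[0-9a-fA-F]+)?\s*$")


def stack_signature(stack_text):
    """Return module!Function names in order, dropping addresses and offsets."""
    frames = []
    for line in stack_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append(f"{match.group(1)}!{match.group(2)}")
    return frames


def same_crash(stack_a, stack_b, depth=4):
    """Treat two dumps as the same crash if their top `depth` frames agree."""
    return stack_signature(stack_a)[:depth] == stack_signature(stack_b)[:depth]
```

Pasting the STACK_TEXT from two dumps into `same_crash` gives a quick yes/no before comparing the bugcheck arguments by hand.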

The same bugcheck (along with first two arguments and call stack) was reported here, too:

Could anyone at NVIDIA have a look at this?

Hey, @_Jamie, were you able to get to the bottom of this?
Thanks!

Have not experienced this bugcheck for at least a week.

OS.Build 19042.906
GeForce Driver: 465.89

The bugcheck occurred intermittently during boot, and only on 3000-series cards. A colleague of mine experienced it once on his 3080. The rest of the team uses 2080s and has never observed it.

For me, this bugcheck (the exact same call stack and the exact same subtype and NTSTATUS) has been tied to launching the Netflix app. None of my other machines (one with two TITANs in SLI and the other with a 2080Ti) have exhibited this issue. Similarly, though, it’s specific to the 3090 machine.

I’ve been exchanging e-mails with driverfeedback@nvidia.com about this. They are responsive, so perhaps you could write to them too.

P.S. In your initial post you said “Would be helpful if Nvidia recommends disabling the GPU scheduler”. Did these crashes stop once you disabled “Hardware accelerated GPU scheduling”?

Thanks!

I did not disable GPU scheduling in Windows 10. The bugcheck happened on maybe 1 in 5 boots, always cold boots. After a bugcheck, the following power cycle, reboot, etc. never bugchecked a second time.
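
For anyone who wants to check the setting on their own machine without going through the Settings app, here is a small sketch. It assumes the “Hardware accelerated GPU scheduling” toggle is stored in the `HwSchMode` registry value under `GraphicsDrivers`, with 2 meaning enabled and 1 meaning disabled; that mapping is an assumption based on commonly reported behavior, so treat it as unverified. The helper names are hypothetical.

```python
# Assumed location of the setting (not confirmed anywhere in this thread).
HWSCH_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
HWSCH_VALUE = "HwSchMode"


def describe_hwsch_mode(value):
    """Map the raw registry value to a readable state (assumed: 1 = off, 2 = on)."""
    if value is None:
        return "not set (OS/driver default)"
    return {1: "disabled", 2: "enabled"}.get(value, f"unknown ({value})")


def read_hwsch_mode():
    """Windows only: return the raw HwSchMode value, or None if it is absent."""
    import winreg  # imported here so the module still loads on other platforms

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, HWSCH_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, HWSCH_VALUE)
            return value
    except FileNotFoundError:
        return None
```

On a Windows machine, `describe_hwsch_mode(read_hwsch_mode())` reports the current state. Recording it alongside each bugcheck would help tell whether the scheduler setting correlates with the crash at all.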

Wild ass guess is an interplay with how the machine was shut down: specifically, apps using video encode/decode that were not closed prior to shutdown would get relaunched early on the next startup. The interaction between the driver and the power-saving features of displays (perhaps DP or HDMI specific), etc., might also be problematic. I have observed specific vendors’ display power-saving features cause problems like resetting the Windows layout after sleep; disabling these features solves that. The machine I’m using has none of those displays, so I’d expect the former (early startup of video encode/decode apps). I don’t have good insight here, so I’m only guessing.