Performance drop + crashes w/ Kernel Security Check Failure after driver update, Windows 10

Dear community,

I’m working on a CUDA/.NET project: I want to call CUDA functions from a .NET environment for CAD tools.

My VS2017 solution file therefore contains three projects:

core.dll - main core in CUDA C/C++, target platform version 10.0.16299.0, platform toolset VS2015(v140)
bridge.dll - C++/CLI layer to expose core.dll functions to .NET, same target platform version, platform toolset VS2017(v141), .NET Framework 4.5
use.dll - C# library depending on bridge.dll as the primary access point, my CAD application only calls use.dll functions

I only build for 64-bit at the moment.

System info:
Windows 10 Education
Version 1809
OS Build 17744.1003

Intel Core i7-7700K @ 4.20GHz
32 GB RAM

GPU: Nvidia GTX 1080, CUDA toolset 9.2

I have the following problem:

Until about two weeks ago everything was working very well and my code ran lightning fast. I was using Nvidia driver 388.59. Then suddenly (I think after a Windows update, but I’m not 100% sure) performance dropped significantly: my code was running 200x slower than before. I decided to update the driver to 397.64 and got my lightning-fast results back. However, the system is now very unstable and I regularly get the Windows green screen of death with the exception “Kernel Security Check Failure”.

When I downgrade my driver again to 388.59 the system is stable again but painfully slow.

A few remarks:
Before the problem occurred, I was surprised to find that my code ran much faster when I called core.dll functions from a .NET application (via bridge.dll) than when I called them from a C++ application directly referencing core.dll. I assumed the reason was that, when the Common Language Runtime (CLR) is used, memory pointers are passed directly to the application, whereas C++ applications perform full data copies at each iteration. As soon as my problem appeared, this speed advantage of .NET seemed to be gone, unless I use a newer Nvidia driver, which is unstable.

My question:
Were there any major changes in how memory is exposed to Windows between driver versions 388.59 and 397.64?
How exactly is memory exposed when MSIL comes into play (I hope that’s not off-topic here)?

Thanks a lot in advance
Ben

UPDATE:
The crashes might be related to system threads. If I try to retrieve data from the GPU on a Windows thread other than the main thread (using cudaDeviceSynchronize() and cudaStreamSynchronize() as fences), crashes seem to appear more frequently.

Any information regarding the underlying memory transfer related to C++/CLI and calling from different system threads would be helpful.
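
One way to check whether cross-thread access is really the trigger is to route every GPU call through a single dedicated thread and compare how often the crash occurs. The following is only a minimal C++ sketch of that pattern, not code from the project above: `GpuWorker` and `run` are hypothetical names, and the tasks passed in would wrap the actual calls (for example cudaStreamSynchronize() followed by the device-to-host copy).

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical helper: owns one thread and executes every submitted task on it,
// so all GPU calls wrapped in tasks are issued from the same OS thread.
class GpuWorker {
public:
    GpuWorker() : worker_([this] { loop(); }) {}

    ~GpuWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_one();
        worker_.join();  // remaining queued tasks are drained before the thread exits
    }

    GpuWorker(const GpuWorker&) = delete;
    GpuWorker& operator=(const GpuWorker&) = delete;

    // Queue a task and return a future that yields its result (or its exception).
    template <typename F>
    auto run(F task) -> std::future<decltype(task())> {
        using R = decltype(task());
        auto packaged = std::make_shared<std::packaged_task<R()>>(std::move(task));
        std::future<R> result = packaged->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push([packaged] { (*packaged)(); });
        }
        cv_.notify_one();
        return result;
    }

private:
    void loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // only reached once stopping_ is set
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // executed outside the lock so callers can keep queueing
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so the members above exist before it starts
};
```

Any thread can then call `gpu.run([&] { /* synchronize and copy here */ return status; }).get()` and block until the result is available, while the wrapped calls themselves always execute on the worker thread.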

I would suggest trying driver 399.07 or newer.

Hey Bob,

I did. 399.07 behaves the same as 397.64 in that regard. I mentioned the earlier drivers to point out that the performance difference appears somewhere between drivers 388.59 and 397.64.

I didn’t try all the drivers in between, though, to find out exactly which driver update causes the different behavior.

If it helps I could do that…
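
Narrowing this down would not require installing every release: with the candidate versions sorted from oldest to newest, a binary search needs only about log2(n) installs. The following is a rough C++ sketch under stated assumptions: `firstChangedDriver` is a hypothetical helper, the `showsNewBehavior` callback stands in for the manual "install this driver, run the test, report the result" step, the oldest version in the list shows the old behavior, the newest shows the new one, and the behavior changes exactly once in between.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Returns the index of the first version that shows the new behavior.
// Assumptions: `versions` is sorted oldest to newest, the first entry shows the
// old behavior, the last entry shows the new behavior, and there is one switch.
std::size_t firstChangedDriver(
    const std::vector<std::string>& versions,
    const std::function<bool(const std::string&)>& showsNewBehavior) {
    if (versions.empty()) return 0;

    std::size_t oldIdx = 0;                    // known to show the old behavior
    std::size_t newIdx = versions.size() - 1;  // known to show the new behavior

    while (newIdx - oldIdx > 1) {
        std::size_t mid = oldIdx + (newIdx - oldIdx) / 2;
        if (showsNewBehavior(versions[mid])) {
            newIdx = mid;  // the change happened at or before mid
        } else {
            oldIdx = mid;  // the change happened after mid
        }
    }
    return newIdx;
}
```

Called with a list such as 388.59, 391.35, 397.64 (the versions mentioned in this thread), only the middle driver has to be installed and tested.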

FYI:
This is what WhoCrashed told me from the Minidump .dmp files…

System Information (local)

Computer name: ******
Windows version: Windows 10 , 10.0, build: 17744
Windows dir: C:\WINDOWS
Hardware: ASUSTeK COMPUTER INC., PRIME Z270-A
CPU: GenuineIntel Intel(R) Core™ i7-7700K CPU @ 4.20GHz Intel586, level: 6
8 logical processors, active mask: 255
RAM: 34286788608 bytes total


Crash Dump Analysis

Crash dumps are enabled on your computer.

Crash dump directories:
C:\WINDOWS
C:\WINDOWS\Minidump

On Wed 05-Sep-18 16:08:29 your computer crashed or a problem was reported
crash dump file: C:\WINDOWS\Minidump\090518-5593-01.dmp
This was probably caused by the following module: ntoskrnl.exe (nt+0x1C6D50)
Bugcheck code: 0x139 (0x3, 0xFFFF87842C6C6D40, 0xFFFF87842C6C6C98, 0x0)
Error: KERNEL_SECURITY_CHECK_FAILURE
file path: C:\WINDOWS\system32\ntoskrnl.exe
product: Microsoft® Windows® Operating System
company: Microsoft Corporation
description: NT Kernel & System
Bug check description: The kernel has detected the corruption of a critical data structure.
The crash took place in the Windows kernel. Possibly this problem is caused by another driver that cannot be identified at this time.

On Wed 05-Sep-18 16:08:29 your computer crashed or a problem was reported
crash dump file: C:\WINDOWS\MEMORY.DMP
This was probably caused by the following module: nvlddmkm.sys (nvlddmkm+0x6BA55)
Bugcheck code: 0x139 (0x3, 0xFFFF87842C6C6D40, 0xFFFF87842C6C6C98, 0x0)
Error: KERNEL_SECURITY_CHECK_FAILURE
file path: C:\WINDOWS\System32\DriverStore\FileRepository\nv_dispi.inf_amd64_c1a085cc86772d3f\nvlddmkm.sys
product: NVIDIA Windows Kernel Mode Driver, Version 391.35
company: NVIDIA Corporation
description: NVIDIA Windows Kernel Mode Driver, Version 391.35
Bug check description: The kernel has detected the corruption of a critical data structure.
A third party driver was identified as the probable root cause of this system error. It is suggested you look for an update for the following driver: nvlddmkm.sys (NVIDIA Windows Kernel Mode Driver, Version 391.35 , NVIDIA Corporation).
Google query: nvlddmkm.sys NVIDIA Corporation KERNEL_SECURITY_CHECK_FAILURE

On Wed 05-Sep-18 12:16:53 your computer crashed or a problem was reported
crash dump file: C:\WINDOWS\Minidump\090518-5640-01.dmp
This was probably caused by the following module: ntoskrnl.exe (nt+0x1C6D50)
Bugcheck code: 0x139 (0x3, 0xFFFFD587E7D42D40, 0xFFFFD587E7D42C98, 0x0)
Error: KERNEL_SECURITY_CHECK_FAILURE
file path: C:\WINDOWS\system32\ntoskrnl.exe
product: Microsoft® Windows® Operating System
company: Microsoft Corporation
description: NT Kernel & System
Bug check description: The kernel has detected the corruption of a critical data structure.
The crash took place in the Windows kernel. Possibly this problem is caused by another driver that cannot be identified at this time.
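
For reference, the four values in parentheses after the bugcheck code are the bugcheck parameters. For bug check 0x139, Microsoft's documentation describes the first parameter as the type of corruption, and the value 0x3 reported in all three dumps above as a corrupted LIST_ENTRY (for example, a double remove). Below is a small C++ sketch for pulling those values out of a WhoCrashed line; `Bugcheck`, `parseBugcheck` and `describeCorruption` are hypothetical names, and the lookup deliberately covers only the one value that appears in these dumps.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string>
#include <vector>

struct Bugcheck {
    std::uint64_t code = 0;             // e.g. 0x139
    std::vector<std::uint64_t> params;  // the values inside the parentheses
};

// Collects every "0x..." token in the line: the first is taken as the bugcheck
// code, the remaining ones as its parameters.
Bugcheck parseBugcheck(const std::string& line) {
    Bugcheck result;
    bool haveCode = false;
    std::size_t pos = 0;
    while ((pos = line.find("0x", pos)) != std::string::npos) {
        const char* start = line.c_str() + pos;
        char* end = nullptr;
        std::uint64_t value = std::strtoull(start, &end, 16);
        std::size_t consumed = static_cast<std::size_t>(end - start);
        if (consumed > 2) {  // at least one hex digit followed the "0x" prefix
            if (!haveCode) {
                result.code = value;
                haveCode = true;
            } else {
                result.params.push_back(value);
            }
            pos += consumed;
        } else {
            pos += 2;  // a bare "0x" with no digits: skip it
        }
    }
    return result;
}

// Only the combination seen in the dumps above is handled.
std::string describeCorruption(const Bugcheck& b) {
    if (b.code == 0x139 && !b.params.empty() && b.params[0] == 0x3) {
        return "LIST_ENTRY corrupted (for example, a double remove)";
    }
    return "not covered by this sketch";
}
```

Passing the line `Bugcheck code: 0x139 (0x3, 0xFFFF87842C6C6D40, 0xFFFF87842C6C6C98, 0x0)` yields code 0x139 with four parameters, the first of which selects the LIST_ENTRY description.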

I’m shamelessly bumping the topic myself again.
Does anyone have an idea?

Minor update:
The error still occurs with CUDA 10.0 and driver 411.31 under Windows 10 Enterprise 1803 (build 17134.345),
running two GTX 1080 Ti cards with SLI disabled.

It doesn’t appear often now, though: so far only when debugging in VS2017.

I don’t have any further information. If you can provide a deterministic set of steps that is guaranteed to demonstrate the issue, I suggest filing a bug at developer.nvidia.com