Can cuda-memcheck disturb multi-threaded multi-gpu CUDA applications' synchronization structure?

tugrul_192bit · March 20, 2018, 3:08pm

While profiling my application, I observed the following:

The kernel is very simple: it adds to each array element its own thread ID (which is also the array index). It is launched asynchronously.
All CUDA calls (NVRTC, driver API) are wrapped with proper error checking, and no error is returned.
It runs on two Quadro K420 cards concurrently; the array is split between them and fed using only asynchronous memory copies.
The computation returns correct results when checked against the expected results on the host side.
Visual Studio’s debugging tool doesn’t detect any host-side memory leak.
Nsight doesn’t detect anything wrong and shows that each K420 works as expected.
It works 100% of the time, in both debug and release modes.
The Windows watchdog timer is not triggered, as each kernel takes only several microseconds, even with cuda-memcheck.
It crashes randomly (~25% of the time), or always (with --force-blocking-launches yes), when run under cuda-memcheck.
cuda-memcheck doesn’t find any error while doing this but says “process didn’t terminate successfully” (of course, only when the application fails to complete its work).
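
For reference, the host-side check mentioned above can be sketched roughly as follows. This is a host-only C++ model, not the actual application code: the function names (`expectedResult`, `simulateSplitRun`, `firstMismatch`) are hypothetical, and the split run only imitates how each device would add the global index (chunk offset plus local thread ID) to its part of the array.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host model of the kernel described above: element i becomes input[i] + i.
std::vector<int> expectedResult(const std::vector<int>& input) {
    std::vector<int> out(input);
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] += static_cast<int>(i);
    }
    return out;
}

// Host model of the split run (assumption: contiguous chunks, one per device).
// Each "device" adds the global index, i.e. chunk offset + local thread ID.
std::vector<int> simulateSplitRun(const std::vector<int>& input,
                                  std::size_t deviceCount) {
    assert(deviceCount > 0);
    std::vector<int> out(input);
    const std::size_t chunk = (input.size() + deviceCount - 1) / deviceCount;
    for (std::size_t dev = 0; dev < deviceCount; ++dev) {
        const std::size_t begin = std::min(dev * chunk, input.size());
        const std::size_t end = std::min(begin + chunk, input.size());
        for (std::size_t local = 0; local < end - begin; ++local) {
            out[begin + local] += static_cast<int>(begin + local);
        }
    }
    return out;
}

// Returns the index of the first differing element, or -1 if both match.
long firstMismatch(const std::vector<int>& expected,
                   const std::vector<int>& actual) {
    const std::size_t common = std::min(expected.size(), actual.size());
    for (std::size_t i = 0; i < common; ++i) {
        if (expected[i] != actual[i]) return static_cast<long>(i);
    }
    // A length difference counts as a mismatch at the first missing index.
    return expected.size() == actual.size() ? -1 : static_cast<long>(common);
}
```

In the real application the second argument of `firstMismatch` would be the array copied back from the two GPUs; a result of -1 means the device output matches the host reference.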

Setup: Windows 10 Home, driver 390.x, CUDA Toolkit 9.1, 2x Quadro K420 (CUDA compute capability 3.0). Code: driver API + NVRTC, one context per GPU, one host thread per context.

What should I do when debugging leads to the “cuLaunchKernel” call and shows that all parameters are non-zero, but the exception message says

“Unhandled exception at 0x00007FFF87A6CD00 (nvcuda.dll) in myApp.exe: 0xC0000005: Access violation reading location 0x0000000000000000.”

which means something in nvcuda.dll tried to read a null pointer. There are also null checks around this cuLaunchKernel call, and they don’t report any error either (the arguments all have their own storage on the heap, so they don’t become null through scope issues).
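
To make the kind of check described here concrete: `cuLaunchKernel` takes its arguments through `kernelParams`, an array with one pointer per kernel parameter, each pointing at the storage of that argument, and the driver reads as many entries as the kernel declares. The following host-only C++ sketch shows that layout and a pre-launch check. `LaunchArgs` and `paramsLookValid` are hypothetical names, `std::uint64_t` merely stands in for `CUdeviceptr`, and nothing here calls the driver.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical holder for the arguments of a two-parameter kernel.
// Each argument lives on the heap, so its address stays valid for as long
// as this object exists, independent of the scope that created the values.
struct LaunchArgs {
    std::unique_ptr<std::uint64_t> deviceBuffer;  // stand-in for CUdeviceptr
    std::unique_ptr<int> elementCount;
    std::vector<void*> kernelParams;  // what would be passed to the launch

    LaunchArgs(std::uint64_t buffer, int n)
        : deviceBuffer(new std::uint64_t(buffer)),
          elementCount(new int(n)),
          kernelParams{deviceBuffer.get(), elementCount.get()} {}
};

// Checks that the array has exactly one entry per declared kernel parameter
// and that no entry is null. It cannot detect dangling pointers.
bool paramsLookValid(const std::vector<void*>& params,
                     std::size_t declaredParamCount) {
    if (params.size() != declaredParamCount) return false;
    for (void* p : params) {
        if (p == nullptr) return false;
    }
    return true;
}
```

In real code, `kernelParams.data()` would be the `kernelParams` argument of `cuLaunchKernel`. A check like this only rules out a short or null-containing array; it says nothing about whether the storage behind each pointer is still alive in another thread.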

The documentation also says these calls can return errors from other asynchronous calls. Yes, I’m using asynchronous memory copies between host and device too, so I disabled the array copies, but the exception persists.

Also, why does it crash 100% of the time when I force kernel launches to be synchronous (“--force-blocking-launches yes”)? How would two cards be forced to wait for each other while myApp is in control of the multi-threading logic? Does it simply block the kernel call?

What could cause a “blocking” driver API kernel launch to read a null pointer when the asynchronous one doesn’t? Does it hold the kernel launch on the driver side while myApp continues its threading? That would be bad for me, especially if it does the same thing randomly even without “--force-blocking”. Is there a rule of thumb that everything should be serialized before using any memory-leak or CUDA leak detection tool?

Also, maybe less important, but Avira antivirus flags PUA/bitcoinminer.gen7 (cloud) in myApp whenever I try to run it, so I added it to Avira’s exception list and it is no longer scanned. I always delete the exe file and then build the project anew. Does this mean Visual Studio is injecting things into it? Nvidia’s NVRTC sample runs fine without Avira intervening. If something is wrong with runtime compilation, should I re-check my CUDA toolkit or drivers for infection? Why would Avira treat myApp differently even though it is compiled with the same Visual Studio instance, on the same machine, with the same drivers as Nvidia’s NVRTC sample project?

Robert_Crovella · March 20, 2018, 3:45pm

cuda-memcheck perturbs execution order of both warps and threadblocks. Therefore, race conditions that are less evident normally may become more evident when running an app that way. None of these perturbations is in violation of the CUDA execution model, so an app that fails in any way under cuda-memcheck may be suspect.
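
As a rough illustration of why a changed execution order can expose a latent race, here is a host-only C++ toy model. It is not how cuda-memcheck schedules anything; the function names are made up, and each loop iteration merely plays the role of one threadblock. The “racy” version updates an array in place while reading a neighbour's element, so its result depends on the order in which the blocks run; the race-free version reads only from the input and writes to a separate output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Racy toy "kernel": block b updates data[b] in place using its neighbour's
// current value, which another block may or may not have updated already.
std::vector<int> runRacy(std::vector<int> data,
                         const std::vector<std::size_t>& blockOrder) {
    const std::size_t n = data.size();
    for (std::size_t b : blockOrder) {
        assert(b < n);
        data[b] += data[(b + 1) % n];
    }
    return data;
}

// Race-free toy "kernel": block b reads only the untouched input and writes
// only its own output slot, so any execution order gives the same result.
std::vector<int> runRaceFree(const std::vector<int>& input,
                             const std::vector<std::size_t>& blockOrder) {
    const std::size_t n = input.size();
    std::vector<int> out(n);
    for (std::size_t b : blockOrder) {
        assert(b < n);
        out[b] = input[b] + input[(b + 1) % n];
    }
    return out;
}
```

Running both versions with the block order `{0, 1, 2, 3}` and then `{3, 2, 1, 0}` on the same input gives different results for `runRacy` and identical results for `runRaceFree`. Both orders are equally legal under the CUDA execution model, which is why a failure that only appears under one of them still points at the application.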

Furthermore, running an app under cuda-memcheck usually makes the kernel runtimes much longer, possibly 10x or longer in many cases. Therefore, an app that is not triggering the Windows WDDM TDR system may trigger it when run under cuda-memcheck, due to the increase in kernel runtime.

Have you modified or disabled the WDDM TDR watchdog?

tugrul_192bit · March 20, 2018, 3:49pm

I haven’t modified the WDDM TDR watchdog. Also, can it turn a microsecond kernel into a “second” kernel? Okay, I’ll change the value. It should’ve been 7 seconds iirc, but the app crashed within 3-4 seconds.

tugrul_192bit · March 20, 2018, 3:56pm

I increased TdrDelay from 8 seconds to 80 seconds; it still fails within the first 2-3 seconds with the --force-blocking option.

tugrul_192bit · March 20, 2018, 4:19pm

For now, I’m trying to narrow down where the crash occurs by adding std::cout statements in various places. I just figured out that it doesn’t crash 100% of the time but 99.99% of the time: it crashes after several thousand calls, not at the first call. There are 30k calls in total.

tugrul_192bit · March 20, 2018, 4:30pm

Does this documentation help with this issue? It says

“Kernel launches larger than 16MB are not currently supported by CUDA‐MEMCHECK and may return erroneous results.”

How can I know how much my kernel launch needs? Do the launches add up when they are issued one after another? Is each of them in the kB range?

Robert_Crovella · March 20, 2018, 4:46pm

That appears to come from some old documentation (2012):

https://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/cuda-memcheck.pdf

The current cuda-memcheck documentation is here:

http://docs.nvidia.com/cuda/cuda-memcheck/index.html

I couldn’t find that reference there. Unless you are using a CUDA version from ~2012 (say CUDA 6 or before) I would disregard that from the old doc. It’s always good practice to use the latest CUDA versions and refer to the latest docs, which are found here:

http://docs.nvidia.com/cuda/index.html
