Can cuda-memcheck disturb multi-threaded multi-gpu CUDA applications' synchronization structure?

While profiling my application, I observed the following:

  • the kernel is very simple: it adds each thread’s own ID (which is also the array index) to the corresponding array element (asynchronous launch)
  • all CUDA calls (NVRTC, driver API) are wrapped with proper error checking, and no error is returned
  • it runs on two Quadro K420 cards concurrently, with the array split between them and fed using only asynchronous memory copies
  • the computed results match the expected results when checked on the host side
  • Visual Studio’s debugging tools don’t detect any host-side memory leak
  • Nsight doesn’t detect anything wrong and shows that each K420 works as expected
  • it works 100% of the time in both debug and release modes when run without cuda-memcheck
  • the Windows watchdog timer is not triggered, as each kernel takes only several microseconds, even with cuda-memcheck
  • it crashes randomly (~25% of the time), or always (with --force-blocking-launches yes), when run under cuda-memcheck
  • cuda-memcheck doesn’t report any error when this happens, but says “process didn’t terminate successfully” (of course, only when the application fails to complete its work)
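For reference, the host-side check described in the list could look roughly like the sketch below. It is plain C++ with no CUDA calls. The function names are hypothetical, and it assumes that each device's kernel adds the global index (chunk offset plus local thread ID) to its element, which the post does not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host reference: element i of the result is input[i] + i.
std::vector<int> expectedResult(const std::vector<int>& input) {
    std::vector<int> out(input);
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] += static_cast<int>(i);
    return out;
}

// Stand-in for one GPU's share of the work (assumption: the kernel adds the
// global index, i.e. chunk offset + local thread ID, to each element).
void simulateDeviceChunk(std::vector<int>& data, std::size_t offset, std::size_t count) {
    assert(offset + count <= data.size());
    for (std::size_t tid = 0; tid < count; ++tid)
        data[offset + tid] += static_cast<int>(offset + tid);
}

// Returns the index of the first differing element, or -1 if the vectors match.
long firstMismatch(const std::vector<int>& got, const std::vector<int>& want) {
    const std::size_t n = got.size() < want.size() ? got.size() : want.size();
    for (std::size_t i = 0; i < n; ++i)
        if (got[i] != want[i]) return static_cast<long>(i);
    return got.size() == want.size() ? -1 : static_cast<long>(n);
}
```

Reporting the first mismatching index, rather than only pass/fail, shows whether a wrong result falls in the first or the second card's half of the array.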

Setup: Windows 10 Home, driver 390.x, CUDA toolkit 9.1, 2x Quadro K420 (compute capability 3.0). Code: driver API + NVRTC, one context per GPU, one host thread per context.

What should I do when debugging leads to the “cuLaunchKernel” call and shows that all parameters are non-zero, but the exception message says

“Unhandled exception at 0x00007FFF87A6CD00 (nvcuda.dll) in myApp.exe: 0xC0000005: Access violation reading location 0x0000000000000000.”

which means something in nvcuda.dll tried to read through a null pointer. There are also null checks around this cuLaunchKernel call, and they don’t report any error either (the arguments all have their own heap allocations, so they can’t become null through scope issues).
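One thing that may be worth ruling out: cuLaunchKernel takes its kernel arguments as `void** kernelParams`, an array holding one pointer per argument, so a check on the array pointer alone would not catch a null entry inside it. Below is a minimal sketch of a per-entry check. The helper names are hypothetical and not part of the driver API, and it only detects null entries, not dangling ones.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a CUDA API): returns the index of the first null
// entry in a kernelParams-style array, or -1 if every entry is non-null.
// A null array pointer is reported as index 0.
inline long firstNullParam(void* const* kernelParams, std::size_t count) {
    if (kernelParams == nullptr) return 0;
    for (std::size_t i = 0; i < count; ++i)
        if (kernelParams[i] == nullptr) return static_cast<long>(i);
    return -1;
}

// Debug-build guard intended to be called right before the launch.
inline void assertParamsNonNull(void* const* kernelParams, std::size_t count) {
    assert(firstNullParam(kernelParams, count) == -1);
}
```

Calling `assertParamsNonNull(args, argCount)` in each host thread immediately before its launch would at least show whether the pointer array itself is intact at the moment of the crash.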

The documentation also says these calls can return errors from earlier asynchronous calls. I do use asynchronous memory copies between host and device, so I disabled the array copies, but the exception persists.

Also, why does it crash 100% of the time when I force synchronous kernel launches (“--force-blocking-launches yes”)? How would two cards be forced to wait for each other while myApp is in control of the multi-threading logic? Does it simply block the kernel call?

What could cause a “blocking” driver API kernel launch to read a null pointer when an asynchronous one doesn’t? Does it hold the kernel launch on the driver side while myApp continues its threading? That would be bad for me, especially if it does the same thing randomly even without “--force-blocking-launches”. Is there a rule of thumb that everything should be serialized before using a memory-leak or CUDA error detection tool?

Also, maybe less important, but Avira antivirus reports PUA/bitcoinminer.gen7(cloud) in myApp whenever I try to run it (so I put it on Avira’s exception list and it no longer scans it). I always delete the exe file and then rebuild the project from scratch. Does this mean Visual Studio is injecting something into it? Nvidia’s NVRTC example runs fine without Avira intervening. If something is wrong with runtime compilation, should I re-check my CUDA toolkit or drivers for infection? Why would Avira treat myApp differently even though it is compiled with the same Visual Studio instance, on the same machine, with the same drivers as Nvidia’s NVRTC sample project?

cuda-memcheck perturbs the execution order of both warps and threadblocks. Therefore, race conditions that are less evident normally may become more evident when running an app that way. None of these perturbations violates the CUDA execution model, so an app that fails in any way under cuda-memcheck should be considered suspect.

Furthermore, running an app under cuda-memcheck usually makes kernel runtimes much longer, by 10x or more in many cases. Therefore, an app that does not normally trigger the Windows WDDM TDR system may trigger it when run under cuda-memcheck, due to the increase in kernel runtime.

Have you modified or disabled the WDDM TDR watchdog?

I haven’t modified the WDDM TDR watchdog. Can cuda-memcheck really turn a microsecond kernel into a “second” kernel? Okay, I’ll change the value. It should have been 7 seconds iirc, but the app crashed within 3-4 seconds.

I increased TdrDelay from 8 seconds to 80 seconds, but it still fails within the first 2-3 seconds with the --force-blocking-launches option.

For now, I’m trying to narrow down where the crash happens by adding std::cout output at various places. I just figured out that it does not crash 100% of the time but about 99.99%: it crashes after several thousand calls, not at the first call. There are 30k calls in total.
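A note on the std::cout approach: std::cout is buffered, so the last few lines can be lost when the process dies, and output from two host threads can interleave. The sketch below is one way to work around both, with a mutex around each line and a flush after it. The class name and line format are hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: writes one complete checkpoint line at a time and
// flushes it immediately, so the last line before a crash is not lost.
class CheckpointLog {
public:
    explicit CheckpointLog(std::ostream& out) : out_(out) {}

    // device: index of the GPU this host thread drives; call: launch counter.
    void mark(int device, long call, const std::string& where) {
        assert(device >= 0);
        std::lock_guard<std::mutex> lock(mutex_);  // keep lines from interleaving
        out_ << "dev " << device << " call " << call << " " << where << '\n';
        out_.flush();
    }

private:
    std::ostream& out_;
    std::mutex mutex_;
};
```

Constructing it once as `CheckpointLog log(std::cerr);` and calling `log.mark(device, i, "before cuLaunchKernel")` and `log.mark(device, i, "after cuLaunchKernel")` in each host thread would show which call index and which device were active when the process died. Keep in mind that the extra locking changes timing, so it may also change how often the crash appears.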

Does this documentation help with this issue? It says

“Kernel launches larger than 16MB are not currently supported by CUDA‐MEMCHECK and may return erroneous results.”

How can I find out how much my kernel launch needs? Do the sizes add up when kernels are launched one after another? Is each launch in the kB range?

That appears to be coming from some old documentation: (2012)

The current cuda-memcheck documentation is here:

I couldn’t find that reference there. Unless you are using a CUDA version from ~2012 (say CUDA 6 or before) I would disregard that from the old doc. It’s always good practice to use the latest CUDA versions and refer to the latest docs, which are found here: