While profiling my application, I observed the following:
- the kernel is very simple: it just adds to each array element its own thread id (which is also the array index), and it is launched asynchronously (see the sketch after this list).
- all CUDA calls (NVRTC and driver API) are wrapped with proper error checking, and no error is returned.
- it runs on two Quadro K420 cards concurrently, each fed half of the array using only async memory copies.
- the computation returns correct results when checked against the expected results on the host side.
- Visual Studio’s debugging tools don’t detect any host-side memory leak.
- Nsight doesn’t detect anything wrong and shows that each K420 works as expected.
- it works 100% of the time, in both debug and release builds.
- the Windows watchdog timer is not triggered, since each kernel takes only several microseconds, even under cuda-memcheck.
- it crashes randomly (~25% of the time) or always (with --force-blocking-launches yes) when run under cuda-memcheck.
- cuda-memcheck doesn’t find any error while doing this, but says “process didn’t terminate successfully” (of course, only when the application fails to complete its work).
Environment: Windows 10 Home, driver 390.x, CUDA Toolkit 9.1, 2x Quadro K420 (compute capability 3.0). Code: driver API + NVRTC, one context per GPU, one host thread per context.
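For reference, here is a minimal sketch of that setup; the macro names and the exact kernel source are illustrative assumptions, not my actual code:

```cpp
#include <cuda.h>
#include <nvrtc.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical error-check wrappers; every driver/NVRTC call in the real
// app is wrapped like this, and none of them reports an error.
#define CU_CHECK(call) do { \
    CUresult _e = (call); \
    if (_e != CUDA_SUCCESS) { \
        const char* _msg = nullptr; \
        cuGetErrorString(_e, &_msg); \
        fprintf(stderr, "CUDA error %d (%s) at %s:%d\n", (int)_e, _msg, __FILE__, __LINE__); \
        exit(1); \
    } \
} while (0)

#define NVRTC_CHECK(call) do { \
    nvrtcResult _e = (call); \
    if (_e != NVRTC_SUCCESS) { \
        fprintf(stderr, "NVRTC error: %s at %s:%d\n", nvrtcGetErrorString(_e), __FILE__, __LINE__); \
        exit(1); \
    } \
} while (0)

// The kernel itself (compiled at runtime with NVRTC): each element gets
// its own thread id, which is also its index, added to it.
static const char* kKernelSrc = R"(
extern "C" __global__ void addTid(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += i;
}
)";
```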
What should I do when debugging leads to the “cuLaunchKernel” call and shows that all parameters are non-zero, but the exception message says
“Unhandled exception at 0x00007FFF87A6CD00 (nvcuda.dll) in myApp.exe: 0xC0000005: Access violation reading location 0x0000000000000000.”
which means something in nvcuda.dll tried to read a null pointer? There are also checks against zero around this cuLaunchKernel call; they don’t return any error either (all the parameter values have their own heap storage, so they can’t become zero through scope issues).
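For context, this is the launch pattern in question, sketched with a hypothetical helper name (launchAddTid). Note that kernelParams must be an array of host pointers, one per kernel argument, each pointing at that argument’s storage (e.g. &d_data, the address of the CUdeviceptr, not the CUdeviceptr itself), and the array plus the storage must still be valid at the moment cuLaunchKernel executes:

```cpp
#include <cuda.h>
// Uses the CU_CHECK macro from the sketch above.

void launchAddTid(CUfunction fn, CUdeviceptr d_data, int n, CUstream stream)
{
    void* params[] = { &d_data, &n };        // pointers TO the arguments
    unsigned threads = 256;
    unsigned blocks  = (unsigned)((n + 255) / 256);
    CU_CHECK(cuLaunchKernel(fn,
                            blocks, 1, 1,    // grid dimensions
                            threads, 1, 1,   // block dimensions
                            0,               // dynamic shared memory bytes
                            stream,          // async launch on this stream
                            params,          // kernel argument pointers
                            nullptr));       // "extra" launch options (unused)
}
```

As far as I understand, the driver copies the argument values during the cuLaunchKernel call itself, so stack storage like params[] above should be safe; if so, the null read would have to come from something else, e.g. a stale CUfunction or a context mix-up between threads.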
The documentation also says these commands can return errors from other, earlier asynchronous calls. Yes, I’m using asynchronous memory copies between host and device too, so I disabled the array copies, but the exception persists.
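For completeness, the per-GPU feed looks roughly like this (a sketch; runHalf and hostHalf are hypothetical names). Two things worth double-checking in this pattern: the host buffers should be page-locked (e.g. via cuMemHostAlloc), since an async copy from pageable memory may silently be performed synchronously, and each stream must be synchronized before its host buffer is read or freed:

```cpp
// One host thread per GPU runs something like this on its half of the array.
// hostHalf is assumed to be page-locked so the async copies are truly async.
void runHalf(CUcontext ctx, CUfunction fn, int* hostHalf, int n)
{
    CU_CHECK(cuCtxSetCurrent(ctx));           // this thread owns this context

    CUstream stream;
    CU_CHECK(cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING));

    CUdeviceptr d_data;
    CU_CHECK(cuMemAlloc(&d_data, n * sizeof(int)));

    CU_CHECK(cuMemcpyHtoDAsync(d_data, hostHalf, n * sizeof(int), stream));
    launchAddTid(fn, d_data, n, stream);      // async launch (sketch above)
    CU_CHECK(cuMemcpyDtoHAsync(hostHalf, d_data, n * sizeof(int), stream));

    // Nothing on the host may touch hostHalf or free d_data before this point.
    CU_CHECK(cuStreamSynchronize(stream));

    CU_CHECK(cuMemFree(d_data));
    CU_CHECK(cuStreamDestroy(stream));
}
```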
Also, why does it crash 100% of the time when I force kernel launches to be blocking (--force-blocking-launches yes)? How would the two cards be forced to wait for each other while myApp is in control of the multithreading logic? Does it simply block the kernel call?
What could cause a “blocking” driver API kernel launch to read a “zero pointer” while the async one doesn’t? Does the driver hold the kernel launch on its side while myApp continues its threading? That would be bad for me, especially if it does the same thing randomly even without --force-blocking. Is there a rule of thumb that says everything should be serialized before using any memory-leak or CUDA leak-detection tool? (See the sketch below for the serialization experiment I have in mind.)
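One experiment that might narrow this down (a sketch of an idea, not a known fix): synchronize immediately after every launch even without cuda-memcheck. This approximates what --force-blocking-launches does and should make a latent asynchronous failure surface deterministically at the call that caused it, rather than at some later, seemingly unrelated call:

```cpp
// Debug-only variant of the launch helper from the earlier sketch.
void launchAddTidBlocking(CUfunction fn, CUdeviceptr d_data, int n, CUstream stream)
{
    launchAddTid(fn, d_data, n, stream);      // the normal async launch
    CU_CHECK(cuStreamSynchronize(stream));    // force any failure to surface here
}
```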
Also, maybe not so important, but: Avira antivirus flags myApp as PUA/bitcoinminer.gen7 (cloud) whenever I try to run it (so I put it on Avira’s exception list so it isn’t scanned anymore). I always delete the exe file and then rebuild the project from scratch. Does this mean Visual Studio is injecting something into it? NVIDIA’s NVRTC example runs fine without Avira intervening. If something is wrong with runtime compilation, should I check my CUDA toolkit or drivers for infection? Why would Avira treat myApp differently even though it is compiled on the same Visual Studio instance, the same machine, and the same drivers as NVIDIA’s NVRTC sample project?