Compute-sanitizer not catching cudaErrorIllegalAddress

We’ve got a Windows-based CUDA application using CUDA SDK 11.1 Update 1 that performs some number crunching. One configuration of our application is reporting a cudaErrorIllegalAddress during a cudaStreamSynchronize() after downloading some data from the device to the host. Nothing helpful in our logging, and this doesn’t happen in our debug builds (which have much more extensive error-checking and so run significantly slower).

I added some extra stream-sync calls prior to the download call in our release build and confirmed the illegal-address error appears to be occurring in (one of) the many kernels prior to the download & stream-sync calls. Doing a stream-sync after each kernel slows things down significantly again, which prevents the original cudaErrorIllegalAddress from happening.

OK, so I fired up compute-sanitizer with the memcheck tool (“compute-sanitizer --tool memcheck <our_binary>”) to try to trap the offending kernel execution or API call. Now the weird part: compute-sanitizer doesn’t catch the error when it occurs. But we do still get that error from the runtime/API, and our application logs it.

Our logging shows a cudaStreamSynchronize() returning the same cudaErrorIllegalAddress prior to the download call (cudaMemcpyAsync, device->host).

But compute-sanitizer doesn’t show any errors – not a kernel causing issues, nor a specific API called with bad arguments. I recompiled our kernels with -lineinfo in case that affects the tool’s ability to show “where” in the kernel the error was occurring, but same thing: compute-sanitizer doesn’t show any errors.

What am I missing? compute-sanitizer is definitely doing something, as performance drops through the floor when running our application under it. But I’m struggling to come up with any idea of how the CUDA runtime could catch this error and return it to us for logging, while compute-sanitizer wouldn’t catch it before that point.

Original logic when first detecting the error:

Async Kernels A > B > C > (etc)… download(async memcpy)…stream-sync = illegal-address.

Modified code when troubleshooting:

Async Kernels A > B > C > (etc)…stream-sync = illegal-address…never get to download(async memcpy)

In both cases above, whether sync’ing after or prior to the download (async memcpy) API call, compute-sanitizer doesn’t flag any issues, but the runtime returns the error code.

The illegal address error on the cudaMemcpy D2H is a typical symptom of a kernel (device code) that has done something illegal.

I would recommend trying cuda-memcheck on the executable from the command line as well. Compiling with lineinfo is good.

Usually cuda-memcheck is good at finding such issues. However, cuda-memcheck does affect device code scheduling behavior. If the scheduling behavior affects the algorithmic behavior (e.g. a data-dependent problem) then I suppose it may be possible that this cuda-memcheck characteristic could mask the problem.

In your particular case, it may also be useful to put a stream sync after each kernel, to see if you can at least localize the error to a particular kernel call.
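
For illustration, here is a minimal sketch of that localization pattern (the kernels and the SYNC_AND_CHECK macro are stand-ins I’ve made up, with one deliberate out-of-bounds write playing the role of the unknown bug; this is not the application’s actual code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// After each launch: drain the stream, then pick up any pending error,
// tagged with the launch site so the offending kernel is identifiable.
#define SYNC_AND_CHECK(stream, tag)                                   \
    do {                                                              \
        cudaError_t e_ = cudaStreamSynchronize(stream);               \
        if (e_ == cudaSuccess) e_ = cudaGetLastError();               \
        if (e_ != cudaSuccess)                                        \
            printf("%s failed: %s\n", tag, cudaGetErrorName(e_));     \
    } while (0)

__global__ void kernelA(float *p, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) p[i] = 1.0f;
}

__global__ void kernelB(float *p, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    p[i + (n << 10)] = 2.0f;   // deliberate bug: writes far past the allocation,
                               // typically reported as cudaErrorIllegalAddress
}

int main() {
    const size_t n = 1 << 20;
    const int blocks = (int)((n + 255) / 256);

    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // With a sync+check after every launch the error is attributed to kernelB;
    // without them it would only surface at some later sync (e.g. the one after
    // the D2H copy, as in the case described above).
    kernelA<<<blocks, 256, 0, s>>>(d, n);
    SYNC_AND_CHECK(s, "kernelA");
    kernelB<<<blocks, 256, 0, s>>>(d, n);
    SYNC_AND_CHECK(s, "kernelB");

    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```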

Thanks for the feedback: agreed that the illegal-address on the D2H is probably a kernel prior. The pre-D2H stream-sync I initially added agrees that it’s not the download stage that’s causing the issue.

The “compute-sanitizer” tool is definitely altering the scheduling behavior, but it isn’t masking the problem. It’s just failing to trap/catch the error – it still comes back to our application via the API return code, and we log it when running under compute-sanitizer just like we do when running “normally”. It seems odd that the tool would not be aware of the error occurring at all. Nothing in the compute-sanitizer output on the cmd-prompt shows any errors/warnings/etc.

I can revert our test to using the old-school cuda-memcheck with CUDA_MEMCHECK_PATCH_MODULE=1 and see if the old tools pick it up.

Per the localization using stream-syncs – agreed. I tried a few variations, but balancing between “sync after everything” (which slows things down enough to hide the error) and keeping the original timing is slow & cumbersome; it may eventually boil down to brute-force manual syncs to isolate further.

Thanks again, will report back if cuda-memcheck agrees/disagrees with compute-sanitizer.

Yes, I agree. There doesn’t seem to be a rational explanation for the tool not catching it when the API reports the error.

If all else fails, reducing it to a minimal reproducer may shed light on the issue, or at least prepare you to file a bug if you choose to do so.

Are you using CUFFT by any chance?

Yes, a simple/minimal repro would be an ideal outcome.

Yes, we are using both cuFFT & cuRAND alongside our custom kernels.

No joy with the old-school cuda-memcheck tools from 10.x – it just seemed to blow up the system RAM (north of 20GB after 8+ hours, and the run wasn’t even fully loaded yet). Will continue down the “insert stream syncs slowly” path and try to isolate it with manual testing.

The only reason I asked about CUFFT is because you had mentioned that “I can revert our test to using the old-school cuda-memcheck with CUDA_MEMCHECK_PATCH_MODULE=1”, and the only time in my experience where I ever needed to use that was when I was using CUFFT (e.g. see here). I simply wanted to see if this report was similar or different. Seems to be similar. Nothing beyond that. I have no reason to think the problem here is somehow uniquely related to CUFFT.

Ah, OK – yep, it was probably 15+ months back when we had to add the patch-module=1 env var; cuFFT and our application started having issues under cuda-memcheck (apparently due to the total # of kernels?). Once we added that, everything was back to normal.

So I’ve got two small updates from the manual testing: after adding a handful of stream-syncs, the underlying issue hid again, so I removed them just to re-establish that I’m still seeing the underlying error. This testing is run under the VS2019 debugger only, with no compute-sanitizer or cuda-memcheck:

1. I get a hang in the application now. Pausing and breaking into the debugger shows the same area that was previously getting an illegal-address on the stream-sync directly after the D2H copy is now hung deep inside cudaStreamSynchronize (cudart::cudaApiStreamSynchronize… and then down into nvcuda64.dll).

2. Only once, I got a failure-to-launch error instead of illegal-address. I’ve seen this error be somewhat interchangeable with illegal-address on different previous issues, so not sure this is helpful information.

Not sure if #1 and the cudaStreamSynchronize() hang behavior are helpful for isolating this further?

Once you get an illegal address, then the CUDA context is corrupted, and you would get a failure on any subsequent CUDA thing that you tried. I realize this is not your description exactly, but a failure to launch (with otherwise valid launch parameters) could indicate a previous error that corrupted the CUDA context. If you have a large code base sometimes it is useful if you get a repeatable failure like this to do cuda error checking (cudaGetLastError(), possibly preceded by a stream sync or device sync) before the error that you are observing, to see if it is occurring “somewhere else”.

Again, not fitting your description here exactly, but I’m grasping at straws right now.

The other thing I will say for exceptionally squirrely behavior that manifests in host code would be to consider the possibility of host stack corruption. I’ve seen host stack corruption manifest on entry to CUDA calls quite a few times. The CUDA call fails in a bizarre way, and eventually I figure out there is host stack corruption. I’m not saying I know that to be the case here, but it’s sometimes what I grasp at when I cannot see anything else to grasp. For host stack corruption investigation, the preferred tool (I think) is valgrind, but unfortunately running valgrind on a CUDA code can be a very mixed bag. It might be useful, it might not, and will probably require some patience and tolerance and reading of the tea-leaves. I don’t know what the equivalent to valgrind would be on windows.

Other than that, I’m pretty clueless about what may be happening here.

Hm, host-side corruption is definitely something we haven’t examined deeply. We do have some tools on that front that we haven’t tried in this setup; I’ve been locked into device-side code, but that’s a good suggestion!

I’ve been tearing our framework down to decompose it into the most basic testing I can – things are still very weird. At this point, I’m suspicious there is either a large gap in my understanding of some of the most basic behaviors of the APIs, or there are some issues in the CUDA runtime in WDDM mode:

  1. Things are still timing sensitive. Inserting extra, redundant cudaSetDevice() calls seems to impact the sample test code.
  2. Removing our kernel framework and just putting in some very simple “set zero” kernels still shows the issue, on both a multi-GPU system (2x RTX 2070 SUPER) and a single-GPU system (1x GTX 1070); a minimal sketch of that kind of test is below.
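
For reference, a stripped-down sketch of the kind of test described in item 2 (buffer sizes, launch counts, and the kernel body are arbitrary choices of mine, not the exact reproducer that was submitted):

```cpp
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Trivial "set zero" kernel: no shared memory, no atomics, nothing fancy.
__global__ void setZero(float *p, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) p[i] = 0.0f;
}

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    const size_t n = 1 << 22;           // arbitrary buffer size (16 MB of floats)
    const int launchesPerDevice = 10000; // arbitrary: enough to keep the launch queue deep

    std::vector<float*> buf(deviceCount, nullptr);
    std::vector<cudaStream_t> stream(deviceCount);

    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&buf[d], n * sizeof(float));
        cudaStreamCreate(&stream[d]);
    }

    // Queue a large number of async launches per device/stream. Expected behavior:
    // launches may eventually block while the queue drains, but nothing should hang,
    // and the final sync should return cudaSuccess.
    for (int i = 0; i < launchesPerDevice; ++i) {
        for (int d = 0; d < deviceCount; ++d) {
            cudaSetDevice(d);
            setZero<<<(int)((n + 255) / 256), 256, 0, stream[d]>>>(buf[d], n);
        }
    }

    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);
        cudaError_t err = cudaStreamSynchronize(stream[d]);
        printf("device %d: %s\n", d, cudaGetErrorName(err));
    }
    return 0;
}
```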

But before I write up and submit a bug, some level-set questions:
a. Even under WDDM, there isn’t a limit to the # of outstanding async kernel requests submitted per stream, is there? Obviously there are limits to resources, and eventually something may block, but hangs shouldn’t happen.
b. Multiple-GPUs are supported under WDDM, correct?

a. The expected behavior is that if you launch a large enough number of outstanding async kernel requests, eventually the kernel launch becomes a blocking operation, but nothing should “hang”.
b. Yes, you can use multiple GPUs under WDDM.

I usually recommend that people advance their software installs to the latest driver and latest CUDA version, before submitting bugs.

OK, on the same page per A & B – that’s the behavior I understood.

We are on CUDA SDK 11.1 Update 1… drivers should be from within the last 2 weeks, but I will review & update if anything newer is available prior to submitting some of these tests for the WDDM runtime.

Thanks!

WDDM & Multi-GPU > CUDA runtime hangs & launch-failures | NVIDIA Developer

Bug submitted – stripped down, stand-alone .cu file that reproduces a simple version of the symptoms on a multi-GPU / WDDM platform sent to the NVSDKIssues address.

Thanks!

I was finally able to trap the error on a specific kernel causing the cudaErrorIllegalAddress on our multi-GPU box (syncs after every kernel call reduce the frequency dramatically, but it did finally occur). But I’m highly skeptical this kernel was actually causing it. It’s a simple “add” kernel (a += b) that is as bare-bones as it gets: no shared memory, no reduction, no atomic ops, etc., nothing fancy in the loop.
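
For what it’s worth, the shape I would assume for such a kernel is roughly the following (a sketch with placeholder names, not the actual kernel; the grid-stride loop is my assumption based on the “loop” mentioned above):

```cpp
// Assumed shape only: in-place elementwise add (a += b) with a grid-stride loop.
__global__ void addInPlace(float *a, const float *b, size_t n) {
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)gridDim.x * blockDim.x)
        a[i] += b[i];
}
```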

In our complicated multi-thread & multi-stream production code, that simple add kernel (a += b) in a deeply nested stack caught the illegal-address after doing a post-kernel sync() call. However, it had been running for 3+ hours, executing that same kernel on that same data set in that same thread & stream. So I’m still highly suspicious that the previously submitted bug, showing an example of cudaLaunchKernel getting into a bad state, is somehow related here and leading to corruption of the parameters.

But it does highlight an interesting side effect of the current CUDA API design around the treatment of errors as “global” state: that thread may simply have been the first to “see” the error, and there’s currently no way that I can see to detect only the errors that have occurred due to a specific stream’s behavior.

e.g. in an 8-thread + 8-stream (1:1) scenario, an error caused by thread #3 may be caught by thread #5 due to the “global” nature of the error-state API design. Is there a reason not to support keeping stream errors as stream-local state?
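
To illustrate that scenario, here is a contrived sketch (not our production code): each host thread owns one stream, one launch is deliberately broken, and because device-side faults are sticky and context-wide rather than stream-local, which thread actually reports the cudaErrorIllegalAddress depends on timing.

```cpp
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

__global__ void faultyKernel(int *p) { *p = 42; }        // p will be nullptr -> illegal address
__global__ void benignKernel(int *p) { if (p) *p = 1; }

int main() {
    int *good = nullptr;
    cudaMalloc(&good, sizeof(int));

    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    auto worker = [&](int tid) {
        for (int iter = 0; iter < 1000; ++iter) {
            if (tid == 0 && iter == 500)
                faultyKernel<<<1, 1, 0, s[0]>>>(nullptr); // the "other" thread's bug
            else
                benignKernel<<<1, 1, 0, s[tid]>>>(good);
            cudaError_t err = cudaStreamSynchronize(s[tid]);
            if (err != cudaSuccess) {
                // Device-side faults are sticky and context-wide, so *either* thread
                // can be the one that reports cudaErrorIllegalAddress here.
                printf("thread %d saw %s at iteration %d\n", tid, cudaGetErrorName(err), iter);
                return;
            }
        }
    };

    std::thread t0(worker, 0), t1(worker, 1);
    t0.join();
    t1.join();
    return 0;
}
```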

Debugging and tracking down root causes in multi-thread/multi-stream scenarios would be much cleaner & simpler with the introduction of a configuration option (perhaps a new flag to cudaStreamCreateWithFlags alongside the non-blocking flag that exists today?) to keep errors “local” to within the stream itself (along with a cudaStreamGetLastError/cudaStreamPeekLastError).

Defaulting to the current behavior would maintain legacy compatibility, but for a heavily threaded/multi-stream-based application that wanted to use this new flag (and the corresponding cudaStreamGet/PeekLastError) it would make life much better.

Isolated with a reproducible test on single- & multi-GPU systems as bug #3206933.

cuFFT causes illegal-address errors on a multi-GPU system | NVIDIA Developer