Kernel execution on RTX 2080 freezes, and cudaErrorIllegalAddress with nvprof

Hi,

I’m facing a really weird problem in my code. It was up and running 3 months ago, when I had CUDA 10.0. After updating to CUDA 10.1, nvprof is giving me a cudaErrorIllegalAddress where it previously ran fine. With nvprof 10.0, however, it still works fine. I don’t find any memory problems with cuda-memcheck, and executing the program normally is fine.

Moreover, I have configured a server with an RTX 2080 and the same operating system (Ubuntu 18.04). There, when I run the application, it just freezes at the first kernel execution and doesn’t go beyond it (I have to kill the program with Ctrl-C). Under cuda-memcheck the execution also freezes, and both nvprof 10.0 and 10.1 report similar memory problems.

Any idea of what I am facing here?

Thank you very much,
Miguel

nvprof doesn’t report cudaErrorIllegalAddress.

Maybe your application is reporting that when you run it under nvprof.

You may be hitting a kernel timeout, if these GPUs are configured for X or being used for display.
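For reference, one quick way to check whether the display watchdog could even apply is to query the kernel-execution-timeout attribute. A minimal sketch, assuming device 0 is the RTX 2080:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int timeoutEnabled = 0;
    // 1 if the display watchdog can kill long-running kernels on this device
    cudaError_t err = cudaDeviceGetAttribute(&timeoutEnabled,
                                             cudaDevAttrKernelExecTimeout, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("kernel execution timeout: %s\n", timeoutEnabled ? "enabled" : "disabled");
    return 0;
}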

Thanks for the answer.

Yes, sorry for my bad explanation; indeed it’s the application that is reporting the cudaErrorIllegalAddress, but only under nvprof and cuda-memcheck --tool synccheck, at a cudaMemcpy. If I run the application normally on the RTX 2080, or even under cuda-memcheck or cuda-gdb, it gets stuck within the second kernel of the workflow. The GPU appears to work fine with the CUDA samples and with old versions of our simulator.
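A note on why the failure surfaces at a cudaMemcpy: device-side faults are reported asynchronously, so an illegal address raised inside a kernel typically shows up at the next API call that synchronizes with the device. A minimal sketch of wrapping each launch with a check so the fault gets attributed to the right kernel (kernel and buffer names below are placeholders, not the real workflow):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t e_ = (call);                                         \
        if (e_ != cudaSuccess) {                                         \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,           \
                    cudaGetErrorString(e_));                             \
            exit(EXIT_FAILURE);                                          \
        }                                                                \
    } while (0)

__global__ void firstKernel(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 1.0f;
}

__global__ void secondKernel(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_buf = nullptr;
    CHECK(cudaMalloc(&d_buf, n * sizeof(float)));

    firstKernel<<<(n + 255) / 256, 256>>>(d_buf, n);
    CHECK(cudaGetLastError());          // launch-configuration errors
    CHECK(cudaDeviceSynchronize());     // device-side faults from this kernel

    secondKernel<<<(n + 255) / 256, 256>>>(d_buf, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());     // without these syncs, a fault would only
                                        // show up at a later cudaMemcpy
    CHECK(cudaFree(d_buf));
    return 0;
}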

But this is not happening with previous cards, where our current version works fine (e.g. a K40; we recently tested a P100 successfully too).

Our RTX 2080s are not being used for X; only the integrated GPU is.

Your observations are consistent with the presence of out-of-bounds memory accesses in your code.

Not every out-of-bounds access will trigger an illegal address exception, only those that happen to fall outside the memory allocations made by the program, which typically have page granularity. This is very much analogous to what happens on CPUs, where one needs a tight, byte-accurate checker like valgrind to find all out-of-bounds accesses, with most such accesses typically being undetectable by standard OS mechanisms.
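To make the granularity point concrete, here is a small sketch (buffer size and overrun are made up) where writes a few elements past the end of a cudaMalloc'ed buffer will often not fault, because the allocation is rounded up, yet cuda-memcheck reports every one of them as an out-of-bounds access:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void writePastEnd(int *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[i] = i;   // deliberately no bounds check: threads with i >= n overrun the buffer
}

int main() {
    const int n = 1000;                       // 4000 bytes requested ...
    int *d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(int));      // ... but the allocation is rounded up

    writePastEnd<<<4, 256>>>(d_buf, n);       // 1024 threads: indices 1000..1023 are out of bounds
    cudaError_t err = cudaDeviceSynchronize();

    // A plain run may well print "no error" because the stray writes still land
    // inside the rounded-up allocation; cuda-memcheck flags them regardless.
    printf("sync result: %s\n", cudaGetErrorString(err));

    cudaFree(d_buf);
    return 0;
}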

When running under control of cuda-memcheck or cuda-gdb, it is likely that memory allocations occur at different absolute memory addresses than when running without, which can cause the different symptoms you observe: Either the out-of-bounds accesses fall outside the memory allocations, triggering an exception, or they fall inside the memory allocations (rounded to the next page boundary), causing the code to malfunction (“gets stuck”).
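If you want to see whether that is happening in your case, one cheap check (a sketch, with an arbitrary allocation size) is to print the address of a device allocation in a plain run and then again under cuda-memcheck or cuda-gdb:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void *p = nullptr;
    cudaMalloc(&p, 1 << 20);
    // Compare this address between a plain run and a run under cuda-memcheck
    // or cuda-gdb; a shift suggests the tool is changing the allocation layout.
    printf("device allocation at %p\n", p);
    cudaFree(p);
    return 0;
}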

As to why the code works fine on a K40 but not on an RTX 2080, there are several plausible hypotheses I can offer off the top of my head. One is that there has always been a latent out-of-bounds issue in this code, except things “happened to work” / “you got lucky” on the K40 because differences between that GPU architecture and Turing (different instruction sequences, potential differences in memory allocation) caused the effects to be benign on the K40. A second hypothesis is that there is an architecture-specific code-generation issue that negatively affects Turing, i.e. a compiler bug. A third hypothesis is that there is some code idiom used that does not strictly conform to the CUDA programming model, “happened to work” in the past, but doesn’t any more due to architectural changes in the latest GPU architectures, and that the observable out-of-bounds accesses are a consequence (follow-on error) of that non-conformity.
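One well-known instance of the third hypothesis, offered purely as an illustration rather than a diagnosis, is implicit warp-synchronous programming: code that relied on warps executing in lockstep can break under the independent thread scheduling introduced with Volta and carried into Turing, while still appearing to work on Kepler and Pascal. The conforming form uses the explicit *_sync primitives, e.g. a warp-level sum:

#include <cstdio>
#include <cuda_runtime.h>

// Warp-level reduction written against the current programming model:
// every shuffle carries an explicit member mask instead of assuming lockstep.
__global__ void warpSum(const float *in, float *out) {
    float v = in[threadIdx.x];
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    if (threadIdx.x == 0)
        *out = v;    // lane 0 holds the sum of all 32 inputs
}

int main() {
    float h_in[32], h_out = 0.0f;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    warpSum<<<1, 32>>>(d_in, d_out);   // one full warp
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %f (expected 32)\n", h_out);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}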

If this were my code, I would do a detailed code review of the kernel in question, looking for non-conformance with the CUDA programming model, machine-specific tricks, off-by-one indexing errors, and generally anything that looks suspicious, questionable, or “fishy”. I would use code bisection and code simplification to try to narrow down the proximate source of the out-of-bounds accesses. To look into the hypothesis of a compiler bug, I would try reducing the machine-specific optimizations with -Xptxas -O{3|2|1|0}, where the compiler defaults to -Xptxas -O3. Note that if the issue disappears at lower PTXAS optimization levels, this is not yet proof of a compiler bug. One would have to do a detailed back-annotation and review of the generated machine code to have a definite indication of a bug, which is a task that requires skill, practice, and patience.
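While bisecting, one cheap instrumentation that can pin down an offending index is a device-side assert on every computed index before it is used. A sketch (the kernel, buffer, and grid shape are placeholders, and the assert only fires in builds without -DNDEBUG):

#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for a suspect kernel: assert the index before every access so a
// bad index is reported with its block/thread coordinates instead of causing a
// silent out-of-bounds write.
__global__ void suspectKernel(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    assert(i >= 0 && i < n);   // failing threads trap; the host then sees cudaErrorAssert
    buf[i] = 0.0f;
}

int main() {
    const int n = 1 << 16;
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, n * sizeof(float));

    // Deliberately launch one block too many so the assert has something to catch.
    suspectKernel<<<(n / 256) + 1, 256>>>(d_buf, n);
    cudaError_t err = cudaDeviceSynchronize();
    printf("sync result: %s\n", cudaGetErrorString(err));

    cudaFree(d_buf);
    return 0;
}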