An illegal memory access issue caused by multiple kernels

I just found an illegal memory access issue in CUDA caused by a bug in my own code. The bug has been fixed, but I still find the behavior intriguing and would like to understand the details if possible.

The essential problem: while launching a series of CUDA kernels sequentially, I made a mistake that caused several kernels to write data to the same address. At a certain point, CUDA reported an “illegal memory access”, which seemed to make sense, since that is surely an annoying thing to do to the GPU. I fixed the defect in my memory management module to ensure these kernels operate on separate memory chunks, and the problem is gone.

The parts I still have questions about:

  1. I only use the default stream on the CUDA context, which means these kernels execute in order. I don’t quite understand why these non-concurrent operations would cause so much trouble.
  2. CUDA seems to tolerate one or two of these overlapping kernels; the “illegal memory access” only appears once quite a few kernels are writing to the same memory. (I don’t have an exact number yet.)
  3. Another theory I can think of: in those overlapping kernels, the same memory address is treated as different data types (mostly int, float, vec3, vec4). Maybe that’s what makes CUDA angry. A simplified sketch of the scenario follows this list.
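To make the scenario concrete, here is a simplified sketch of the shape of my mistake; the kernel names and sizes are made up for illustration. Several kernels, all on the default stream, write to the same allocation, each treating it as a different element type. As far as I can tell, this sketch on its own runs cleanly, which is part of why I’m confused:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the real kernels: each writes to the same
// buffer, interpreting it as a different element type.
__global__ void writeInt(int* p)     { p[threadIdx.x] = 1; }
__global__ void writeFloat(float* p) { p[threadIdx.x] = 2.0f; }
__global__ void writeVec4(float4* p) { p[threadIdx.x] = make_float4(3.f, 3.f, 3.f, 3.f); }

int main() {
    void* buf = nullptr;
    cudaMalloc(&buf, 64 * sizeof(float4)); // 1024 bytes, aligned for any type

    // All three kernels target the same address range, but on the default
    // stream they run strictly one after another: a well-defined sequence
    // of overwrites, not a data race.
    writeInt<<<1, 64>>>(static_cast<int*>(buf));
    writeFloat<<<1, 64>>>(static_cast<float*>(buf));
    writeVec4<<<1, 64>>>(static_cast<float4*>(buf));

    cudaError_t err = cudaDeviceSynchronize();
    printf("%s\n", cudaGetErrorString(err)); // expected: "no error"
    cudaFree(buf);
    return 0;
}
```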

Due to the complexity of my dev environment, I have not managed to attach CUDA-GDB or the memory checker yet. From the description above, the issue may be related to how CUDA schedules kernels and switches context between them. Any insight would be much appreciated.
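For what it’s worth, the closest I can get to a diagnosis right now is checking the error status after every launch. A minimal sketch of that technique, where kernelA, kernelB, and checkLaunch are hypothetical placeholders for my real pipeline:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelA(float* p) { p[threadIdx.x] = 1.0f; }
__global__ void kernelB(float* p) { p[threadIdx.x] += 1.0f; }

// Report launch-time errors, then synchronize to surface execution-time
// errors (such as an illegal memory access) under a known label.
static void checkLaunch(const char* label) {
    cudaError_t e = cudaGetLastError();
    if (e == cudaSuccess) e = cudaDeviceSynchronize();
    if (e != cudaSuccess)
        printf("%s failed: %s\n", label, cudaGetErrorString(e));
}

int main() {
    float* d = nullptr;
    cudaMalloc(&d, 64 * sizeof(float));
    kernelA<<<1, 64>>>(d); checkLaunch("kernelA");
    kernelB<<<1, 64>>>(d); checkLaunch("kernelB");
    cudaFree(d);
    return 0;
}
```

One thing I have noticed: after the first illegal memory access the context is broken and every subsequent runtime call fails with the same error, so only the first reported failure points at the actual culprit.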

Thanks.

I am pretty sure you are misdiagnosing the issue. Kernels in the same stream always execute in the order in which they were launched. The kernel launch also serves as a memory fence between kernels in the same stream, i.e., if kernel A writes to address X and a subsequent kernel B reads from or writes to address X, there is no read-after-write or write-after-write hazard.
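As a concrete illustration of that guarantee (a minimal sketch, not your code): kernel B cannot start until kernel A has completed and its writes are visible, so the sequence below is fully deterministic.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void writeX(int* x) { *x = 42; }                  // kernel A: write
__global__ void readX(const int* x, int* out) { *out = *x; } // kernel B: read

int main() {
    int *x = nullptr, *out = nullptr, h = 0;
    cudaMalloc(&x, sizeof(int));
    cudaMalloc(&out, sizeof(int));
    // Same (default) stream: launch order fixes execution order, so the
    // read in kernel B always observes the value written by kernel A.
    writeX<<<1, 1>>>(x);
    readX<<<1, 1>>>(x, out);
    cudaMemcpy(&h, out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", h); // always prints 42
    cudaFree(x); cudaFree(out);
    return 0;
}
```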

Check for out-of-bounds indexing leading to out-of-bounds memory accesses, any use of uninitialized data, and race conditions inside each kernel (cuda-memcheck can find some, but not all, instances of race conditions).
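A hypothetical example of the first item, which is by far the most common culprit: the grid is rounded up to a multiple of the block size, and threads past the logical element count write out of bounds.

```cpp
// Buggy: with n = 100 and a block of 128 threads, threads 100..127
// write past the end of a 100-element allocation.
__global__ void scaleBad(float* p, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    p[i] *= s;            // no bounds check: threads with i >= n write OOB
}

// Fixed: guard every access against the logical element count.
__global__ void scaleGood(float* p, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= s; // excess threads simply do nothing
}
```

Note that a kernel like scaleBad may appear to run fine, because allocations are padded, which is exactly why cuda-memcheck (run as cuda-memcheck ./app) is worth the setup effort: it reports the out-of-bounds store with the kernel name and offending address. Its initcheck tool (--tool initcheck) flags reads of uninitialized device global memory, and racecheck (--tool racecheck) covers shared-memory hazards.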

Thanks for confirming the execution order on the same stream. I will dig deeper into my system to see what else could be wrong.