An illegal memory access issue caused by multiple kernels

I just found an illegal memory access issue in CUDA caused by a bug in my own code. The bug has been fixed, but I still find it intriguing and want to understand the details if possible.

The essential problem is that while launching a series of CUDA kernels sequentially, I made a mistake that caused several kernels to write data to the same address. At a certain point, CUDA reported "illegal memory access", which made sense to me: that is surely an unpleasant job for the GPU. So I fixed the defect in my memory management module to ensure these kernels operate on separate memory chunks, and the problem is gone.

The parts I still have questions about:

  1. I only use the default stream on the CUDA context, which means these kernels execute in order. I don't quite understand why these non-concurrent operations would cause so much trouble.
  2. It seems CUDA can tolerate one or two of these overlapping kernels; the "illegal memory access" only appears once quite a few kernels try to write to the same memory. (I don't have an exact number yet.)
  3. Another theory I can think of: in those overlapping kernels, the same memory address is treated as different types of data (mostly int, float, vec3, vec4). Maybe that's what makes CUDA angry.

Due to the complexity of my dev environment, I have not successfully attached CUDA-GDB or the memory checker yet. From the above description, it may be related to how CUDA schedules kernels and switches context between them. Any insight would be much appreciated.
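For reference, here is a minimal sketch of the pattern I described. Kernel and variable names are hypothetical, not my actual code; the point is that several kernels launched back-to-back on the default stream end up writing to the same allocation, sometimes through different types:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels standing in for the real ones.
__global__ void writeAsFloat(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = 1.0f;
}

__global__ void writeAsInt(int *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = 42;
}

int main() {
    const int n = 1 << 20;
    const int block = 256, grid = (n + block - 1) / block;

    float *buf;
    cudaMalloc(&buf, n * sizeof(float));

    // Both launches go to the default stream, so they run one after
    // the other -- yet they target the same memory chunk, the second
    // reinterpreting it as a different type.
    writeAsFloat<<<grid, block>>>(buf, n);
    writeAsInt<<<grid, block>>>(reinterpret_cast<int *>(buf), n);

    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}
```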


I am pretty sure you are misdiagnosing the issue. Kernels in the same stream always execute in the order in which they were launched. The kernel launch also serves as a memory fence between kernels in the same stream: if kernel A writes to address X, and a subsequent kernel B reads from or writes to address X, there is no read-after-write or write-after-write hazard.
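To illustrate the ordering guarantee, here is a small sketch (kernel names are made up): two dependent kernels launched on the default stream need no explicit synchronization between them, because the second is guaranteed to observe the first one's writes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void producer(int *x) { *x = 123; }

// Reads the value the previous kernel wrote to the same address.
__global__ void consumer(const int *x, int *out) { *out = *x + 1; }

int main() {
    int *d;
    cudaMalloc(&d, 2 * sizeof(int));

    // Same (default) stream: consumer starts only after producer has
    // finished, and it sees producer's write to d[0].
    producer<<<1, 1>>>(d);
    consumer<<<1, 1>>>(d, d + 1);

    int h[2];
    cudaMemcpy(h, d, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d %d\n", h[0], h[1]);  // expected: 123 124
    cudaFree(d);
    return 0;
}
```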

Check for out-of-bounds indexing leading to out-of-bounds memory accesses, any use of uninitialized data, and race conditions inside each kernel (cuda-memcheck can find some, but not all, instances of race conditions).
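The most common out-of-bounds pattern is worth spelling out. Since the grid size is rounded up, the last block usually contains threads whose index is past the end of the array, and a missing bounds guard turns those into exactly the intermittent "illegal memory access" described above. A sketch:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *p, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Without this guard, threads in the last, partially full block
    // would index past the end of the allocation. Whether that
    // actually faults depends on where the allocation lands, which is
    // why the error can appear only sometimes.
    if (i < n) p[i] *= s;
}

int main() {
    const int n = 1000;                       // not a multiple of 256
    const int block = 256;
    const int grid = (n + block - 1) / block; // 4 blocks = 1024 threads

    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // 24 threads have i >= n; the guard keeps them from writing.
    scale<<<grid, block>>>(d, n, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```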

Thanks for confirming the execution order on the same stream. I will dig deeper into my system to see what else could be wrong.