I’ve been tearing my hair out trying to track down this bizarre bug where I keep getting stack overflow errors in my kernels. This is despite the code in question being non-recursive and the sum of locally allocated data not exceeding the stack size limit.
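(For context, the limit I’m comparing against is the per-thread stack size the CUDA runtime enforces. The snippet below is just the standard runtime query, not my actual code; the 8 KB figure is purely an example.)

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t stackBytes = 0;

    // Per-thread stack limit currently enforced by the runtime
    // (typically 1 KB per thread by default).
    cudaDeviceGetLimit(&stackBytes, cudaLimitStackSize);
    printf("cudaLimitStackSize = %zu bytes per thread\n", stackBytes);

    // The limit can be raised, e.g. to 8 KB, but that would only paper over
    // whatever is actually blowing the stack here:
    // cudaDeviceSetLimit(cudaLimitStackSize, 8 * 1024);

    return 0;
}
```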
In attempting to debug the problem, I created an isolated test case: I copied the broken kernel and all of its dependencies into their own source files, separate from the rest of the project. All calls into the remaining code were commented out so that it was unreachable; however, the compiler was still building and linking it into the final debug executable.
I then took the source files of my isolated test case and dropped them into a brand-new project. The crashes stopped. In other words, the code only crashes when it’s compiled “in situ”.
I went back and bisected the original project, progressively removing compilation units until I had narrowed the problem down to a single source file. I then removed chunks of code from that file to narrow things down still further.
Unless I’ve overlooked something, it appears that classes using virtual inheritance are somehow corrupting the stack frames of code elsewhere in the program. I confirmed this by making the inheritance non-virtual, after which the crashes stopped. Again, this is unreachable code that never executes and would most likely be optimised away in a release build.
In a way this makes a strange sort of sense, given that nvcc attempts to estimate stack frame sizes during compilation and virtual functions sometimes prevent it from doing so. I recently started using virtual inheritance because it offered a more convenient way of reusing code on both the host and the device, and it was around that time that the stack overflow problems began.
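To give a concrete (if heavily simplified) picture of the pattern I mean, the sketch below uses made-up names; the real hierarchy is more involved:

```cpp
// Illustrative only: the names and bodies are placeholders for the real classes.
struct Sampler {
    __host__ __device__ virtual float sample(float x) const { return x; }
    __host__ __device__ virtual ~Sampler() {}
};

// Virtual inheritance so that classes further down a diamond share a single
// Sampler subobject; the methods are __host__ __device__ so the same code
// can be reused on both sides.
struct LinearSampler : virtual Sampler {
    __host__ __device__ float sample(float x) const override { return 2.0f * x; }
};

// In the real project, changing ": virtual Sampler" to ": Sampler" is enough
// to make the stack overflows disappear, even though this code is unreachable
// and never instantiated by the crashing kernel.
```

For what it’s worth, my understanding is that the estimate in question is the per-kernel “bytes stack frame” figure that ptxas reports when compiling with -Xptxas -v.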
I don’t really know what nvcc is doing under the hood, so this feels like a bit of a shot in the dark.
Does this explanation sound remotely plausible?
My system’s running a GeForce RTX 2080, Windows 10, CUDA 11.8, Visual Studio 2022.