Just like a CPU, each CUDA thread maintains a stack. The stack is of fixed size. If your code behaves in such a way that it uses more stack space than that fixed size, it will result in a stack overflow. The machine is not ordinarily guaranteed to automatically catch such an error the moment it happens (although the resultant corrupted program behavior may eventually trigger a machine fault of some kind), but the additional instrumentation in compute-sanitizer
can detect it. here is a general description, not GPU-specific.
Along with the above general description, I would pay close attention to any compiler warning messages that are emitted when compiling your code. Sometimes compiler warning messages take the form of “… stack size … cannot be statically determined …” Any such messages may provide additional clues. I’m not suggesting that the compiler would always emit such a warning message, or that it is doing so in your case. However if there are any such warning messages, they may provide additional clues.
Here is an example program that triggers that stack overflow error in compute-sanitizer
:
# cat t48a.cu
#include <cstdio>
#include <iostream>
#include <cstdlib>
__device__ int a(int m, int n);
__global__ void k(int x, int y){
printf("%d\n", a(x,y));
}
int main(int argc, char *argv[]){
int val = 9; // 3 or higher seems to be enough to trigger the fault
if (argc > 1) val = atoi(argv[1]);
k<<<1,1>>>(val,val);
cudaError_t err = cudaDeviceSynchronize();
std::cout << cudaGetErrorString(err) << std::endl;
}
# cat t48b.cu
__device__ int a(int m, int n){
if (m == 0)
return n + 1;
if ((m > 0) && (n == 0))
return a(m-1,1);
else
return a(m-1,a(m,n-1));
}
# nvcc -rdc=true t48a.cu t48b.cu -o t48 -lineinfo
nvlink warning : Stack size for entry function '_Z1kii' cannot be statically determined
# compute-sanitizer ./t48
========= COMPUTE-SANITIZER
========= Stack overflow
========= at 0x10 in /root/bobc/t48b.cu:1:a(int, int)
========= by thread (0,0,0) in block (0,0,0)
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x390]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x390]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x310]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48b.cu:7:a(int, int) [0x1d0]
========= Device Frame:/root/bobc/t48a.cu:6:k(int, int) [0x80]
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x32e950]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:libcudart_static_4d8b33a106dceb3c07a56e26de61f2d53bb62a68 [0x1093e]
========= in /root/bobc/./t48
========= Host Frame:cudaLaunchKernel [0x70b4e]
========= in /root/bobc/./t48
========= Host Frame:cudaError cudaLaunchKernel<char>(char const*, dim3, dim3, void**, unsigned long, CUstream_st*) [0xb14e]
========= in /root/bobc/./t48
========= Host Frame:__device_stub__Z1kii(int, int) [0xb024]
========= in /root/bobc/./t48
========= Host Frame:k(int, int) [0xb05f]
========= in /root/bobc/./t48
========= Host Frame:main [0xae53]
========= in /root/bobc/./t48
========= Host Frame: [0x29d90]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:__libc_start_main [0x29e40]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:_start [0xab05]
========= in /root/bobc/./t48
=========
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame: [0x47e786]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame:cudaDeviceSynchronize [0x48a64]
========= in /root/bobc/./t48
========= Host Frame:main [0xae58]
========= in /root/bobc/./t48
========= Host Frame: [0x29d90]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:__libc_start_main [0x29e40]
========= in /lib/x86_64-linux-gnu/libc.so.6
========= Host Frame:_start [0xab05]
========= in /root/bobc/./t48
=========
unspecified launch failure
========= ERROR SUMMARY: 2 errors
#
(CUDA 12.2)
NOTES:
-
I lifted the recursive function definition from here (also see here).
-
The compiler seems to be quite good at converting my naive attempts at recursion into a partially unrolled loop.
-
In the above example, the error message printout at the end (“unspecified launch failure”) may change to some other message (e.g. “an illegal memory access was encountered”) if the code is not run under
compute-sanitizer
. This is an indication thatcompute-sanitizer
instruments the code and machine in such a way that the fault is detected differently. But in the general case I know of no guarantees that a runtime fault will be triggered in the event of stack overflow, without the use of a tool likecompute-sanitizer
. YMMV