Meaning/occurrence of 'CUDA_EXCEPTION_9: Warp Hardware Stack Overflow'

I repeatedly get the error message ‘CUDA_EXCEPTION_9: Warp Hardware Stack Overflow’ in one of my methods when trying to use malloc on the device. In a very basic test case, malloc works, but in my actual function it crashes. Unfortunately, the cuda-gdb manual does not explain what may cause this error or how to fix it. The only description given is:

I have tried to increase the possible malloc size using

cudaDeviceSetLimit(cudaLimitMallocHeapSize, testSize);

and using the equivalent …GetLimit() confirms the changed size. Since the amount of data required in my test cases isn’t very large anyway, I doubt a lack of heap memory directly causes the problem. Running under cuda-memcheck within cuda-gdb gave the same results.
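For reference, the set-and-confirm step described above can be sketched like this (a minimal host-side sketch; the 128 MB value is an assumed placeholder for testSize, and error checking is reduced to the essentials):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Assumption: testSize stands in for whatever limit the poster used.
    size_t testSize = 128 * 1024 * 1024;  // 128 MB

    // Must be called before the first kernel launch in the context,
    // otherwise the new limit does not take effect.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, testSize);
    if (err != cudaSuccess) {
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Read the limit back to confirm it was applied.
    size_t confirmed = 0;
    cudaDeviceGetLimit(&confirmed, cudaLimitMallocHeapSize);
    printf("device malloc heap size: %zu bytes\n", confirmed);
    return 0;
}
```

Note that cudaLimitMallocHeapSize only bounds the device-side malloc heap; it has no effect on the per-thread hardware stack, which is governed separately by cudaLimitStackSize.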

I am using a GTX 460 with CUDA 4.0, OS: Ubuntu 10.04 LTS 64bits;

NVRM version: NVIDIA UNIX x86_64 Kernel Module 280.13 Wed Jul 27 16:53:56 PDT 2011

For the time being it would already be quite helpful for me to understand the precise meaning of the error message (including possible implications for blocks/threads, …). Why does the manual say “This should be a rare occurrence” — what would be expected?

Can you share your code?

Sorry, it took me some time to do some more checks…

I don’t have a small example demonstrating the same effect, and I cannot share the entire code. The error occurs on one of the lines inside the if-block (usually the first one):

#define MAX_NUM_BLOCKS 20000

__device__ int* dev_data_list_ptr[MAX_NUM_BLOCKS];
__device__ int* dev_start_param_list[MAX_NUM_BLOCKS];

__global__ void myKernelFunction()
{
    // a lot of stuff

    int totalNum = [a value between 0 and ~200];

    if (threadIdx.x == 0 && totalNum > 0)
    {
        dev_data_list_ptr[blockIdx.x] = (int*)malloc(totalNum * sizeof(int));
        dev_start_param_list[blockIdx.x] = (int*)malloc(totalNum * sizeof(int));
    }

    // write results to created arrays
}
The actual number of blocks created depends on previous calculations. I have tried to find out how much memory is allocated across all threads by summing up all totalNums in a global variable via atomicAdd(). The result was 82080, i.e. 82080 * 2 * sizeof(int) = 656,640 bytes should have been allocated in total, which is not very much.
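One way to narrow this down further: checking the return value of device-side malloc distinguishes genuine heap exhaustion (malloc returns NULL) from an unrelated fault that merely surfaces at the allocation line. Below is a minimal sketch, reusing the globals from the listing above; myKernelFunctionChecked is a hypothetical variant, and totalNum is passed in only for illustration:

```cpp
__global__ void myKernelFunctionChecked(int totalNum)
{
    if (threadIdx.x == 0 && totalNum > 0)
    {
        int* data   = (int*)malloc(totalNum * sizeof(int));
        int* params = (int*)malloc(totalNum * sizeof(int));

        if (data == NULL || params == NULL)
        {
            // Device-side printf is available on compute capability 2.0+
            // (the GTX 460 is Fermi, so this works).
            printf("block %d: malloc failed for %d ints\n",
                   blockIdx.x, totalNum);
            return;
        }

        dev_data_list_ptr[blockIdx.x]    = data;
        dev_start_param_list[blockIdx.x] = params;
    }
}
```

If the kernel still raises CUDA_EXCEPTION_9 without ever printing a failure, that would support the suspicion that the malloc heap is not the real culprit.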

Perhaps the error message is a strange side effect of another error, so it would already be helpful if you could tell me more about the circumstances that lead to this particular message (CUDA_EXCEPTION_9: Warp Hardware Stack Overflow).