hardware stack error?

One of my kernels is now throwing a CUDA_ERROR_HARDWARE_STACK_ERROR on Kepler devices targeting sm_30. This is with CUDA 7.0, compiled in 32-bit mode (both the host application and the kernel code).

I am wondering what might have caused this error to occur. Is there anything the programmer can control in terms of assigned stack size?

There is not much to be learned from the PTXAS output, I guess.

ptxas info    : Compiling entry function '_Z31Transmission_Kernel_MassiveMIMOjjjjfjPKfS0_PK12TransmissionS3_PK16Precoding_VectorS6_PKjS8_S8_S8_S8_S0_PKhPKySC_jjffffP8MimoSINRPf' for 'sm_30'
ptxas info    : Function properties for _Z31Transmission_Kernel_MassiveMIMOjjjjfjPKfS0_PK12TransmissionS3_PK16Precoding_VectorS6_PKjS8_S8_S8_S8_S0_PKhPKySC_jjffffP8MimoSINRPf
    176 bytes stack frame, 332 bytes spill stores, 232 bytes spill loads
ptxas info    : Used 63 registers, 7272 bytes smem, 468 bytes cmem[0], 108 bytes cmem[2], 8 textures

This is what cuda-memcheck gives me

========= Hardware Stack Overflow
=========     at 0x00002fe0 in Transmission_Kernel_MassiveMIMO(unsigned int, unsigned int, unsigned int, unsigned int, float, unsigned int, float const *, float const *, Transmission const *, Transmission const *, Precoding_Vector const *, Precoding_Vector const *, unsigned int const *, unsigned int const *, unsigned int const *, unsigned int const *, unsigned int const *, float const *, unsigned char const *, __int64 const *, __int64 const *, unsigned int, unsigned int, float, float, float, float, MimoSINR*, float*)
=========     by thread (0,3,0) in block (1,0,0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 [0x2b3914]
=========     Host Frame:[0x51934d00]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 [0x2b3914]
=========     Host Frame:[0x51934c00]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaGetLastError. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 [0x2b3914]
Transmission_Kernel_MassiveMIMO CUDA error: unspecified launch failure
=========     Host Frame:[0x51934c00]
=========

The same code still runs fine on Maxwell targeting sm_50.

Christian

Argh, even going from 256 down to just 32 threads per block did not fix the problem. I believe I’ve triggered some compiler bug that makes this go into infinite recursion or something…

I can’t say I have ever encountered it, but as a first check you might want to examine the current stack setting by calling cudaDeviceGetLimit() with cudaLimitStackSize, which returns the per-thread stack size currently in effect (the default, unless it has been changed). According to the Programming Guide, the maximum amount of local memory allowed is 512 KB per thread across all architectures.
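In case it helps, a minimal sketch of that query (assuming the runtime API, with error handling kept deliberately simple):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query the per-thread stack size currently configured on the device.
    size_t stackSize = 0;
    cudaError_t err = cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    if (err != cudaSuccess) {
        printf("cudaDeviceGetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("per-thread stack size: %zu bytes\n", stackSize);
    return 0;
}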

(1) Does the launch configuration specify a huge number of threads, thus possibly exhausting memory? The total local memory required is at least the product of the number of resident threads and the local memory usage per thread (likely more, due to allocation granularity; this granularity could be architecture dependent).

(2) Is the kernel called recursively, or does it invoke recursive device functions, thereby possibly exhausting the pre-allocated per-thread stack memory through unexpected recursion depth? (See the sketch after this list.)

(3) Is it possible there is an out-of-bounds access to local memory (e.g. when indexing into a thread-local array)? (Also illustrated in the sketch below.)
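To make (2) and (3) concrete, here is a purely hypothetical sketch of the two patterns; the names are made up and this is not taken from your kernel:

// Hypothetical illustration only -- not code from the kernel in question.

// (2) Data-dependent recursion: every level adds a stack frame, so an
//     unexpected depth can exhaust the pre-allocated per-thread stack.
__device__ float sum_recursive(const float *data, int n)
{
    if (n <= 0) return 0.0f;
    return data[0] + sum_recursive(data + 1, n - 1);   // recursion depth == n
}

// (3) Out-of-bounds access to a thread-local array: the array lives in
//     local memory (the thread's stack frame), so writing past its end
//     can clobber the stack.
__global__ void local_array_oob(const int *index, float *out)
{
    float scratch[16];
    scratch[index[threadIdx.x]] = 1.0f;   // undefined behavior if the index is >= 16
    out[threadIdx.x] = scratch[0];
}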

What happens if you try to increase the per-thread stack limit with cudaDeviceSetLimit()? I am also wondering whether additional restrictions may apply to code compiled in 32-bit mode due to the 4 GB address space limit; I haven’t used 32-bit CUDA applications in many years.
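Something along these lines, with 4096 bytes as an arbitrary value to experiment with rather than a recommendation:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Example value only: raise the per-thread stack to 4 KB before launching.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, 4096);
    if (err != cudaSuccess) {
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // ... launch the kernel here, then check cudaGetLastError() and the
    //     status returned by cudaDeviceSynchronize() ...
    return 0;
}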