One of my kernels is now throwing CUDA_ERROR_HARDWARE_STACK_ERROR on Kepler devices when targeting sm_30. This is with CUDA 7.0, compiled in 32-bit mode (both the host application and the kernel code).
I am wondering what might have caused this error to occur. Is there anything the programmer can control in terms of assigned stack size?
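The only stack-related setting I am aware of is the per-thread `cudaLimitStackSize` device limit in the runtime API. Below is a minimal sketch of querying and raising it before the kernel launch. I am assuming this limit has something to do with the stack that overflows here, which I am not sure about. The helper name `raise_stack_limit` and the 4096-byte value are arbitrary and not taken from my application.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of my application): requests a new
// per-thread stack size limit on the current device and returns the status.
static cudaError_t raise_stack_limit(size_t bytes)
{
    return cudaDeviceSetLimit(cudaLimitStackSize, bytes);
}

int main()
{
    size_t stack_bytes = 0;

    // Query the current per-thread stack size limit.
    cudaError_t err = cudaDeviceGetLimit(&stack_bytes, cudaLimitStackSize);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("current per-thread stack limit: %zu bytes\n", stack_bytes);

    // Request a larger limit. 4096 is an arbitrary value for illustration;
    // the driver may adjust the requested size.
    err = raise_stack_limit(4096);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Read the limit back to see what was actually applied.
    err = cudaDeviceGetLimit(&stack_bytes, cudaLimitStackSize);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetLimit: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("new per-thread stack limit: %zu bytes\n", stack_bytes);

    // The kernel launch would go here, after the limit has been set.
    return 0;
}
```

The limit has to be set before the kernel in question is launched. Whether a larger value makes the error go away on sm_30 is exactly what I do not know.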
There is not much to be learned from the ptxas output, I guess:
ptxas info : Compiling entry function '_Z31Transmission_Kernel_MassiveMIMOjjjjfjPKfS0_PK12TransmissionS3_PK16Precoding_VectorS6_PKjS8_S8_S8_S8_S0_PKhPKySC_jjffffP8MimoSINRPf' for 'sm_30'
ptxas info : Function properties for _Z31Transmission_Kernel_MassiveMIMOjjjjfjPKfS0_PK12TransmissionS3_PK16Precoding_VectorS6_PKjS8_S8_S8_S8_S0_PKhPKySC_jjffffP8MimoSINRPf
176 bytes stack frame, 332 bytes spill stores, 232 bytes spill loads
ptxas info : Used 63 registers, 7272 bytes smem, 468 bytes cmem[0], 108 bytes cmem[2], 8 textures
This is what cuda-memcheck gives me:
========= Hardware Stack Overflow
========= at 0x00002fe0 in Transmission_Kernel_MassiveMIMO(unsigned int, unsigned int, unsigned int, unsigned int, float, unsigned int, float const *, float const *, Transmission const *, Transmission const *, Precoding_Vector const *, Precoding_Vector const *, unsigned int const *, unsigned int const *, unsigned int const *, unsigned int const *, unsigned int const *, float const *, unsigned char const *, __int64 const *, __int64 const *, unsigned int, unsigned int, float, float, float, float, MimoSINR*, float*)
========= by thread (0,3,0) in block (1,0,0)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 [0x2b3914]
========= Host Frame:[0x51934d00]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 [0x2b3914]
========= Host Frame:[0x51934c00]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaGetLastError.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/i386-linux-gnu/libcuda.so.1 [0x2b3914]
Transmission_Kernel_MassiveMIMO CUDA error: unspecified launch failure
========= Host Frame:[0x51934c00]
=========
The same code still runs fine on Maxwell targeting sm_50.
Christian