Hiho =) I have a problem:
I’m compiling for compute capability 5.2 on a GTX 970 in Visual Studio 2013. I have a kernel in which I allocate a 3D float array in shared memory. Its dimensions are the block dimensions + 2 in each direction, so with 8x8x8 threads per block it allocates 10x10x10 floats x 4 bytes = 4000 bytes of shared memory.
I want to have 32x8x4 threads per block, which would need about 8 KB of shared memory per block (34x10x6 floats x 4 bytes = 8160 bytes). I thought my device could use up to 48 KB per block, but with the 32x8x4 configuration the kernel doesn’t launch in release mode (in debug mode everything works fine).
I tried several configurations and found that 6240 bytes of shared memory work (13x11x6 threads), but 6336 bytes don’t (10x10x9 threads). Without shared memory the rest of the code works fine, so I think that is where the problem lies.
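To make the numbers above easy to check, here is a minimal host-side sketch of the arithmetic. The names `haloTileBytes`, `threadsPerBlock` and `kAssumedSharedLimit` are made up for this post and are not part of the actual kernel; the 48 KB value is simply the per-block limit assumed above. It only reproduces the sizes, it does not explain the launch failure.

```cpp
#include <cstddef>

// Hypothetical helper (not from the real kernel): bytes of shared memory for a
// float tile that is 2 elements larger than the block in each dimension.
constexpr std::size_t haloTileBytes(std::size_t bx, std::size_t by, std::size_t bz) {
    return (bx + 2) * (by + 2) * (bz + 2) * sizeof(float);
}

// Number of threads in one block for the same configuration.
constexpr std::size_t threadsPerBlock(std::size_t bx, std::size_t by, std::size_t bz) {
    return bx * by * bz;
}

// Assumption taken from the post: 48 KB of shared memory usable per block.
constexpr std::size_t kAssumedSharedLimit = 48 * 1024;

// Compile-time checks of the figures quoted in the post.
static_assert(haloTileBytes(8, 8, 8) == 4000, "8x8x8 block");
static_assert(haloTileBytes(32, 8, 4) == 8160, "32x8x4 block, about 8 KB");
static_assert(haloTileBytes(13, 11, 6) == 6240, "13x11x6 block (launches)");
static_assert(haloTileBytes(10, 10, 9) == 6336, "10x10x9 block (does not launch)");

// Every tested tile is far below the assumed 48 KB limit.
static_assert(haloTileBytes(32, 8, 4) < kAssumedSharedLimit, "well under 48 KB");

// The thread counts of the tested configurations differ as well.
static_assert(threadsPerBlock(13, 11, 6) == 858, "working configuration");
static_assert(threadsPerBlock(10, 10, 9) == 900, "failing configuration");
static_assert(threadsPerBlock(32, 8, 4) == 1024, "desired configuration");
```

All of the tested tiles are well under the assumed 48 KB, and the working and failing configurations differ in thread count (858 vs. 900) as well as in shared memory size.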
I’m looking forward to any advice.