Kernel doesn't launch in release mode, dependent on how much shared mem I allocate...?

Hiho =) I have a problem:
I am compiling for compute capability 5.2 on a GTX 970 in MS VS 2013. I have a kernel in which I allocate a 3D float array in shared memory. Its dimensions are the block dimensions + 2, so when I use 8x8x8 threads per block, it allocates 10x10x10x4 bytes = 4000 bytes of shared memory.
I want to have 32x8x4 threads per block, which would be about 8 kilobytes of shared memory per block (34x10x6x4 bytes = 8160 bytes). I thought that my device can use 48 kilobytes per block, but with the 32x8x4 configuration the kernel doesn’t launch in release mode (in debug mode everything works fine).

I tried some configurations and found that 6240 bytes of shared memory work (13x11x6 threads), but 6336 bytes don’t (10x10x9 threads). Without the shared memory the rest of the code works fine, so I think that is where the problem lies.
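
As a quick sanity check on the sizes quoted in this post, here is a minimal host-side C++ sketch of the arithmetic. The helper name haloTileBytes and the constant kSharedLimitBytes are hypothetical and do not come from the original kernel. The sketch assumes a float tile with one extra element on each side of every dimension, as the post describes, and a 4-byte float.

```cpp
#include <cstddef>

// Shared-memory limit per block mentioned in the post (48 kilobytes).
constexpr std::size_t kSharedLimitBytes = 48 * 1024;

// Bytes needed for a float tile of (bx + 2) x (by + 2) x (bz + 2) elements,
// i.e. the block dimensions plus a one-element border on each side.
// Hypothetical helper for illustration only.
constexpr std::size_t haloTileBytes(std::size_t bx, std::size_t by, std::size_t bz) {
    return (bx + 2) * (by + 2) * (bz + 2) * sizeof(float);
}

// The four configurations mentioned in the post.
static_assert(haloTileBytes(8, 8, 8) == 4000, "8x8x8 threads");
static_assert(haloTileBytes(32, 8, 4) == 8160, "32x8x4 threads");
static_assert(haloTileBytes(13, 11, 6) == 6240, "13x11x6 threads (works)");
static_assert(haloTileBytes(10, 10, 9) == 6336, "10x10x9 threads (fails)");

// Every one of them is far below the 48 KB limit.
static_assert(haloTileBytes(32, 8, 4) < kSharedLimitBytes, "fits in shared memory");
```

All four sizes are well under 48 kilobytes, which suggests that the amount of shared memory alone is unlikely to be what prevents the launch.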

I’m looking forward to any advice.

I would suggest providing a short, complete reproducer of the problem. Are you doing proper CUDA error checking? If so, what is the exact error text reported when your kernel fails to launch?

I am wondering if this is just a situation where you are running into the Windows WDDM TDR watchdog.

Okay I found the problem:
I had too many registers per block (69 registers per thread * 1024 threads). Is there a way to see which variables the registers are used for? I deleted 6 floats in my kernel that I didn’t really need and ended up with 2 more registers per thread than before. So it seems to me that the compiler decides how many registers the kernel will have in the end…
Or could I simply compile for compute capability 3.7, which allows 128K registers per multiprocessor instead of 64K?
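
The register numbers in this thread can be checked with a rough host-side estimate, sketched below in plain C++. The 64K registers-per-block limit is the figure from the post. Everything else is an assumption made for illustration: the names are hypothetical, and the sketch assumes that registers are handed out per warp of 32 threads with the per-thread count rounded up to a multiple of 8, which is the granularity the CUDA occupancy calculator lists for this architecture. Treat the result as an approximation, not as what the driver actually computes.

```cpp
#include <cstdint>

constexpr std::uint32_t kWarpSize = 32;
constexpr std::uint32_t kRegsPerBlockLimit = 64 * 1024;  // 64K registers, as quoted in the post
constexpr std::uint32_t kRegGranularity = 8;             // ASSUMPTION: per-thread rounding unit

// Round v up to the next multiple of m.
constexpr std::uint32_t roundUp(std::uint32_t v, std::uint32_t m) {
    return (v + m - 1) / m * m;
}

// Estimated registers consumed by one block: whole warps only, with the
// per-thread register count rounded up to the assumed granularity.
constexpr std::uint32_t registersPerBlock(std::uint32_t threads, std::uint32_t regsPerThread) {
    return roundUp(threads, kWarpSize) * roundUp(regsPerThread, kRegGranularity);
}

constexpr bool fitsRegisterBudget(std::uint32_t threads, std::uint32_t regsPerThread) {
    return registersPerBlock(threads, regsPerThread) <= kRegsPerBlockLimit;
}

// 32x8x4 = 1024 threads at 69 registers: 1024 * 72 = 73728, over the limit.
static_assert(!fitsRegisterBudget(32 * 8 * 4, 69), "1024 threads do not fit");
// 10x10x9 = 900 threads -> 29 warps: 928 * 72 = 66816, over the limit.
static_assert(!fitsRegisterBudget(10 * 10 * 9, 69), "900 threads do not fit");
// 13x11x6 = 858 threads -> 27 warps: 864 * 72 = 62208, under the limit.
static_assert(fitsRegisterBudget(13 * 11 * 6, 69), "858 threads fit");
```

Under these assumptions the estimate is consistent with the configurations reported earlier: 13x11x6 threads launch while 10x10x9 and 32x8x4 do not. By the same estimate, a 1024-thread block fits only if the kernel uses at most 64 registers per thread, since 1024 * 64 is exactly 64K.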

Yes, the compiler decides how many registers to use. You can’t calculate the actual register usage simply by looking at the C/C++ source code.

If you compile your code with -Xptxas -v you will get some additional compiler output indicating register usage. Register usage can also be inspected using one of the profilers (assuming the kernel will launch).

If you’d like to manage register usage directly, there are several approaches. I would suggest looking at the launch bounds directive (__launch_bounds__) in the CUDA programming guide.

Could you explain to me how to compile my code with that “-Xptxas -v”…
I read things like this very often, but I never know how to do this in Visual Studio…

A Google search on “ptxas verbose visual studio” returns this as the first hit:

Okay, it works now with the option
max used registers = 64
Thanks for your help =)