Kernel doesn't launch in release mode, dependent on how much shared mem I allocate...?

catrexis · April 13, 2016, 3:49pm

Hiho =) I have a problem:
I am targeting compute capability 5.2 on a GTX 970 in MS Visual Studio 2013. My kernel allocates a 3D float array in shared memory whose dimensions are the block dimensions + 2, so with 8x8x8 threads per block it allocates 10x10x10 x 4 bytes = 4000 bytes of shared memory.
I want to have 32x8x4 threads per block, which would need about 8 KB of shared memory per block. I thought that my device can use 48 KB per block, but with the 32x8x4 configuration the kernel doesn’t launch in release mode (in debug mode everything works fine).

I tried some configurations and found that 6240 bytes of shared memory work (13x11x6 threads), but 6336 bytes don’t (10x10x9 threads). Without shared memory the rest of the code works fine, so I think that is where the problem lies.
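
Editor's note: the shared memory figures quoted above can be checked with a few lines of host-side arithmetic. The sketch below is plain C++ (no CUDA toolkit needed); the helper names haloTileBytes and threadsPerBlock are made up for illustration, and a 4-byte float is assumed, as in the post.

```cpp
#include <cstddef>

// Assumption taken from the post: one float occupies 4 bytes.
constexpr std::size_t kBytesPerFloat = 4;

// The 48 KB per-block shared memory figure quoted in the post.
constexpr std::size_t kSharedLimitBytes = 48 * 1024;

// Bytes of shared memory for a float tile whose extent is the block
// dimension + 2 in every direction, as described in the post.
constexpr std::size_t haloTileBytes(std::size_t bx, std::size_t by, std::size_t bz) {
    return (bx + 2) * (by + 2) * (bz + 2) * kBytesPerFloat;
}

// Number of threads in a block of the given shape.
constexpr std::size_t threadsPerBlock(std::size_t bx, std::size_t by, std::size_t bz) {
    return bx * by * bz;
}

// Compile-time checks against the numbers reported in the post.
static_assert(haloTileBytes(8, 8, 8) == 4000, "8x8x8 block");
static_assert(haloTileBytes(32, 8, 4) == 8160, "32x8x4 block, about 8 KB");
static_assert(haloTileBytes(13, 11, 6) == 6240, "13x11x6 block (launches)");
static_assert(haloTileBytes(10, 10, 9) == 6336, "10x10x9 block (does not launch)");
static_assert(haloTileBytes(32, 8, 4) < kSharedLimitBytes, "far below 48 KB");
```

All four configurations stay far below 48 KB. The working 13x11x6 block has 858 threads and the failing 10x10x9 block has 900, so the failure may track a per-thread resource rather than the shared memory size itself.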

I’m looking forward to any advice.

Robert_Crovella · April 13, 2016, 10:16pm

I would suggest providing a short, complete reproducer of the problem. Are you doing proper CUDA error checking? If so, what is the exact error text reported when your kernel fails to launch?

I am wondering if this is just a situation where you are running into the Windows WDDM TDR watchdog.

catrexis · April 14, 2016, 2:01pm

Okay I found the problem:
I had too many registers per block (69*1024). Is there a way to see which variables the registers are used for? I deleted 6 floats in my kernel that I didn’t really need and ended up with 2 more registers per thread than before. So it seems to me that the compiler decides how many registers the kernel will have in the end…
Or could I simply compile for compute capability 3.7, which allows 128K registers per multiprocessor instead of 64K?
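
Editor's note: the register budget behind the "69*1024" figure can be sketched the same way. This is host-side C++ under a simplified model: it assumes a limit of 64K (65536) 32-bit registers per thread block for compute capability 5.2 and ignores any rounding the hardware applies when it allocates registers. The function names are hypothetical.

```cpp
#include <cstdint>

// Assumption: 64K 32-bit registers available to one thread block
// (the per-block limit for compute capability 5.2).
constexpr std::uint32_t kRegistersPerBlockLimit = 64 * 1024;

// Simple model: every thread in the block uses the same register count.
// Allocation granularity is deliberately ignored here.
constexpr std::uint32_t registersNeeded(std::uint32_t regsPerThread, std::uint32_t threads) {
    return regsPerThread * threads;
}

// True if the block's total register demand stays within the limit.
constexpr bool fitsInBlock(std::uint32_t regsPerThread, std::uint32_t threads) {
    return registersNeeded(regsPerThread, threads) <= kRegistersPerBlockLimit;
}

// Largest per-thread register count that still fits for a given block size.
constexpr std::uint32_t maxRegsPerThread(std::uint32_t threads) {
    return kRegistersPerBlockLimit / threads;
}

// 32x8x4 = 1024 threads at 69 registers each exceeds the budget.
static_assert(registersNeeded(69, 32 * 8 * 4) == 70656, "69 registers x 1024 threads");
static_assert(!fitsInBlock(69, 1024), "70656 > 65536");
static_assert(maxRegsPerThread(1024) == 64, "cap needed for a 1024-thread block");
```

Under this model a 1024-thread block only fits if the kernel uses at most 64 registers per thread. Real hardware may round the per-thread count up, so the true threshold can be somewhat lower for block sizes that are not a multiple of the warp size.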

Robert_Crovella · April 14, 2016, 4:44pm

Yes, the compiler decides how many registers to use. You can’t calculate the actual register usage simply by looking at the C/C++ source code.

If you compile your code with -Xptxas -v you will get some additional compiler output indicating register usage. Register usage can also be inspected using one of the profilers (assuming the kernel will launch).

If you’d like to manage register usage directly, there are several approaches. I would suggest looking at the launch bounds directive (__launch_bounds__) in the CUDA programming guide.

catrexis · April 14, 2016, 5:15pm

Could you explain to me how to compile my code with that “-Xptxas -v”?
I read things like this very often, but I never know how to do it in Visual Studio…

Robert_Crovella · April 14, 2016, 6:17pm

A Google search on “ptxas verbose visual studio” returns this as the first hit:

How to set CUDA flags in Visual Studio - Stack Overflow

catrexis · April 15, 2016, 1:19pm

Okay, it works now with the option
max used registers = 64
Thanks for your help =)
