Shared memory limits and cudaError_enum: how to precisely determine how much shared memory is used

Hi all,

I’m trying to improve the execution speed of a kernel. To make it faster, I want to use as many threads as possible.

Since I’m using a Quadro FX 4600, the compute capability is 1.0, so the limit is 512 threads per block and 65535 blocks per grid dimension.

When I increase the size of my blocks, I get the error cudaError_enum.

This is certainly related to shared memory, since I’m using the block size to determine how much shared memory the kernel occupies:

At most my block dimensions are (17,17), which corresponds to 289 (< 512) threads per block.

And the grid size is (17408,13056), with each dimension < 65535.
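Expressed as a launch configuration, this is roughly what I do (the kernel name below is just a placeholder for mine):

[codebox]dim3 block(17, 17);        // 289 threads per block, below the 512-thread limit
dim3 grid(17408, 13056);   // both grid dimensions below the 65535 limit
myKernel<<<grid, block>>>( /* ... kernel arguments ... */ );[/codebox]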

#define U16 unsigned short int

// Shared memory
__shared__ float disp_sdata[block_size_x][block_size_y];   // shared memory for the disparity map
__shared__ U16   colorR_sdata[block_size_x][block_size_y];
__shared__ U16   colorG_sdata[block_size_x][block_size_y];
__shared__ U16   colorB_sdata[block_size_x][block_size_y];

So this means that I’m using

  • sizeof(float)*block_size_x*block_size_y + 3*sizeof(unsigned short int)*block_size_x*block_size_y

which is equal to

 4*289 + 3*2*289 = <b>2890 bytes</b>
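To double-check that figure, here is a minimal host-side sketch of the same computation (with block_size_x and block_size_y hard-coded to 17):

[codebox]#include <stdio.h>

#define U16 unsigned short int
#define block_size_x 17
#define block_size_y 17

int main(void)
{
    size_t smem = sizeof(float) * block_size_x * block_size_y      /* disp_sdata         */
                + 3 * sizeof(U16) * block_size_x * block_size_y;   /* color{R,G,B}_sdata */
    printf("static shared memory per block: %u bytes\n", (unsigned)smem);  /* prints 2890 */
    return 0;
}[/codebox]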

I also read in the sticky note that the built-in variables use 16 bytes of shared memory.

As for the kernel parameters, I use the following:

float* d_idata, size_t pitch_in,
U16*   d_RGBmap,
float* d_odata, size_t pitch_out,
unsigned int width, unsigned int height

Since the size of a pointer is the size of an address (4 bytes), as is the size of size_t and of unsigned int, the parameters use 4*7 = 28 bytes.

So the grand total is

2890 + 16 + 28 = 2934 bytes, which is quite far from the shared memory limit of 16 KB.

However, whenever I use a block size of (18,18), which corresponds to 324 threads/block and 4*324 + 3*2*324 = 3240 bytes of shared memory, I get the error cudaError_enum, and I don’t understand why…

Is there something else that I’m forgetting?

Has anybody else run into this kind of trouble before?

I’m pretty sure this is related to shared memory, since I’ve gone through some topics in the NVIDIA forum that made it very clear there is a link between the error “cudaError_enum” and shared memory. Moreover, the block size is the only parameter I change before the error occurs, so it definitely comes from there.

UP

The following quotation is the exact error that I get:

cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file <d:/.................cu>, line 140 : too many resources requested for launch.
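That message is reported right after the kernel launch; without cutil, roughly the equivalent check is just (a sketch with a placeholder kernel name):

[codebox]myKernel<<<grid, block>>>( /* ... kernel arguments ... */ );
cudaError_t err = cudaGetLastError();   // "too many resources requested for launch" == cudaErrorLaunchOutOfResources
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));[/codebox]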

I don’t understand why I hit such a limitation, given that I’m not using that much memory space (<< 16 KB).

Could this issue arise from the fact that the graphics card I’m using for computation is also driving the display?

Have you checked how many registers your kernel uses? You make no mention of it, and thus it’s plausible you might be running out.

Hi Fugl,

Thanks for your advice.

I have checked with the Visual Profiler: whether I use (17,17) blocks (the maximum square block size) or (17,18)/(18,17) blocks (the maximum block size), I always get 25 registers/thread.

For a configuration with (18,18) blocks the program produces the error detailed above, and the kernel is not launched because of a lack of resources (but which resources? For a moment I was convinced that it had to do with shared memory limitations, but apparently you think it might be linked to the registers).

Let’s take a closer look:

Since the Quadro FX 4600 that I’m using is a compute capability 1.0 device, I have 8192 registers available per multiprocessor, and those have to cover at least one whole block.

So with a block of size 17*17 = 289 threads/block, I have 8192/289 ≈ 28.34 registers/thread (> 25).

And with a block of size 17*18 = 306 threads/block: 8192/306 ≈ 26.77 registers/thread (> 25).

And with a block of size 18*18 = 324 threads/block (the size from which I always get the error about a lack of resources), I have 8192/324 ≈ 25.28 registers/thread. This is close to 25 but still above it.

[b][i]So do you think that this is actually the reason why the kernel will not launch?

Do you have an idea how the Profiler knows that I need 25 registers to perform the computation?

Last but not least: how is the number of registers used related to the program itself? (Is it related to the number of blocks, or to the block size?)[/i][/b]

I’m open to any idea or suggestion!

It’s definitely because you run out of registers. Take a look at chapter 5.2 in the CUDA 2.1 programming guide; it has the exact formula. For your numbers I got:

[codebox]Ceiling[25*Ceiling[17*17, 32], 8192/32] = 8192
Ceiling[25*Ceiling[17*18, 32], 8192/32] = 8192
Ceiling[25*Ceiling[18*18, 32], 8192/32] = 8960[/codebox]

This matches perfectly with what you see: when you go above 320 threads per block (~17.88^2), you no longer have enough registers to run a thread block on the multiprocessor.
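If it helps, here is the same check as a small host-side program (8192 registers per multiprocessor and the 8192/32 = 256 register allocation granularity are the compute capability 1.0 values used above):

[codebox]#include <stdio.h>

/* Round x up to the nearest multiple of m, i.e. what Ceiling[x, m] does above. */
static unsigned int ceil_to(unsigned int x, unsigned int m)
{
    return ((x + m - 1) / m) * m;
}

int main(void)
{
    const unsigned int regs_per_thread = 25;             /* reported by the profiler     */
    const unsigned int reg_file        = 8192;           /* registers per SM on CC 1.0   */
    const unsigned int granularity     = reg_file / 32;  /* allocation granularity = 256 */
    const unsigned int threads[3]      = { 17 * 17, 17 * 18, 18 * 18 };

    for (int i = 0; i < 3; ++i) {
        unsigned int regs_per_block =
            ceil_to(regs_per_thread * ceil_to(threads[i], 32), granularity);
        printf("%u threads/block -> %u registers (%s)\n",
               threads[i], regs_per_block,
               regs_per_block <= reg_file ? "fits" : "too many resources");
    }
    return 0;
}[/codebox]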

The profiler most likely hooks into the driver and reads the resource usage recorded in the cubin that the GPU is currently running.

The number of registers depends on the complexity of your kernel. If your kernel is very long, with complex calculations or a large amount of per-thread data, register usage will increase.
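You can also ask the compiler directly instead of going through the profiler: building with the verbose ptxas flag makes it report, for every kernel, the number of registers and the bytes of shared memory it uses (mykernel.cu below is a placeholder for your own file):

[codebox]nvcc --ptxas-options=-v -c mykernel.cu[/codebox]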

I had seen this note in the programming guide… but I guess I’m too dumb to link the theory to the facts.
Thanks anyway, this was definitely my problem!