Shared memory limits and cudaError_enum: how to precisely determine how much shared memory is used

Hi all,

I’m trying to improve the execution speed of a kernel. To make it faster, I want to use as many threads as possible.

Since I’m using a Quadro FX 4600, the compute capability is 1.0, so the limit is 512 threads per block and 65535 blocks per grid dimension.

When I increase the size of my blocks, I get the error cudaError_enum.

This is certainly related to shared memory, since I’m using the block size to determine how much shared memory the kernel occupies:

At most my block dimensions are (17,17), which corresponds to 289 (< 512) threads per block.

And the grid size is (17408,13056), with each dimension < 65535.
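Expressed as a launch configuration, this is roughly what I do (the kernel name below is just a placeholder for mine):

[codebox]dim3 block(17, 17);        // 289 threads per block, below the 512-thread limit
dim3 grid(17408, 13056);   // both grid dimensions below the 65535 limit
myKernel<<<grid, block>>>( /* ... kernel arguments ... */ );[/codebox]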

#define U16 unsigned short int

// Shared memory
__shared__ float disp_sdata[block_size_x][block_size_y];   // shared memory for the disparity map
__shared__ U16   colorR_sdata[block_size_x][block_size_y];
__shared__ U16   colorG_sdata[block_size_x][block_size_y];
__shared__ U16   colorB_sdata[block_size_x][block_size_y];

So this means that I’m using

  • sizeof(float)*block_size_x*block_size_y + 3*sizeof(unsigned short int)*block_size_x*block_size_y

which is equal to

 4*289 + 3*2*289 = <b>2890 bytes</b>
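To double-check that figure, here is a minimal host-side sketch of the same computation (with block_size_x and block_size_y hard-coded to 17):

[codebox]#include <stdio.h>

#define U16 unsigned short int
#define block_size_x 17
#define block_size_y 17

int main(void)
{
    size_t smem = sizeof(float) * block_size_x * block_size_y      /* disp_sdata         */
                + 3 * sizeof(U16) * block_size_x * block_size_y;   /* color{R,G,B}_sdata */
    printf("static shared memory per block: %u bytes\n", (unsigned)smem);  /* prints 2890 */
    return 0;
}[/codebox]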

I also read in the sticky note that the built-in variables use 16 bytes of shared memory.

As for the kernel parameters, I use the following:

float* d_idata, size_t pitch_in,
U16*   d_RGBmap,
float* d_odata, size_t pitch_out,
unsigned int width, unsigned int height

Since the size of a pointer is the size of an address (4 bytes), as is the size of size_t and of unsigned int, the parameters use 4*7 = 28 bytes.

So the grand total is

2890 + 16 + 28 = 2934 bytes, which is quite far from the shared memory limit of 16 KB.

However, whenever I use a block size of (18,18), which corresponds to 324 threads/block and 4*324 + 3*2*324 = 3240 bytes of shared memory, I get the error cudaError_enum, and I don’t understand why…

Is there something else that I’m forgetting?

Has anybody else run into this kind of trouble before?

I’m pretty sure this is related to shared memory, since I’ve gone through some topics in the NVIDIA forum that made it very clear there is a link between the error “cudaError_enum” and shared memory. Moreover, the block size is the only parameter I change before the error occurs, so it definitely comes from there.

UP

The following quotation is the exact error that I get:

cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file <d:/.................cu>, line 140 : too many resources requested for launch.
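That message is reported right after the kernel launch; without cutil, roughly the equivalent check is just (a sketch with a placeholder kernel name):

[codebox]myKernel<<<grid, block>>>( /* ... kernel arguments ... */ );
cudaError_t err = cudaGetLastError();   // "too many resources requested for launch" == cudaErrorLaunchOutOfResources
if (err != cudaSuccess)
    printf("kernel launch failed: %s\n", cudaGetErrorString(err));[/codebox]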

I don’t understand why I hit such a limitation, given that I’m not using that much memory space (<< 16 KB).

Could this issue arise from the fact that the graphics card I’m using for computation is also driving the display?

Have you checked how many registers your kernel uses? You make no mention of it, and thus it’s plausible you might be running out.

Hi Fugl,

Thanks for your advice.

I have checked with the Visual Profiler: whether I use (17,17) blocks (the maximum square block size) or (17,18)/(18,17) blocks (the maximum block size), I always get 25 registers/thread.

For a configuration with (18,18) blocks the program produces the error detailed above, and the kernel is not launched because of a lack of resources (but which resources? For a moment I was convinced that it had to do with shared memory limitations, but apparently you think it might be linked to the registers).

Let’s take a closer look:

Since the Quadro FX 4600 that I’m using is a compute capability 1.0 device, I have 8192 registers available per multiprocessor, and those have to cover at least one whole block.

So with a block of size 17*17 = 289 threads/block, I have 8192/289 ≈ 28.34 registers/thread (> 25).

And with a block of size 17*18 = 306 threads/block: 8192/306 ≈ 26.77 registers/thread (> 25).

And with a block of size 18*18 = 324 threads/block (the size from which I always get the error about a lack of resources), I have 8192/324 ≈ 25.28 registers/thread. This is close to 25 but still above it.

[b][i]So do you think that this is actually the reason why the kernel will not launch?

Do you have an idea how the Profiler knows that I need 25 registers to perform the computation?

Last but not least: how is the number of registers used related to the program itself? (Is it related to the number of blocks, or to the block size?)[/i][/b]

I’m open to any idea or suggestion!

It’s definitely because you run out of registers. Take a look at chapter 5.2 in the CUDA 2.1 programming guide; it has the exact formula. For your numbers I got:

[codebox]Ceiling[25*Ceiling[17*17, 32], 8192/32] = 8192
Ceiling[25*Ceiling[17*18, 32], 8192/32] = 8192
Ceiling[25*Ceiling[18*18, 32], 8192/32] = 8960[/codebox]

This matches perfectly with what you see: when you go above 320 threads per block (~17.88^2), you no longer have enough registers to run a thread block on the multiprocessor.
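If it helps, here is the same check as a small host-side program (8192 registers per multiprocessor and the 8192/32 = 256 register allocation granularity are the compute capability 1.0 values used above):

[codebox]#include <stdio.h>

/* Round x up to the nearest multiple of m, i.e. what Ceiling[x, m] does above. */
static unsigned int ceil_to(unsigned int x, unsigned int m)
{
    return ((x + m - 1) / m) * m;
}

int main(void)
{
    const unsigned int regs_per_thread = 25;             /* reported by the profiler     */
    const unsigned int reg_file        = 8192;           /* registers per SM on CC 1.0   */
    const unsigned int granularity     = reg_file / 32;  /* allocation granularity = 256 */
    const unsigned int threads[3]      = { 17 * 17, 17 * 18, 18 * 18 };

    for (int i = 0; i < 3; ++i) {
        unsigned int regs_per_block =
            ceil_to(regs_per_thread * ceil_to(threads[i], 32), granularity);
        printf("%u threads/block -> %u registers (%s)\n",
               threads[i], regs_per_block,
               regs_per_block <= reg_file ? "fits" : "too many resources");
    }
    return 0;
}[/codebox]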

The profiler most likely hooks into the driver and reads the resource usage recorded in the cubin that the GPU is currently running.

The number of registers depends on the complexity of your kernel. If your kernel is very long, with complex calculations or a large amount of per-thread data, register usage will increase.
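You can also ask the compiler directly instead of going through the profiler: building with the verbose ptxas flag makes it report, for every kernel, the number of registers and the bytes of shared memory it uses (mykernel.cu below is a placeholder for your own file):

[codebox]nvcc --ptxas-options=-v -c mykernel.cu[/codebox]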

I had seen this note in the programming guide… but I guess I’m too dumb to link the theory to the facts.
Thanks anyway, this was definitely my problem!