Grid & Block dimensions

Hi

I have some rendering kernel called this way:

width = 320;
height = 240;

dim3 GridDimension = dim3(32, 16);

dim3 BlockDimension = dim3((width + GridDimension.x - 1) / GridDimension.x,
                           (height + GridDimension.y - 1) / GridDimension.y);

KDKernel<<<GridDimension, BlockDimension>>>((unsigned char *) surface,
                                            (uint4 *) NodesMemory,
                                            (unsigned int *) IndicesMemory,
                                            (float4 *) TrianglesMemory);

All of this works fine on my G80, but if I change GridDimension to
dim3 GridDimension = dim3(16, 16);
my app crashes. Why?
(Inside the kernel I check that I don't write out of bounds, so that is not the cause.)
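To make the two launch shapes concrete, here is a small host-side sketch of the BlockDimension arithmetic from the post. The helper names (ceilDiv, threadsPerBlock, launchShapeIsValid) are made up for illustration, and the 512-thread limit is the per-block maximum I am assuming for G80-class hardware.

```cpp
#include <cassert>

// Ceiling division: smallest n with n * b >= a (for a >= 0, b > 0).
constexpr int ceilDiv(int a, int b) { return (a + b - 1) / b; }

// Threads per block when the image is split across a gridX x gridY grid,
// mirroring the BlockDimension formula from the post.
int threadsPerBlock(int width, int height, int gridX, int gridY) {
    assert(gridX > 0 && gridY > 0);
    return ceilDiv(width, gridX) * ceilDiv(height, gridY);
}

// Assumption: 512 threads per block is the hardware limit (G80 class).
constexpr int kMaxThreadsPerBlock = 512;

// Hypothetical check that a launch shape stays under the thread limit.
bool launchShapeIsValid(int width, int height, int gridX, int gridY) {
    return threadsPerBlock(width, height, gridX, gridY) <= kMaxThreadsPerBlock;
}
```

For the 320x240 image, a 32x16 grid gives 10x15 = 150 threads per block and a 16x16 grid gives 20x15 = 300, so both shapes stay under the assumed limit and the thread count alone does not explain the crash.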

How are the addresses of the global memory parameters passed to the individual threads
in the kernel? Via shared memory?

If so, in the first case (grid = 32x16) I would need 10*15*4*4 = 2400 bytes of shared memory,

and in the second case (grid = 16x16) I would need 20*15*4*4 = 4800 bytes of shared memory.

My app is a 64-bit app, so are those addresses 8 bytes long each?
Or are they still 4 bytes long, since the GPU operates on a 32-bit address space? There is nothing about this in the docs.

Even if they are 8 bytes long each, my parameters should occupy 9600 bytes of shared memory, so I still don't know why my kernel crashes :/
(The kernel itself makes no use of shared memory via the __shared__ qualifier.)

Any ideas?

I don't think you should calculate your shared memory usage that way. In my opinion the parameters are passed only once, into each block's shared memory (it is shared, after all), and not to each thread individually. (It would be really bad if you could run into problems just by passing 4 addresses and using 512 threads per block...)
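The difference between the two ways of counting can be sketched with plain arithmetic. Both helper names below are hypothetical, and the "once per block" model is the assumption made in this reply, not something checked against the hardware.

```cpp
#include <cstddef>

// Bytes for `count` pointer arguments when they are stored once per block
// (the model suggested in this reply).
constexpr std::size_t paramBytesPerBlock(std::size_t count,
                                         std::size_t pointerSize) {
    return count * pointerSize;
}

// Bytes for the same arguments if every thread had its own copy
// (the estimate used in the question).
constexpr std::size_t paramBytesPerThreadCopy(std::size_t count,
                                              std::size_t pointerSize,
                                              std::size_t threads) {
    return count * pointerSize * threads;
}
```

With 4 pointers, the per-block model needs 16 bytes (4-byte pointers) or 32 bytes (8-byte pointers) regardless of block size, while the per-thread estimate grows to 2400, 4800, or 9600 bytes as in the question.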

But maybe your register usage is too high. That would happen if there are enough registers for your kernel to run 150 threads per block, but running 300 threads per block needs double the number of registers.

I'm not really into that topic, but I check every kernel with CUT_CHECK_ERROR("some message"), and if it tells me "too many resources requested" I simply limit the number of threads for that kernel and therefore use more blocks.

Vrah

It's not register usage: there are 13 regs per thread, so 20*15*13 = 3900, far less than the 8k limit.
I've found that this crash is due to execution time per block (not per whole kernel);
probably Vista or the driver does something to keep the whole system from hanging when there is only
one GPU.
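The register estimate above, written out as a sketch. The 8192 figure is the "8k limit" from this post taken as a per-multiprocessor budget, and the helper names are made up; real allocation granularity may round the numbers up.

```cpp
// Assumption: 8192 registers available per multiprocessor (the "8k limit").
constexpr int kRegistersPerMultiprocessor = 8192;

// Registers one block needs if every thread uses regsPerThread registers.
constexpr int registersPerBlock(int blockX, int blockY, int regsPerThread) {
    return blockX * blockY * regsPerThread;
}

// Hypothetical check: does a single block fit into the register budget?
constexpr bool blockFitsRegisterFile(int blockX, int blockY, int regsPerThread) {
    return registersPerBlock(blockX, blockY, regsPerThread)
           <= kRegistersPerMultiprocessor;
}
```

For the 20x15 block with 13 registers per thread this gives 3900 registers, which fits the assumed budget.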

Another related question about shared memory: if all threads use the same locations for params, and nvcc gives me something like this:

Used 13 registers, 96+0 bytes lmem, 40+32 bytes smem, 144 bytes cmem[0], 40 bytes cmem[1]

and I want to use some amount of shared memory PER THREAD
(say I need an array of 8 chars for each thread),
should I write:

extern __shared__ char gShared[];

__global__ void Kernel(…)
{
    // 8-byte slice per thread, starting at the beginning of gShared
    char *buff = &gShared[((blockDim.x * threadIdx.y) + threadIdx.x) * 8];
}

OR:

extern __shared__ char gShared[];

__global__ void Kernel(…)
{
    // same slice, but shifted past the 40+32 bytes of smem reported by nvcc
    char *buff = &gShared[(40 + 32) + (((blockDim.x * threadIdx.y) + threadIdx.x) * 8)];
}

In other words, does the compiler automatically adjust the gShared offset so that it does not overwrite
the parameters passed via shared memory, or is it up to me to adjust it accordingly?

http://forums.nvidia.com/index.php?showtopic=49835

This thread contains a lot of info about shared memory.

Short answer to your question: the address of the shared memory exposed to you always lies after the last space used for parameters etc., so your gShared is completely free for you to use. You can start working at gShared[0].
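If the array really does start at gShared[0], the per-thread indexing from the question reduces to the arithmetic below. This is a host-side sketch with made-up helper names; on the device, blockDimX, tx, and ty would be blockDim.x, threadIdx.x, and threadIdx.y.

```cpp
#include <cstddef>

// Byte offset of one thread's private slice inside the dynamic shared array,
// matching ((blockDim.x * threadIdx.y) + threadIdx.x) * bytesPerThread.
constexpr std::size_t sliceOffset(std::size_t blockDimX,
                                  std::size_t tx, std::size_t ty,
                                  std::size_t bytesPerThread) {
    return (blockDimX * ty + tx) * bytesPerThread;
}

// Total dynamic shared memory the block needs. This is the value to pass
// as the third argument of the <<<grid, block, bytes>>> launch configuration.
constexpr std::size_t dynamicSharedBytes(std::size_t blockDimX,
                                         std::size_t blockDimY,
                                         std::size_t bytesPerThread) {
    return blockDimX * blockDimY * bytesPerThread;
}
```

For a 20x15 block with 8 bytes per thread, the last thread's slice starts at byte 2392 and the block needs 2400 bytes in total.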

Thanks, now my kernel works fine.

The gotcha is that, although no offset is needed, you don't have all 16384 bytes of shared memory
to yourself :) (16 bytes or so are occupied by kernel internals like block size, etc.)
That's why my kernel crashed when I used the entire shared memory :)
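A rough way to budget for this is sketched below. It assumes that the two smem figures printed by nvcc (40+32 in the output quoted earlier) add up to the space already taken per block; that reading of the nvcc output is my assumption, and the helper names are made up.

```cpp
#include <cstddef>

// 16384 bytes of shared memory, the figure quoted in this thread.
constexpr std::size_t kSharedMemoryBytes = 16384;

// Bytes left for dynamic shared memory after the already-used space
// (parameters, kernel internals) is subtracted.
constexpr std::size_t availableDynamicBytes(std::size_t usedBytes) {
    return usedBytes >= kSharedMemoryBytes ? 0 : kSharedMemoryBytes - usedBytes;
}

// Hypothetical pre-launch check for a dynamic shared memory request.
constexpr bool requestFits(std::size_t dynamicBytes, std::size_t usedBytes) {
    return dynamicBytes <= availableDynamicBytes(usedBytes);
}
```

Under that assumption, 72 bytes already in use leaves 16312 bytes, so asking for the full 16384 bytes fails while a 2400-byte request fits.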