hitting the grid size limitation

Hello,

The grid size is limited to 65535 per dimension. In my code, I am hitting this limitation quite frequently and would like to know what the best way to work around it is. Should I try and split the kernel into multiple calls or is there some way to work around this limitation?

So currently, my code does something like the following:

[codebox]
dim3 B2(256,1,1);
dim3 G2(Grid_size,1,1); // Grid_size can exceed the allowed limit…
MyKernel<<<G2, B2>>>();
[/codebox]

My Grid_size can extend beyond the limit of 65535 per dimension. I do not really know how to seamlessly take advantage of the other dimensions. When I just try to set them to some other number, my kernel times out…

Thanks for any help you can give me.

/x

It is explained in Section 2.2 of the programming guide, but the premise is exactly the same as column-major ordered storage in arrays. The ID of a thread within a 3D block of dimensions (Dx,Dy,Dz) is:

dIdx = threadIdx.x + threadIdx.y*Dx + threadIdx.z*Dx*Dy

and the index of any given block in a grid of dimensions (Gx,Gy) is:

gIdx = blockIdx.x + blockIdx.y*Gx

Each block holds Dx*Dy*Dz threads, so the “global” index of any thread is

Idx = dIdx + gIdx*Dx*Dy*Dz

With 512 threads per block, that gives you 512 * 65535 * 65535 = 2198956147200 indices to work with.

Just make sure that, once you make this change, you never refer to blockIdx.x directly anymore, but use your variable gIdx instead.
This can be a pain if you have a device function somewhere that is called from two kernels launched with different configurations.
Personally, I try to avoid using higher dimensions, even where they would be the natural fit, for exactly these reasons.
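
A minimal sketch of the indexing described above, assuming a 1D block and a 2D grid. The names `make_grid_2d`, `my_kernel_2d` and `launch_my_kernel_2d` are hypothetical and not from the original post. The host splits the required block count across x and y, and the kernel flattens `blockIdx` using the built-in `gridDim`. Since the 2D grid can contain more blocks than are needed, the kernel guards every access with `idx < n`. The flattened index can exceed the 32-bit range, so this sketch uses `size_t`.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: split a 1D block count across a 2D grid so that
// neither dimension exceeds the 65535 limit. Assumes num_blocks is at
// most 65535 * 65535.
inline dim3 make_grid_2d(unsigned int num_blocks)
{
    const unsigned int max_dim = 65535u;
    if (num_blocks <= max_dim)
        return dim3(num_blocks > 0 ? num_blocks : 1, 1, 1);
    const unsigned int gy = (num_blocks + max_dim - 1) / max_dim; // rows needed
    const unsigned int gx = (num_blocks + gy - 1) / gy;           // blocks per row
    return dim3(gx, gy, 1);
}

__global__ void my_kernel_2d(float *data, size_t n)
{
    // Flattened block index; gridDim is a built-in variable, like blockDim.
    const size_t gIdx = blockIdx.x + (size_t)blockIdx.y * gridDim.x;
    // Global thread index for a 1D block of blockDim.x threads.
    const size_t idx = gIdx * blockDim.x + threadIdx.x;
    if (idx < n) // the 2D grid may contain surplus blocks
        data[idx] = 0.0f;
}

// Hypothetical launcher: d_data is assumed to point to n floats on the device.
void launch_my_kernel_2d(float *d_data, size_t n)
{
    const unsigned int threads = 256;
    const unsigned int blocks = (unsigned int)((n + threads - 1) / threads);
    my_kernel_2d<<<make_grid_2d(blocks), threads>>>(d_data, n);
}
```

For example, 100000 blocks would be launched as a 50000 x 2 grid under this helper.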

Hello,

Thanks for the reply. Just a quick clarification though:

So, the thread ID in a 3D block can be obtained as:

[codebox]

const int tid = threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y;

[/codebox]

How do I get the dimensions of the grid though from a kernel? So, to get the index of a block I have:

[codebox]

const int bid_3D = blockIdx.x + blockIdx.y * (Grid_dimension_x);

[/codebox]

How do I get the dimensions of my grid from a kernel?

Many thanks,

/x

Rather than using two-dimensional grids, you could have the threads in your kernel iterate. This example works for arbitrarily large N (as long as N plus the total thread count still fits in an unsigned int), no matter what the block dimensions are.

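[codebox]
__global__
void set_to_zero(float * ptr, unsigned int N)
{
    // total number of threads in the (1D) grid
    const unsigned int grid_size = blockDim.x * gridDim.x;

    // first element handled by this thread
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // stride through the array one full grid at a time
    while(i < N)
    {
        ptr[i] = 0.0f;
        i += grid_size;
    }
}
[/codebox]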

FWIW this is the strategy used in Thrust's algorithms.
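
A sketch of how a kernel like the one above might be launched, assuming a 1D block of 256 threads. The helper `capped_grid_size` and the wrapper `launch_set_to_zero` are hypothetical names, not from the original post. The idea is to request enough blocks to cover N once and clamp that number to the 65535 limit. The loop inside the kernel picks up whatever the grid does not cover in one pass.

```cuda
#include <cuda_runtime.h>

// Same grid-stride kernel as above, repeated so this sketch is self-contained.
__global__ void set_to_zero(float *ptr, unsigned int N)
{
    const unsigned int grid_size = blockDim.x * gridDim.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    while (i < N)
    {
        ptr[i] = 0.0f;
        i += grid_size;
    }
}

// Hypothetical helper: enough blocks to cover N elements once,
// clamped to the per-dimension limit of 65535.
inline unsigned int capped_grid_size(unsigned int N, unsigned int threads)
{
    const unsigned int max_blocks = 65535u;
    // ceil(N / threads) written so that it cannot overflow
    unsigned int blocks = N / threads + (N % threads != 0 ? 1u : 0u);
    if (blocks == 0)
        blocks = 1; // a launch needs at least one block
    return blocks < max_blocks ? blocks : max_blocks;
}

// Hypothetical wrapper: d_ptr is assumed to point to N floats on the device.
void launch_set_to_zero(float *d_ptr, unsigned int N)
{
    const unsigned int threads = 256;
    set_to_zero<<<capped_grid_size(N, threads), threads>>>(d_ptr, N);
}
```

With this clamp the launch configuration never exceeds the limit, at the cost of each thread doing several iterations when N is large.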

The only time I’ve hit the max grid dimension I tried both the above methods and found that using a 2D grid was faster. I guess it depends a bit on your algorithm.