The grid size is limited to 65535 per dimension. My code hits this limit quite frequently, and I would like to know the best way to deal with it. Should I split the kernel into multiple calls, or is there some way to work around the limitation?
So currently, my code does something like the following:
[codebox]
dim3 B2(256,1,1);
dim3 G2(Grid_size,1,1); // Grid_size can exceed the allowed limit…
MyKernel<<<G2, B2>>>();
[/codebox]
My Grid_size can go beyond the limit of 65535 per dimension. I do not really know how to seamlessly take advantage of the other dimensions. When I just try to set them to some other number, my kernel times out…
It is explained in Section 2.2 of the programming guide, but the premise is exactly the same as column-major ordered storage in arrays. The ID of the thread at index (x, y, z) within a 3D block of dimensions (Dx, Dy, Dz) is x + y*Dx + z*Dx*Dy. The same flattening works for the blocks of a 2D grid, e.g. gIdx1 = blockIdx.x + blockIdx.y * gridDim.x.
Just make sure that when you make this change, you never refer to blockIdx.x directly anymore, but use your variable gIdx1 instead.
Sometimes this can be a pain, for example if you have a device function somewhere that can be called from two kernels launched with different configurations.
Personally, I try to avoid using higher dimensions - even when they are the natural fit! - for exactly these reasons.
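For reference, here is a rough sketch of what the 2D-grid flattening discussed above could look like. The helpers `grid_x_for` and `grid_y_for`, the constant `kMaxGridDim`, and the `data`/`n` kernel parameters are made up for illustration; they are not from the original code. The `#ifdef __CUDACC__` guard is only there so the two host helpers can also be compiled and checked with a plain C++ compiler.

```cpp
#include <cstdio>

// Per-dimension grid limit quoted in this thread (assumption for this sketch;
// newer devices allow a larger gridDim.x).
static const unsigned int kMaxGridDim = 65535u;

// Hypothetical helper: gridDim.x = min(num_blocks, limit), at least 1.
inline unsigned int grid_x_for(unsigned int num_blocks)
{
    if (num_blocks == 0) return 1;
    return num_blocks < kMaxGridDim ? num_blocks : kMaxGridDim;
}

// Hypothetical helper: number of grid rows so that grid_x * grid_y >= num_blocks.
inline unsigned int grid_y_for(unsigned int num_blocks)
{
    if (num_blocks == 0) return 1;
    return (num_blocks - 1) / grid_x_for(num_blocks) + 1;  // ceiling division
}

#ifdef __CUDACC__  // everything below needs nvcc

__global__ void MyKernel(float *data, unsigned int n)
{
    // Flattened block index; use gIdx1 wherever the 1D version used blockIdx.x.
    const unsigned int gIdx1 = blockIdx.x + blockIdx.y * gridDim.x;
    // Assumes the total thread count fits in an unsigned int.
    const unsigned int i = gIdx1 * blockDim.x + threadIdx.x;
    if (i < n)  // the last grid row may contain surplus blocks
        data[i] = 0.0f;
}

int main()
{
    const unsigned int n = 256u * 70000u;  // needs 70000 blocks of 256 threads
    const unsigned int num_blocks = (n - 1) / 256u + 1;

    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) return 1;

    dim3 B2(256, 1, 1);
    dim3 G2(grid_x_for(num_blocks), grid_y_for(num_blocks), 1);
    MyKernel<<<G2, B2>>>(d_data, n);

    cudaError_t err = cudaGetLastError();               // launch configuration errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    printf("grid = (%u, %u), status = %s\n", G2.x, G2.y, cudaGetErrorString(err));

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}

#endif
```

Since the 2D grid is rounded up to whole rows, it can contain more blocks than needed, so the `if (i < n)` guard is what keeps the surplus blocks from writing out of bounds.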
Rather than using a two-dimensional grid, you could have the threads in your kernel iterate. This example works for any N that fits in an unsigned int, no matter what the grid and block dimensions are (strictly speaking, as long as i + grid_size does not wrap around).
[codebox]
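__global__
void set_to_zero(float * ptr, unsigned int N)
{
    // total number of threads in the grid, i.e. the stride of the loop
    const unsigned int grid_size = blockDim.x * gridDim.x;
    // global index of the first element this thread handles
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    while(i < N)
    {
        ptr[i] = 0.0f;
        i += grid_size;
    }
}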
[/codebox]
FWIW this is the strategy used in Thrust's algorithms.
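On the host side, a launch for the kernel above might look something like the sketch below: ask for enough blocks to cover N once, but never more than the limit, and let the loop pick up the rest. The helper `clamped_blocks` and the constant `kMaxGridDim` are names I made up for this example (this is not Thrust's actual code), and the `#ifdef __CUDACC__` guard is only there so the helper can be checked with a plain C++ compiler.

```cpp
#include <cstdio>

// Per-dimension grid limit quoted in this thread (assumption for this sketch).
static const unsigned int kMaxGridDim = 65535u;

// Hypothetical helper: enough blocks to cover N once, clamped to [1, kMaxGridDim].
inline unsigned int clamped_blocks(unsigned int N, unsigned int threads_per_block)
{
    const unsigned int needed = (N == 0) ? 1 : (N - 1) / threads_per_block + 1;
    return needed < kMaxGridDim ? needed : kMaxGridDim;
}

#ifdef __CUDACC__  // everything below needs nvcc

__global__
void set_to_zero(float * ptr, unsigned int N)
{
    const unsigned int grid_size = blockDim.x * gridDim.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    while(i < N)
    {
        ptr[i] = 0.0f;
        i += grid_size;
    }
}

int main()
{
    const unsigned int N = 256u * 70000u;  // more than 65535 blocks' worth of data
    const unsigned int threads = 256;
    const unsigned int blocks = clamped_blocks(N, threads);

    float *d_ptr = NULL;
    if (cudaMalloc((void **)&d_ptr, N * sizeof(float)) != cudaSuccess) return 1;

    // 1D launch that stays within the limit; each thread may loop more than once.
    set_to_zero<<<blocks, threads>>>(d_ptr, N);

    cudaError_t err = cudaGetLastError();               // launch configuration errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    printf("blocks = %u, status = %s\n", blocks, cudaGetErrorString(err));

    cudaFree(d_ptr);
    return err == cudaSuccess ? 0 : 1;
}

#endif
```

The block count here is only a tuning choice. Because of the loop, any grid size from 1 up to the limit gives the same result, so you are free to pick whatever runs fastest on your device.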
The only time I’ve hit the max grid dimension I tried both the above methods and found that using a 2D grid was faster. I guess it depends a bit on your algorithm.