Grid dimensions

Greetings to all.

I am new to cuda and trying to get my feet wet. I’ve actually gotten my kernels to work, but yesterday, while testing various sizes of my input data, I ran into some some complications which I’ve not yet been able to solve up till now.

Actually, the problem arose when I tried changing my grid from a one dimensional grid to a two dimensional grid. I needed to do this because my input data size required more than 65535 blocks, which is more than is allowed in gridDim.x or gridDim.y.


global void cuda_kernel(unsigned char *Arr, int depth, int oneDsize){

int thrdNr = (blockIdx.x * gridDim.x + blockIdx.y) * blockDim.x + threadIdx.x;

int element = thrdNr * byteDepth;

if (element + depth - 1 < oneDsize){

		Arr[(element + 2)] = 0;

		Arr[(element + 1)] = 0;



[codebox]int main(int argc, char** argv){

//code for copying data to Arr on graphic card ommited for clarity

			int block_size = 3;

			int n_blocks = 12288000;

			int gridY = n_blocks/cudaDeviceProperties.maxGridSize[0] + (n_blocks%cudaDeviceProperties.maxGridSize[0] == 0 ? 0:1);  



			dim3 dimGrid(cudaDeviceProperties.maxGridSize[0], gridY);



			int depth, size;

                            depth = 4;

                            size = 147456000;

			cuda_filter_blue <<< dimGrid,  block_size>>>  (devptr, depth, size); 



			cudaError_t err= cudaGetLastError();


When I run this program, I get a cudaErrorLaunchFailure and I can’t seem to find out why. I noticed that when I change the line in the kernel from “dim3 dimGrid(cudaDeviceProperties.maxGridSize[0], gridY);” to “dim3 dimGrid(gridY, cudaDeviceProperties.maxGridSize[0]);”, that the kernel was able to be launched, but did not perform the desired computations.

I’ll greatly appreciate it if someone could help me find out why this doesn’t work. Thanks.

Got it. Had to change my thread indexing to

int thrdNr = (blockIdx.x * gridDim.y + blockIdx.y) * blockDim.x + threadIdx.x …

Glad you got it working. Rather big performance mistake you should be aware of though: you seem to have only three threads per block, whereas you should have at least 32, preferably more.


The parallel execution on a GPU runs in warps of 32. This means that 32 threads actively process in parallel on each multiprossor, while the others idle, covering up memory latency and suchlike. As the minimum parallel unit of threads is 32, launching a block with fewer than 32 threads (or, for that matter, anything not a multiple of 32 threads) wastes resources as the unallocated threads within a warp will do nothing (ie. launching 3 threads per block wastes more than 90% of the available power!).

In your case, you want far fewer thread blocks, and far more threads in each thread block. It’ll go a lot faster! I find that 128 is a nice number, though it varies a bit on your application.

Hope this helps!

I use defines to avoid ‘32’ appearing in different places (contexts) and also improve readability & maintainability
makes it easier when testing performance of different blocksizes i.e. “threads per block”

//#define MAXPERSEARCHTHREAD 32 // 16 works better, get 3 blocks per MP instead of 1, THREADSPERBLOCK also reduced from 1024
#define OVERLAP 32
#define INSTREAMS 4
// Can change INSTREAMS

---- eg this —
for ( int tt = 1; tt < SEARCHTHREADS; tt++)
used = used + shCounts[tt-1];
if ( threadIdx.x < MAXPERSEARCHTHREAD )
val = shPosns[tt*MAXPERSEARCHTHREAD + threadIdx.x];


— is preferable to —
for ( int tt = 1; tt < 32; tt++)
used = used + shCounts[tt-1];
if ( threadIdx.x < 32)
val = shPosns[tt*32+ threadIdx.x];


something wrong in your index of threads

"int thrdNr = (blockIdx.x * gridDim.x + blockIdx.y) * blockDim.x + threadIdx.x; "

it should be corrected as

unsigned int thrdNr = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;

according to your setting,

int n_blocks = 12288000;				

int gridY = n_blocks/cudaDeviceProperties.maxGridSize[0] + (n_blocks%cudaDeviceProperties.maxGridSize[0] == 0 ? 0:1);  	

dim3 dimGrid(cudaDeviceProperties.maxGridSize[0], gridY);

your grid has dimension (65535, 188)

if using orignal index

"int thrdNr = (blockIdx.x * gridDim.x + blockIdx.y) * blockDim.x + threadIdx.x; "

then maximum index occurs at (blockIdx.x , blockIdx.y) = (65535, 188), this means

thrdNr = ( 65535 * 65535 + 188 )*3 + threadIdx.x

However ( 65535 * 65535 + 188 )*3 > 2^31 (int has maximum value 2^31), this means that thrdNr is negative (overflow)

now if using another index-map

"int thrdNr = (blockIdx.x * gridDim.y + blockIdx.y) * blockDim.x + threadIdx.x … "

then maximum index occurs at (blockIdx.x , blockIdx.y) = (65535, 188),

thrdNr = ( 65535 * 188 + 188 )*3 + threadIdx.x

this is safe, not overflow, but it does not sweep all indices

since according to

“unsigned int thrdNr = (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x;”

maximum index occurs at (blockIdx.x , blockIdx.y) = (65535, 188), then

thrdNr = ( 188 * 65535 + 65535 )*3 + threadIdx.x

Thanks for your explicit reply, LSChien.

I forgot to mention earlier that to get it to work that way, I had changed the grid dimensioning to “dim3 dimGrid(gridY, cudaDeviceProperties.maxGridSize[0]);” which makes it more or less the same as your version :)

Hi Tigga, thanks for the pointers.

Actually, I was trying to measure the effects of various input data sizes and various thread block sizes. So I varied the input data size, and for each data input size, I used various block sizes ranging from 3 to 512 and timed the effects. The error I mentioned above occured after I had changed my grid from one dimensional to two dimensional and wanted to start measuring the timing again, so that was basically the first block size I was testing, thus the size 3 :)

Curiously though, before, while I was using just a one dimensional grid, I measured the best performances when my block sizes were 6, 7 and 8. And according to my understanding of what I read in the programming guide, which also corresponds to what you explained above, this shouldn’t be the case. I don’t yet know why I was getting peak performances from these block sizes.

And finally, I have a question concerning my dimensioning of the grid. I’m wondering whether the dimensions of the grid can have an effect on kernel performance. Taking the example above;

  • The grid has dimensions (65535, 188)

  • so it can accomodate 65535 * 188 = 12320580 blocks

  • but my input data needs 12288000 to be fully processed

  • that means there are 12320580 - 12288000 = 32580 blocks not processing any data

According to my understanding, the threads in these blocks also execute the instructions in the kernel, they just fail the condition “if (element + depth - 1 < oneDsize)” in the kernel, and thus do not go on to do any processing of the input data.

What I’m wondering is, do so many “idle” thread blocks have an adverse effect on the overall performance of the kernel? I would imagine they do but need confirmation of this. And if that’s the case, which is the most suituable way of dimensioning the grid, if one is not aware of the input data size at compile time? I didn’t have this problem before, as I was using a one dimensional grid and my input data never required more than 65535 blocks.

Thanks for all the help.