Newbie help on thread blocks

I’ve tried reading the pertinent manual sections, but something is just not making its way through my thick skull… I can’t seem to wrap my mind around how to appropriately choose thread blocks for launching my kernel. This wasn’t an issue when I was testing my work with small data sets; however, I am now encountering an “invalid configuration argument” when trying to work with larger data sets.

I also tried looking at the CUDA Occupancy spreadsheet, but it might as well be written in Mandarin for all I can understand…

My current example attempts to work on a set of 1D float arrays which are 19,695,844 elements in length.

I was following someone else’s example, launching the kernel:

int block_size = 128;
//nArrayElements = 19,695,844
int n_blocks = nArrayElements/block_size + (nArrayElements%block_size == 0 ? 0 : 1 );

PerformWork<<< n_blocks, block_size >>>(…)

Can anyone point me in the right direction for a better way to launch my kernel?


CUDA supports grid dimensions only up to 65535, which you’re exceeding.

Easy solution, have each thread process more than one value from your array.
A 2D grid would also get you high enough.

Are there any tutorials out there on 2D (and 3D?) grids? Will a 2D grid give you 2x the size of a 1D grid, or is it more?

I know that my end application will go much larger than this, so I will definitely need to find the appropriate information…

I appreciate the help. Thanks!

int block_size = 128;
int max_blocksX = 65535; // maximum gridDim.x, according to the programming guide
// nArrayElements = 19,695,844
int n_blocksX = (nArrayElements-1)/block_size + 1;
int n_blocksY = (n_blocksX-1)/max_blocksX + 1;

PerformWork<<< dim3(n_blocksX, n_blocksY), block_size >>>(...)

This one scales up to 2^31-1 elements; beyond that you’ve got a problem with the int. :-p

use something like threadIdx.x + blockIdx.x*blockDim.x + blockIdx.y*blockDim.x*gridDim.x to get the absolute index in every thread.

You’re basically just wrapping a 1D grid in a 2D one. It’s a small trick in addressing to use 2D: you can convert a 1D address into a 2D one, and a 2D address back into a 1D one, using some simple math. However, I think Ocire made a mistake. The 1D->2D conversion is:

int n_blocks = (nArrayElements-1) / block_size + 1;
int n_blocksX = (n_blocks >= max_blocksX) ? max_blocksX : n_blocks;
int n_blocksY = (n_blocks-1) / max_blocksX + 1;

Then, when inside the kernel, you must do:

int block = blockIdx.x + blockIdx.y*gridDim.x;
if( block >= n_blocks ) return; // excess blocks in the last grid row do nothing


Actually, Mandarin’s not a written language.

OK… It’s starting to sink in now. :)

That got me past the current error. Now, I’m on to a new one!

Thanks! :-)

Looking ahead, I’m still trying to figure out the best way to launch my kernel in general. I am starting with a specific example to try to figure it all out…

The documentation shows

DoWork<<< Dg, Db >>>(…)


Dg = dim3 such that Dg.x*Dg.y = number of blocks

Db = dim3 such that Db.x*Db.y*Db.z = number of threads per block

Another section tells me that I should have at least twice as many blocks as multiprocessors (16 in my case), but 100 is better and 1000 will scale better in the future. The number of threads should be a multiple of the warp size (32 in my case). I am examining the particular case of an array that is sized 10,475,400. I started by looking at the block sizes that would give me between 100 and 1,000 blocks. I saw that selecting Db by:

Db = 512 x 128 x 1 = 65,536 = 2,048*32

Would give me ~160 blocks => 10,475,400/(512*128) = 159.84.

Thus, I chose Dg:

Dg = 160 x 1

Launching my kernel:

dim3 Dg( 160, 1 );

dim3 Db( 512, 128 );

DoWork<<< Dg, Db >>>(...)

gives me an “invalid configuration argument” error.

Where did I go wrong in my reasoning?

The maximum number of threads per block is 512, so Db.x*Db.y*Db.z must be <= 512. Note that depending on the number of registers used in your kernel, the limit may be lower.

I always choose dimensions the other way around. Fix the block size (benchmark different block sizes in multiples of 32 to find the optimum). For 1D data, it is easiest just to keep the block size 1D. Then I launch enough blocks of that size to cover the whole data set.

In your case, say you benchmark and find that a block size of 128 is the fastest. Then ceil(10,475,400 / 128) = 81,840. That is larger than 65535, so: ceil(sqrt(81,840)) = 287. You need to launch a kernel with a 287x287 grid and a block size of 128.

Note that the sqrt might not be the best way to choose the 2D grid dimensions; I haven’t thought that one through fully yet. So far my data sets have always fit nicely within a 1D grid.

Well, I feel like an idiot… I even spent a little time today writing the card properties (including maximum threads per block) to a diagnostics file. I definitely should have caught that.

That being said, I am getting very slow times when I take this approach. My baseline is to use 512 threads per block and a grid of ceil(10,475,400 / 512) = 20,460 blocks. This gives me a run time of 3.3 minutes (a speed-up of less than 5x - another problem I am working on). I have played around with different block sizes and grid dimensions as per your suggestion. Any 2D value I use seems to yield a run time of around 11 minutes (give or take 15 seconds) - over 3x the original run time.

Any suggestions?

I wonder if part of my problem might be in calculating the correct index into the 1D array from the 2D blocks. I’m using the following:

int nIndex = (blockIdx.x * blockDim.x + threadIdx.x)*gridDim.y + blockIdx.y * blockDim.y + threadIdx.y;

Going column-down will destroy coalescing. You should be going row-right.

Holy crap did that make a difference! That shaved 27 seconds off of my original time. I am now down to 2:52. This was a great step in the right direction.


Oops. Spoke too soon… My run time was better, but the answers were wrong. I interpreted your post to mean I should change my index calculation to:

int nIndex = blockIdx.x * blockDim.x + threadIdx.x + (blockIdx.y * blockDim.y + threadIdx.y)*gridDim.x;

Was I reading your suggestion correctly?

You don’t have to overcomplicate the code by using threadIdx.y, just leave the blockDims 1D.
Apart from that, your correction was effective, as now consecutive threads will read consecutive parts of global memory, which allows for coalescing. (see programming guide for more information)
Provided you chose blockDim.x nicely (e.g., a multiple of the warp size, i.e. 32*x), you should also not have a problem with alignment.

PS: sorry for my mistake in the first post, don’t know why I didn’t see the missing min(…) ;-)


I see what you mean now. I’ve got it working with your suggestion. Unfortunately, I am back to the 3.3 minute timing. At least the answers are correct though. :)

Time to move on with optimization…


Well, at least you’re not getting 11 minutes anymore?

What does your kernel do?

I can’t talk specifically about what I’m doing at the moment; however, it is quite similar to the fluid dynamics example in the GPU Gems book. Sorry I can’t be more detailed right now… :-/

Oh, and to answer your other question. No, I’m not seeing the 11 minute times anymore. I’m currently at 3.3 minutes and holding. I would really like to get it down to a minute or less…

just some things to check, maybe it’ll give you the missing seconds:
does each thread only read a single value from global mem? if not, it would be advisable to use shared mem or const/texture mem.
is your code divergent within a warp?
have you chosen the optimal blockSize according to the occupancy calculator? (you can also just test different ones)
are you using many registers in your kernel? can you reduce them?
are you using “expensive” functions like sinf, cosf, expf, etc.? there exist lower precision alternatives, which are much faster (they usually are something like __cosf etc., also make sure you are using the single precision functions if you compile for sm_1.3)
look for things like 1.f/sqrtf(x), which will cost much more than rsqrtf(x) and do the same.
look for constant values which are computed the same way in every thread.

the performance guidelines in the programming guide are a good way to start and well worth reading. there is also information about the runtime of most functions.

Assuming you are bandwidth limited (most CUDA apps are), you should compute the effective bandwidth you are getting by counting up the total number of reads and writes your kernel does and dividing by the run time of the kernel. On GTX 280, you should be able to get ~110 GiB/s for fully coalesced reads.

As others have said, getting coalesced reads is very important. The difference in performance can be a factor of 10-20. The visual profiler can be used to verify that your reads are coalesced, but only on pre-G200 hardware.

If you can give us at least some idea of the memory access pattern of the kernel, we can suggest uses of constant, texture, and/or shared memory to boost performance. If telling us that would give too much away, you’ll just have to learn how and when to use them on your own. The “performance guidelines” section of the CUDA programming guide is the best place to start. Sites with additional information are linked from the CUDA 2.1 FAQ.