Blocks and Threads

Hi,
I'm new to CUDA programming and I was wondering about the difference between blocks and threads.

When writing a function in CUDA, what considerations should go into choosing the ratio between blocks and threads? In other words, when is it best to use more threads per block? I'd also like to understand the benefits of a high versus a low number of threads per block.

For example, let's say you created two vector-add functions, one launched as addVector<<<N,1>>> and the other as addVector<<<1,N>>>. When is it best to use the first one over the second, or is it always better to use only one of them in every scenario?

Thanks in advance,
Gal

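Generally speaking, you'll want a fixed block size (some multiple of the warp size, 32), and then calculate the number of blocks needed to give you enough total threads for your problem. Threads are scheduled in warps of 32, so addVector<<<N,1>>> leaves 31 of the 32 lanes in every warp idle, while addVector<<<1,N>>> runs on a single multiprocessor and is limited by the maximum number of threads per block. Exactly how many threads you need depends on your problem and how you're approaching it.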

For example, if you have some [n x 1] vector, and you want 1 thread per element, and you decide on a block size of [64 x 1 x 1], then you’ll want to have a [ceil(n / 64) x 1 x 1] grid size.
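To make the grid-size arithmetic concrete, here is a rough host-side sketch in plain C++. It is not the kernel itself: the two loops only emulate what blockIdx.x and threadIdx.x would do in a 1D launch, and the names grid_size_for and add_vector_emulated are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed so that blocks * block_size >= n,
// i.e. ceil(n / block_size) in integer arithmetic.
std::size_t grid_size_for(std::size_t n, std::size_t block_size) {
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}

// Host-side emulation of a 1D launch with one thread per element.
// The outer loop stands in for blockIdx.x and the inner loop for
// threadIdx.x; on a GPU these iterations would run in parallel.
std::vector<float> add_vector_emulated(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       std::size_t block_size) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    const std::size_t grid = grid_size_for(n, block_size);
    std::vector<float> c(n, 0.0f);
    for (std::size_t block = 0; block < grid; ++block) {
        for (std::size_t thread = 0; thread < block_size; ++thread) {
            // Same formula a kernel would use:
            // blockIdx.x * blockDim.x + threadIdx.x
            const std::size_t i = block * block_size + thread;
            if (i < n) {  // the last block may have threads past the end
                c[i] = a[i] + b[i];
            }
        }
    }
    return c;
}
```

With n = 1000 and a block size of 64 this gives 16 blocks (1024 threads), and the bounds check makes the extra 24 threads do nothing. A real kernel needs the same check whenever n is not a multiple of the block size.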

As for the optimal block size, that also depends on your algorithm. You want enough blocks resident on each multiprocessor to get enough occupancy for latency hiding, but if your blocks are too large you might actually get the opposite effect. For example, if the hardware allows 1536 threads/MP and your block size is 1024, you'll only be able to fit 1 block/MP, for a total of 1024 threads/MP, while with a block size of 256 you could theoretically fit 6 blocks/MP, for a total of 1536 threads/MP. Of course, other factors also limit how many blocks/MP you can have, such as the amount of shared memory each block uses, so there are other considerations to take into account.
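The block-size arithmetic above can be sketched as a small helper. This is a simplification that looks only at the threads/MP limit and ignores shared memory, registers, and any cap on resident blocks; the struct and function names are hypothetical.

```cpp
#include <cassert>

// How many whole blocks fit on one multiprocessor, and how many
// threads that makes, if the threads/MP limit is the only constraint.
struct Residency {
    int blocks_per_mp;
    int threads_per_mp;
};

Residency residency_by_thread_limit(int max_threads_per_mp, int block_size) {
    assert(block_size > 0);
    // Blocks are never split across multiprocessors, so round down.
    const int blocks = max_threads_per_mp / block_size;
    return Residency{blocks, blocks * block_size};
}
```

For a limit of 1536 threads/MP, a block size of 1024 gives 1 block and 1024 threads, while a block size of 256 gives 6 blocks and 1536 threads. On real hardware the other limits can lower these numbers, so treat the result as an upper bound.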

You can take a look at this post for some code and good information:
http://stackoverflow.com/questions/5810447/cuda-block-and-grid-size-efficiencies