Optimal threads vs blocks

Hi all,

Please know that I first googled this question extensively and read several search results on this forum. But I'm not sure what the current answer is, as the architecture has changed so dramatically over the years, especially with the advent of Fermi.

I used to have an old GeForce 8600 GT with 4 SMs (yeah, ouch!). At the time, about two years ago, I was playing around with CUDA a bit and trying to learn the basics. Back then, someone said GPUs reach peak efficiency at around 20,000 threads, meaning: try not to launch more than that. If you have more "stuff" to do in your code, write for loops so those 20,000 threads keep iterating and working.

Fast forward 2 years later to now.

Reading the CUDA by Example book and the programming guide, I’m seeing examples of launching millions of threads.

For example, I have a Fermi-based GTX 460, which has 336 cores. Assuming I want each core to be busy, I'd need 336 blocks. And I read that 256 threads per block is ideal (err… at least it was two years ago). That equates to 336 × 256 = 86,016 threads.

So with the advent of Fermi, what is best here? My guess is that I want to use each and every core. My naive algorithm previously launched upwards of 1 million threads (or a LOT more). Two years ago, someone told me to launch a much smaller number, closer to 10,000 or so, and have those threads iterate in for loops.

But what if I only launch 10,000 threads on my GTX 460? At a block size of 256, that's only 40 blocks, so by my reasoning only 40 cores being used out of 336… clearly not taking advantage of this robust card.

Bottom line: I'm trying to find a current, updated resource that explains this in detail for the little people like me. If I had to guess, I'd think that with my card, roughly 86,000 threads would be the MINIMUM I'd want to launch, thereby using all 336 cores while maintaining 256 threads per block.

Is there a section of the programming guide or some other (updated and current) resource that explains this?

Thanks in advance.



A block is executed by an SM, not by a core.

The fundamental relation between the number of cores and the minimum number of threads is that a core has about 24 cycles of arithmetic latency, so you want about 24 threads per core to fully hide it. For your GTX 460, that equates to 24 × 336 = 8064 threads.

There is nothing wrong, though, with launching more threads than this to better hide memory latency and to scale to different devices. So 20,000 threads is actually on the low side, and 100,000 to 1,000,000 threads are absolutely in the ballpark.

Thanks tera for the detailed response.

For my GTX 460, there are 7 SMs, each with 48 CUDA cores. In another thread (http://forums.nvidia.com/index.php?showtopic=193556) you posted that you want at least 576 threads on each SM for GPUs with compute capability 2.0.

So if I use a block size of 256 and launch 10,000 threads, that results in 40 blocks. A couple of questions:

(1) So these 40 blocks are spread amongst the 7 SMs?

(2) That averages to about 6 blocks per SM, and each SM has 48 CUDA cores. So can you confirm that it should NOT be my concern to "fill" each CUDA core? Meaning, I need not worry about the individual CUDA cores?

My original post was arguing that with 40 blocks, not many of the 336 cores would be used (based on the invalid assumption that each block executes on a single CUDA core). So I just want to confirm: I need only worry about making the blocks large enough, and then launching enough blocks to put enough threads on each SM. How those 48 cores get used is not for me to be concerned with?


Yes, in that thread I used the fact that on compute capability 2.x latencies seem to be closer to 18 cycles, while as a general rule of thumb I use the larger 24-cycle value, since aiming for a few more threads does not hurt.


Yes, as long as you keep the block size a multiple of 64, you should not worry about individual cores (the unit of scheduling is a warp anyway). Note that the 6 blocks per SM might not all be able to run concurrently due to other constraints (number of registers, shared memory, maximum number of warps).

In general you might want a few more blocks per SM to limit the effect of the final imbalance, when some blocks have already finished execution while others have not.