Optimal threads vs blocks

Hi all,

Please know that I first googled this question extensively and read several search results on this forum. But I'm not sure what the current answer is, as the architecture has changed so dramatically over the years, especially with the advent of Fermi.

I used to have an old 8600 GT with 4 SMs (yeah, ouch!). At the time, about two years ago, I was playing around with CUDA a bit and trying to learn the basics. Someone said back then that GPUs reach peak efficiency at around 20,000 threads, meaning: try not to launch more than that. If you have more "stuff" to do in your code, write for loops, letting those 20,000 threads iterate and keep working.

Fast forward two years to now.

Reading the CUDA by Example book and the programming guide, I’m seeing examples of launching millions of threads.

For example, I have a Fermi-based GTX 460, which has 336 cores. Assuming I want each core to be busy, I figured I need 336 blocks. And I read that 256 threads per block is ideal (err…at least it was two years ago). That works out to more than 80 thousand threads.

So with the advent of Fermi, what is best here? My guess is that I want to use each and every core. My naive algorithm was previously launching upwards of 1 million threads (or a LOT more). Two years ago, someone mentioned that I should launch a much smaller number, closer to 10,000 or so, and have those threads iterate in for loops.

But what if I only launch 10,000 threads on my GTX 460? At a block size of 256, that's only 40 blocks, and by my reasoning only 40 cores being used out of 336…so clearly not taking advantage of this robust card.

Bottom line, I'm trying to find a current, updated resource that explains this in detail for the little people like me. If I had to guess, I would think that with my card, 80 thousand threads would be the MINIMUM I would want to launch, thereby using all 336 cores while maintaining 256 threads per block.

Is there a section of the programming guide or some other (updated and current) resource that explains this?

Thanks in advance.

http://developer.download.nvidia.com/compute/cuda/3_2_prod/sdk/docs/CUDA_Occupancy_Calculator.xls

not too bad

A block is executed by an SM, not by a core.

The fundamental relation between the number of cores and the minimum number of threads is that a core has about 24 cycles of latency, so you want 24 threads per core to fully hide arithmetic latency. For your GTX 460, this equates to 24×336=8064 threads.

There is nothing wrong, though, with launching more threads than this to improve hiding of memory latency and to scale better to different devices. So 20,000 threads is actually on the lower side, and 100,000 to 1,000,000 threads are absolutely in the ballpark.
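As a rough sketch of what "let the threads iterate in a loop" looks like in practice (the kernel and numbers here are just illustrative, not tuned code):

// Grid-stride loop: launch one "large enough" grid and let each thread
// step through the data, so the same launch works whether n is ten
// thousand or ten million elements.
__global__ void scale(float *x, float a, int n)
{
    int stride = blockDim.x * gridDim.x;               // total threads in grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= a;
}

// Host side: 256 threads per block, and enough blocks that every SM
// carries plenty of warps to hide latency, e.g.
//     scale<<<336, 256>>>(d_x, 2.0f, n);              // ~86,000 threads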

Thanks tera for the detailed response.

For my GTX 460, there are 7 SMs, each with 48 CUDA cores. In another thread on these forums you posted that you want at least 576 threads on each SM for GPUs of compute capability 2.0.

So if I use a block size of 256 and launch 10,000 threads, that results in 40 blocks. A couple of questions:

(1) So these 40 blocks are spread amongst the 7 SMs?

(2) That works out to about 6 blocks per SM, and each SM has 48 CUDA cores. So can you confirm that it should NOT be my concern to "fill" each CUDA core? Meaning, I need not worry about the CUDA cores?

My original post was arguing that with 40 blocks, not many of the 336 cores would be used (based on the invalid assumption that each block executes on a CUDA core). So I just want to confirm that I need only worry about making enough threads per block, and then launching enough blocks to make sure enough threads are on each SM. How those 48 cores are used is not for me to be concerned with?

Thanks.

Yes, in that thread I used the fact that on compute capability 2.x latencies seem to be closer to 18 cycles, while as a general rule of thumb I use the larger 24-cycle value, as aiming for a few more threads does not hurt.

Yes.

Yes, as long as you keep the block size a multiple of 64 you should not worry about individual cores (the unit of scheduling is a warp anyway). Note that the 6 blocks per SM might not all be able to run concurrently due to other constraints (number of registers, shared memory, maximum number of warps).

In general you might want a few more blocks per SM to limit the effect of the final imbalance, when some blocks have already finished execution while others have not.

Yes.
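By the way, rather than hard-coding the SM count, you can query it at runtime and size the grid from that. A minimal sketch (blocksPerSM is just an illustrative starting point, not a tuned number):

#include <cuda_runtime.h>

// Put a few blocks on each SM; slightly oversubscribing also limits
// the tail imbalance mentioned above.
int pickGridSize(int blocksPerSM)                     // e.g. blocksPerSM = 8
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                // query device 0
    return prop.multiProcessorCount * blocksPerSM;    // 7 * 8 = 56 on a GTX 460
}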