I have a Tesla C2070 at compute capability 2.0, which has 14 SMs with 32 SPs per SM, so in total there are 448 CUDA cores.
Assuming I have a large enough problem with enough threads, how many concurrent threads are running on the GPU at any given time? If a warp is scheduled on each SP, then the total is 32 * 448 = 14336 threads. I just want to confirm whether this estimate is correct.
Also, what is the maximum number of blocks that can be scheduled on an SM?
Instruction execution is pipelined in the various functional units (such as the SP units), so multiple instructions (from multiple different warps) can be in flight in the pipeline at any given time.
At any given issue slot on a Fermi cc 2.0 SM, two warp instructions can begin, due to the Fermi hotclock arrangement: a warp instruction is issued to 16 cores across 2 hotclocks in order to cover the full 32 threads of the warp. In that issue slot, only half of the threads in each of the 2 warps begin executing; the other half begin executing in the next hotclock cycle.
So 32 * 448 is not the correct calculation. An SP unit handles an instruction for one thread in any given clock cycle (i.e. pipeline stage).
It is reasonable to say that 32 * 14 = 448 threads can have an instruction begin execution in any given clock cycle. The total number of threads concurrently executing in any given clock cycle depends on what is in the pipelines of the various functional units.
The maximum number of blocks that can be resident on an SM is in the documentation:
[url]http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications[/url] (table 12)
Up through cc 2.x it is 8. The Kepler generation (cc 3.x) bumps this to 16, and the Maxwell generation (cc 5.x) bumps it to 32.