I have a Tesla C2070 at compute capability 2.0, which has 14 SMs with 32 SPs per SM, so in total there are 448 CUDA cores.
Assuming I have a large enough problem with enough threads, how many concurrent threads are running on the GPU at any given time? If a warp is scheduled on each SP, then the total is 32 * 448 = 14336 threads. I just want to confirm whether this estimate is correct.
Also, what is the maximum number of blocks that can be scheduled on an SM?
Instruction execution is pipelined in the various functional units (such as the SP units), so multiple instructions (from multiple different warps) can be in flight in the pipeline at any given time.
At any given issue slot on a Fermi cc 2.0 SM, two warp instructions can begin, due to the Fermi hotclock arrangement: a warp instruction is issued to 16 cores across 2 hotclocks in order to cover the full 32 threads of the warp. In that issue slot, only half of the threads in each of the 2 warps begin executing; the other half begin executing in the next hotclock cycle.
So 32 * 448 is not the correct calculation. An SP unit handles an instruction for one thread in any given clock cycle (i.e. pipeline stage).
It is reasonable to say that 32 * 14 = 448 threads can have an instruction begin execution in any given clock cycle. The total number of threads concurrently executing in any given clock cycle depends on what is in the pipelines of the various functional units.
The maximum number of blocks that can be resident on an SM is in the documentation:
[url]http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications[/url] (table 12)
Up through cc 2.x it is 8. The Kepler generation (cc 3.x) bumps this to 16, and the Maxwell generation (cc 5.x) bumps it to 32.