Tesla C1060 Max blocks per Streaming Multiprocessor

Hello, could anyone tell me

  1. What is the max number of blocks per streaming multiprocessor (SM) for the Tesla C1060?
  2. What is the max number of threads per streaming multiprocessor (SM) for the Tesla C1060?


Appendix A in the programming guide:

The maximum number of active blocks per multiprocessor is 8
The maximum number of active threads per multiprocessor is 1024

The 2nd figure (active threads per multiprocessor) depends on the compute capability of the card.

Hello. Could somebody tell me if I have understood the numbers for the Tesla C1060 correctly?

1 Tesla = 240 cores (core = SP)
8 cores = 1 SM (SM = block)
1 Tesla = max. 30 SMs (blocks)
1 SM = max. 512 threads
1 Tesla = max. 512 * 30 = 15360 threads
1 grid = max. 65535 blocks … if this is correct, how do I distribute 15360 threads over 65535 blocks?
32 threads = 1 warp
1 SM = 512 threads = 16 warps … how does this work in parallel if there are only 8 SPs in an SM?

I thought I understood grids/blocks/threads, but now, after reading in a CUDA book that “… each SM can only accommodate up to 8 blocks …”, I am pretty confused :teehee: (because I considered SM = block). Please explain to me what I got wrong. Thanks!

I think you are confusing the terms “SM” and “block”.
SM = streaming multiprocessor, which has 8 SPs (streaming processors, or “cores”)
block = thread block, which can contain 512 threads at most (on compute capability 1.x)

One SM can have
(1) at most 8 active thread blocks AND
(2) at most 1024 active threads.
The term “simultaneously” means that one SM manages resources for all of these threads,
but since an SM has only 8 cores, it cannot execute 1024 threads at the same instant.
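The interplay of those two limits can be sketched as a tiny host-side helper in C (figures for compute capability 1.3; the function name is made up for illustration, it is not a CUDA API call):

```c
/* Per-SM limits on compute capability 1.3, as described above:
 * at most 8 resident thread blocks AND at most 1024 resident threads.
 * Whichever limit is reached first caps the number of resident blocks. */
#define MAX_BLOCKS_PER_SM   8
#define MAX_THREADS_PER_SM  1024

/* hypothetical helper, not part of the CUDA runtime */
static int resident_blocks_per_sm(int threads_per_block)
{
    int by_threads = MAX_THREADS_PER_SM / threads_per_block;
    return by_threads < MAX_BLOCKS_PER_SM ? by_threads : MAX_BLOCKS_PER_SM;
}
```

For example, 512-thread blocks hit the thread limit first (only 2 resident blocks), while 64-thread blocks hit the 8-block limit first.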

On this architecture, each group of 3 SMs shares a thread-block scheduler, which assigns thread blocks (from the waiting queue) to an SM that has free resources for a thread block.
And each SM has a warp scheduler that selects a warp (32 threads) to be executed on its 8 SPs.

One SM needs 4 cycles to execute one instruction for a warp. Physically, you should think of one SM as executing one warp at a time.
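The 4-cycle figure follows directly from the warp and SM sizes; a minimal sketch (the helper name is made up for illustration):

```c
/* A warp has 32 threads but an SM has only 8 SPs, so one warp
 * instruction is spread over 32 / 8 = 4 clock cycles. */
#define WARP_SIZE  32
#define SPS_PER_SM 8

static int cycles_per_warp_instruction(void)
{
    return WARP_SIZE / SPS_PER_SM;
}
```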

The Tesla C1060 has ‘240 cores’. So, does this mean that it can run SMs × blocksPerSM × threadsPerBlock = (240/8) × 8 × 512 = 122,880 threads concurrently?

It depends on the definition of “run threads concurrently”.

I will define “run threads concurrently” as

“the number of threads that can be scheduled at the same time (not executed at the same time).”

For compute capability 1.3 (for example, the Tesla C1060), we have

(1) maximum number of active threads per SM is 1024 and

(2) number of SM is 30

Hence there are 30 × 1024 = 30720 threads running concurrently.

Note: the conditions “maximum number of active threads per SM is 1024” and “one SM can have at most 8 resident thread blocks”

must be satisfied simultaneously. For example, if one thread block has 512 threads, then at most two thread blocks

can be resident on one SM.
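Putting the two conditions together, the device-wide “scheduled concurrently” count for a given block size can be sketched like this (compute capability 1.3 figures; the helper name is made up for illustration):

```c
/* Tesla C1060-class figures (compute capability 1.3). */
#define NUM_SMS            30
#define MAX_BLOCKS_PER_SM  8
#define MAX_THREADS_PER_SM 1024

/* Threads that can be resident (scheduled) at once across the device,
 * honouring BOTH the 8-block and the 1024-thread per-SM limits. */
static int concurrent_threads(int threads_per_block)
{
    int blocks = MAX_THREADS_PER_SM / threads_per_block;
    if (blocks > MAX_BLOCKS_PER_SM)
        blocks = MAX_BLOCKS_PER_SM;
    return NUM_SMS * blocks * threads_per_block;
}
```

With 512-thread blocks this gives the 30 × 2 × 512 = 30720 from above, while 64-thread blocks cap out at 30 × 8 × 64 = 15360.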

Good explanation, thanks.

But I’m still curious: how many threads are actually executed simultaneously?

Physically speaking, one SM executes one warp (32 threads) at a time, without dual issue.

The Tesla C1060 has 30 SMs, which means that 30 warps can be executed simultaneously.
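So the number of threads physically executing at one instant is much smaller than the number scheduled; a one-line sketch of the arithmetic (helper name made up for illustration):

```c
/* One warp per SM in flight at a time (no dual issue), 30 SMs. */
#define NUM_SMS   30
#define WARP_SIZE 32

static int threads_executing_at_once(void)
{
    return NUM_SMS * WARP_SIZE; /* 30 warps of 32 threads */
}
```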

can i add one more question?

Why is there such a hardware limitation as a maximum number of resident blocks per SM?

I still cannot find an answer by googling.

Because more resident blocks would require additional hardware, and there appears to be little gain past 8 blocks/SM that could justify the additional cost.

I understand that it is a hardware limitation, but I was wondering what exactly it is =) Why is it not possible to have 16 resident blocks with 32 threads per block, while it is possible to have 8 resident blocks with 64 threads per block?

So, as I understand it, switching between warps from different blocks is more expensive than switching between warps from the same block?

Surely it’s not a realistic case, I’m just interested =)

The scheduler works more efficiently if blocks have an even number of warps, so 64 threads per block is kind of a lower limit for a useful block size. And 8 × 64 = 512 threads per SM already is in the ballpark of where one wants to be, so it may not be worthwhile to increase the number of resident blocks per SM.

Not that I have full insight into the considerations, or understand why Nvidia found it more worthwhile to increase the number of cores per SM to a register-bandwidth-starved 48…
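The consequence of the 8-block cap for small blocks can be made concrete with a small sketch (compute capability 1.3 figures; the helper name is made up for illustration):

```c
/* With the 8-blocks-per-SM cap, 32-thread blocks can only fill
 * 8 * 32 = 256 of the 1024 thread slots on an SM, while 64-thread
 * blocks reach 8 * 64 = 512. */
#define MAX_BLOCKS_PER_SM  8
#define MAX_THREADS_PER_SM 1024

static int resident_threads_per_sm(int threads_per_block)
{
    int blocks = MAX_THREADS_PER_SM / threads_per_block;
    if (blocks > MAX_BLOCKS_PER_SM)
        blocks = MAX_BLOCKS_PER_SM;
    return blocks * threads_per_block;
}
```

This is why 16 blocks of 32 threads cannot happen: the block cap bites first and leaves three quarters of the thread slots empty.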

It seems like an even number of warps works better because of the dual instruction fetch on compute capability 2.0 (and 2×2 on 2.1).

Thanks, seems legit =)

Even on compute capability 1.x devices, even numbers of warps work better because of register banking issues (this is not officially documented, apart from the fact that block sizes which are multiples of 64 work better :smile: ).