Tesla C1060 Max blocks per Streaming Multiprocessor

ilovei7 · March 15, 2010, 7:07am

Hello, could anyone tell me:

What is the maximum number of blocks per streaming multiprocessor (SM) on the Tesla C1060?
What is the maximum number of threads per streaming multiprocessor (SM) on the Tesla C1060?

Thanks!

LSChien · March 15, 2010, 8:13am

See Appendix A of the CUDA Programming Guide:

The maximum number of active blocks per multiprocessor is 8
The maximum number of active threads per multiprocessor is 1024

Sarnath · March 15, 2010, 8:34am

The 2nd parameter (active threads) depends on the compute-capability of the card.
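For reference, here is a small sketch of the per-SM residency limits for compute capability 1.x, as I recall them from the programming guide's appendix. The dictionary name and layout are purely illustrative (not an NVIDIA API), and the values should be double-checked against the guide for your toolkit version.

```python
# Per-SM residency limits for compute capability 1.x devices.
# SM_LIMITS is a hypothetical lookup table written for this example;
# the numbers are taken from the CUDA programming guide appendix.
SM_LIMITS = {
    "1.0": {"max_blocks": 8, "max_warps": 24, "max_threads": 768},
    "1.1": {"max_blocks": 8, "max_warps": 24, "max_threads": 768},
    "1.2": {"max_blocks": 8, "max_warps": 32, "max_threads": 1024},
    "1.3": {"max_blocks": 8, "max_warps": 32, "max_threads": 1024},
}

# Tesla C1060 is a compute capability 1.3 device.
c1060 = SM_LIMITS["1.3"]
print(c1060["max_blocks"], c1060["max_threads"])
```

The block limit is the same across all 1.x devices; only the thread (and warp) limit changes with the compute capability.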

staf · March 17, 2010, 11:34pm

Hello. Could somebody tell me whether I have understood the numbers for the Tesla C1060 correctly?

1 tesla = 240 cores (core = SP)
8 cores = 1 SM (SM = block)
1 tesla = max. 30 SMs (blocks)
1 SM = max. 512 threads
1 tesla = max. 512 * 30 = 15360 threads
1 grid = max. 65535 blocks … if this is correct, how do I distribute 15360 threads across 65535 blocks?
32 threads = 1 warp
1 SM = 512 threads = 16 warps … how does this work in parallel if there are only 8 SPs in an SM?

I thought I understood grids/blocks/threads, but after reading in a book about CUDA that “… each SM can only accommodate up to 8 blocks …”, I am pretty confused (because I considered SM = block). Please explain to me what is wrong. Thanks!

LSChien · March 18, 2010, 1:50am

I think you are confusing the terms “SM” and “block”.
SM = streaming multiprocessor, which has 8 SPs (streaming processors, or “cores”)
block = thread block, which can contain at most 512 threads

One SM can have
(1) at most 8 active thread blocks AND
(2) at most 1024 active threads
simultaneously.
Here “simultaneously” means that the SM can allocate its resources (registers, shared memory, scheduling slots) to all of these threads at once,
but since the SM has only 8 cores, it cannot execute 1024 threads at the same instant.

Each group of 3 SMs has a thread-block scheduler, which assigns a thread block from the waiting queue to an SM that has free resources for it.
Each SM in turn has a warp scheduler, which selects a warp (32 threads) to be executed on the 8 SPs.

One SM needs 4 cycles to execute one instruction of a warp. Physically, you should think of one SM as executing one warp at a time.
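The 4-cycle figure follows from the warp size and the number of SPs per SM. A minimal sketch of that arithmetic, assuming one instruction is sent to all 8 SPs each cycle and ignoring the special function and double-precision units (the constant names are made up for this example):

```python
WARP_SIZE = 32    # threads per warp
SPS_PER_SM = 8    # scalar cores (SPs) per SM on Tesla C1060

# Each cycle the 8 SPs process 8 threads of the selected warp,
# so one warp instruction keeps the SPs busy for 32 / 8 cycles.
cycles_per_warp_instruction = WARP_SIZE // SPS_PER_SM
print(cycles_per_warp_instruction)
```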

zeus13i · March 18, 2010, 5:57am

The Tesla C1060 has ‘240 cores’. So, does this mean that it can run cores * blocksPerSM * threadsPerBlock = (240/8) * 8 * 512 = 122,880 threads concurrently?

LSChien · March 18, 2010, 7:26am

It depends on how you define “run threads concurrently”.

I will define it as

“the number of threads that can be scheduled at the same time (not executed at the same time).”

For compute capability 1.3 (for example, the Tesla C1060), we have

(1) a maximum of 1024 active threads per SM and

(2) 30 SMs.

Hence 30 x 1024 = 30720 threads can run concurrently in this sense.

Note: the two conditions, “the maximum number of active threads per SM is 1024” and “at most 8 thread blocks can be scheduled on one SM”, must be satisfied simultaneously. For example, if one thread block has 512 threads, then at most two thread blocks can be resident on one SM.
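A minimal sketch of how the two limits combine, using the compute capability 1.3 numbers from this thread. It deliberately ignores register and shared-memory usage, which can reduce the number of resident blocks further, and the `waves` helper is a hypothetical simplification that assumes every SM stays full and all blocks take equally long.

```python
import math

MAX_BLOCKS_PER_SM = 8       # compute capability 1.x
MAX_THREADS_PER_SM = 1024   # compute capability 1.2 / 1.3
NUM_SMS = 30                # Tesla C1060

def resident_blocks_per_sm(threads_per_block):
    """Blocks one SM can hold, limited by both the block cap and the thread cap."""
    return min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // threads_per_block)

def waves(grid_blocks, threads_per_block):
    """Rough number of rounds needed to work through a whole grid."""
    per_device = NUM_SMS * resident_blocks_per_sm(threads_per_block)
    return math.ceil(grid_blocks / per_device)

# threads/block, blocks/SM, threads/SM, threads on the whole device
for tpb in (64, 128, 256, 512):
    blocks = resident_blocks_per_sm(tpb)
    print(tpb, blocks, blocks * tpb, NUM_SMS * blocks * tpb)
```

This also shows why a grid may contain far more blocks than the device can hold at once: blocks that do not fit simply wait in the queue until an SM has room, so a grid of 65535 blocks of 512 threads is processed in many rounds rather than all at once.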

zeus13i · March 19, 2010, 12:22am

Good explanation, thanks.

But I’m still curious as to how many threads are actually executed simultaneously?

LSChien · March 19, 2010, 4:27am

Physically speaking, one SM can execute one warp (32 threads) at a time, leaving dual issue aside.

The Tesla C1060 has 30 SMs, which means that 30 warps can be executed simultaneously.
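To put the different notions of "concurrent" side by side, here is a small sketch using the Tesla C1060 numbers from this thread. The variable names are illustrative, and dual issue is left out as above.

```python
NUM_SMS = 30
SPS_PER_SM = 8
WARP_SIZE = 32
MAX_THREADS_PER_SM = 1024

# Threads that are resident (scheduled and holding resources) on the device.
resident_threads = NUM_SMS * MAX_THREADS_PER_SM

# Threads belonging to the warps currently in execution: one warp per SM.
threads_in_executing_warps = NUM_SMS * WARP_SIZE

# Threads actually processed in a single clock cycle: the "240 cores".
lanes_per_cycle = NUM_SMS * SPS_PER_SM

print(resident_threads, threads_in_executing_warps, lanes_per_cycle)
```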

polotenchiko · November 28, 2011, 7:34pm

Can I add one more question?

Why is there a hardware limitation such as the maximum number of resident blocks per SM?

I still cannot find an answer on Google.

tera · November 28, 2011, 11:37pm

Because more resident blocks require additional hardware, and there appears to be little gain past 8 blocks/SM which could justify the additional spending.

polotenchiko · November 29, 2011, 11:16am

I understand that it is a hardware limitation, but I was wondering what exactly it is =) Why is it not possible to have 16 resident blocks with 32 threads per block, when it is possible to have 8 resident blocks with 64 threads per block?

So, as I understand it, switching between warps from different blocks is more expensive than switching between warps from the same block?

Surely it's not a realistic case, I'm just interested =)

tera · November 29, 2011, 12:47pm

The scheduler works more efficiently if blocks have an even number of warps, so 64 threads per block is kind of a lower limit for a useful block size. And 8*64=512 threads per SM is already in the ballpark of where one wants to be, so it may not be worthwhile to increase the number of resident blocks per SM.

Not that I have full insight into the consideration, or understand why Nvidia found it more worthwhile to increase the number of cores per SM to a register-bandwidth-starved 48…
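The 8*64=512 figure can be put in terms of occupancy (resident warps divided by the maximum resident warps per SM). A rough sketch, assuming the compute capability 1.3 limits quoted earlier in the thread and ignoring register and shared-memory constraints; the `occupancy` function is a simplified illustration, not NVIDIA's occupancy calculator:

```python
WARP_SIZE = 32
MAX_BLOCKS_PER_SM = 8
MAX_WARPS_PER_SM = 32   # compute capability 1.2 / 1.3 (1024 threads)

def occupancy(threads_per_block):
    """Fraction of the SM's warp slots filled for a given block size."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    blocks = min(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps_per_block)
    return blocks * warps_per_block / MAX_WARPS_PER_SM

for tpb in (32, 64, 128):
    print(tpb, occupancy(tpb))
```

Under these assumptions, 32-thread blocks can fill only a quarter of the warp slots because of the 8-block cap, 64-thread blocks fill half, and 128-thread blocks fill all of them.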

polotenchiko · November 30, 2011, 12:47am

Seems like an even number of warps works better because of the double instruction fetch on compute capability 2.0 (and 2x2 on 2.1).

Thanks, seems legit =)

tera · November 30, 2011, 2:02am

Even on compute capability 1.x devices, even numbers of warps work better because of register banking issues (this is not officially documented, apart from the fact that block sizes which are multiples of 64 work better).
