confusion of basic concepts

xinwu · May 16, 2011, 8:21am

Hi, everyone!

I’m confused about the concepts of “maximum allowed threads per block” and “maximum allowed threads per SM”. Are they equivalent or not?

I learnt that more than one block can be assigned to an SM if the other resources allow it, so I don’t think they are equal.

Does anyone have suggestions or comments to help me clarify these concepts?

My question might be confusing as well, so let me make it clearer.

I can find the “maximum allowed threads per block” by printing “abc.maxThreadsPerBlock”. Could anyone tell me how to find the “maximum allowed threads per SM” and the “maximum number of blocks that can be assigned to an SM” at run time?

Thank you in advance!

hyqneuron · May 17, 2011, 4:12pm

I was hoping somebody else would answer the question. Well, since it’s been some time, I’ll just give some input anyway.

The question you are asking is trivial. Rarely do we get multiple blocks to run on the same SM concurrently. In practice we do not usually think about this case.

I am under the impression that my Visual Profiler shows a number for Maximum blocks per SM, which is 10. The card for which the figure is shown is a GTX 460.

I believe the maximum number of threads per SM would also be 1024, the same as maxThreadsPerBlock.

To help you a bit more, it should be noted that in most cases an SM does not run 2 (or more) blocks concurrently. One block is finished before another block is scheduled to start on the SM.

tera · May 17, 2011, 4:35pm

Why do you think so? For me it’s the other way around - only in very rare cases do I run only one block per SM. If you check appendix F.1 of the (4.0) Programming Guide, the maximum number of threads per SM is 768/1024/1536 for sm_1[01]/sm_1[23]/sm_2x devices, which is always larger than the 512/1024 threads per block for sm_1x/sm_2x devices. The maximum number of blocks per SM is 8 for all compute capabilities so far.
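As a rough illustration of how these limits combine, here is a minimal host-only C++ sketch. It only considers the thread and block limits quoted above; registers and shared memory can reduce the result further. The struct and function names are hypothetical, and the rounding to whole warps of 32 threads is an assumption about how thread slots are counted.

```cpp
#include <algorithm>
#include <cassert>

// Per-SM limits. Field names are hypothetical, chosen for this sketch.
struct SmLimits {
    int max_threads_per_sm;     // e.g. 1536 on sm_2x
    int max_blocks_per_sm;      // 8 for the compute capabilities quoted above
    int max_threads_per_block;  // e.g. 1024 on sm_2x
};

// Upper bound on the number of blocks that can be resident on one SM at once,
// looking only at the thread and block limits (not registers or shared memory).
int max_resident_blocks(const SmLimits& lim, int threads_per_block) {
    assert(threads_per_block > 0);
    assert(threads_per_block <= lim.max_threads_per_block);

    // Assumption: thread slots are consumed in whole warps of 32 threads.
    const int warp_size = 32;
    int warps = (threads_per_block + warp_size - 1) / warp_size;
    int threads_reserved = warps * warp_size;

    int limit_by_threads = lim.max_threads_per_sm / threads_reserved;
    return std::min(lim.max_blocks_per_sm, limit_by_threads);
}
```

With the sm_2x numbers (1536 threads per SM, 8 blocks per SM), blocks of 256 threads are limited by the thread count, while blocks of 64 threads hit the 8-block limit first. A block of 1024 threads leaves room for only one resident block.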

hyqneuron · May 18, 2011, 3:46am

Sorry for the wrong numbers given.

As for concurrent blocks on the same SM, I think it’s not decided by us. It’s decided by the driver or the hardware. I remember running 8 blocks of a single thread (or maybe a single warp…) on my GTX 460, yet in the end each SM only ran a single block at a time. Still remember the tests I did with my weird 460?

EDIT: sorry, just looked at that post again and realised I was using 16 warps. I will try with a single warp to see what happens. But we usually do not launch small blocks, do we?

xinwu · May 18, 2011, 7:43am

Tera, thank you!

I’m reading Hwu’s book. At the end of Chapter 5, he mentions several limitations of hardware resources, and Chapter 4 states that “Up to 8 blocks can be assigned to each SM in the GT200 design as long as there are enough resources to satisfy the needs of all of the blocks”. That is what prompted my question.

tera · May 18, 2011, 9:27am

To run more than one block per SM you of course need more blocks than SMs. Otherwise each block will be scheduled on its own SM to give maximal performance.
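The counting argument behind this can be sketched in a few lines of host-only C++. The function name is hypothetical and the numbers in the usage note are illustrative. The sketch gives only a lower bound: it does not model how the hardware actually distributes blocks, nor whether blocks sharing an SM run concurrently or one after another.

```cpp
#include <cassert>

// Lower bound on how many blocks the busiest SM must process over a whole
// launch: with num_blocks spread over num_sms, some SM gets at least
// ceil(num_blocks / num_sms) of them.
int min_blocks_on_busiest_sm(int num_blocks, int num_sms) {
    assert(num_blocks >= 0);
    assert(num_sms > 0);
    return (num_blocks + num_sms - 1) / num_sms;  // integer ceiling division
}
```

For example, 8 blocks on a device with 2 SMs force at least 4 blocks onto one SM, whereas 2 blocks on 2 SMs allow each block to have an SM to itself.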

Sarnath · May 18, 2011, 10:17am

I agree with Tera… Rarely do I write code such that the entire SM is occupied by only 1 block at run-time. It all depends on the kind of kernels that you are developing though.

xinwu · May 18, 2011, 12:55pm

Sure. I have a laptop with GeForce GT 525M, which has only 2 SMs.

tera · May 18, 2011, 7:16pm

Sorry, that comment was meant toward hyqneuron, who ran 8 blocks on a GTX 460.

Looking at the thread again, however, I realize that it has nothing to do with running blocks on an SM in parallel; it is only concerned with the overall distribution of blocks between SMs (whether executed serially or in parallel).
