I’m confused about the concepts of “maximum allowed threads per block” and “maximum allowed threads per SM”. Are they equivalent or not?
I learnt that more than one block can be assigned to an SM if the other resources allow it, so I don’t think they are equal.
Does anyone have suggestions and/or comments to help me clarify these concepts?
My question might be confusing as well. Let me make it clearer.
I can find out the “maximum allowed threads per block” by printing “abc.maxThreadsPerBlock”. Could anyone tell me how to find out the “maximum allowed threads per SM” and the “maximum number of blocks that can be assigned to an SM” at run time?
I was hoping somebody else could answer the question. Well, since it’s been some time, I’ll just give some input anyway.
The question you are asking is trivial. Rarely do we get multiple blocks to run on the same SM concurrently. In practice we do not usually think about this case.
I am under the impression that my Visual Profiler shows a number for Maximum blocks per SM, which is 10. The card for which the figure is shown is a GTX 460.
I believe the maximum number of threads per SM would also be 1024, the same as maxThreadsPerBlock.
To help you a bit more, it should be noted that in most cases, an SM does not run 2 (or more) blocks concurrently. One block is finished before another block is scheduled to start on the SM.
Why do you think so? For me it’s the other way around - only in very rare cases do I run only one block per SM. If you check appendix F.1 of the (4.0) Programming Guide, the maximum number of threads per SM is 768/1024/1536 for sm_1[01]/sm_1[23]/sm_2x devices, which is always larger than the 512/1024 threads per block for sm_1x/sm_2x devices. The maximum number of blocks per SM is 8 for all compute capabilities so far.
As for concurrent blocks on the same SM, I think it’s not decided by us. It’s decided by the driver or the hardware. I remember running 8 blocks of a single thread (or maybe a single warp…) on my GTX 460, yet in the end each SM only ran a single block at a time. Still remember the tests I did with my weird 460?
EDIT: sorry, just looked at that post again and realised I was using 16 warps. I will try it with a single warp to see what happens. But we usually do not launch small blocks, do we?
I’m reading Hwu’s book. At the end of Chapter 5, he mentions several limitations of hardware resources. Chapter 4 of the book also says that “Up to 8 blocks can be assigned to each SM in the GT200 design as long as there are enough resources to satisfy the needs of all of the blocks”. That is why I came up with the question.
To run more than one block per SM you of course need more blocks than SMs. Otherwise each block will be scheduled on its own SM to give maximal performance.
I agree with Tera… Rarely do I write code such that the entire SM is occupied by only 1 block at run-time. It all depends on the kind of kernels that you are developing though.
Sorry, that comment was meant toward hyqneuron, who ran 8 blocks on a GTX 460.
Looking at the thread again, however, I realize that it has nothing to do with running blocks on an SM in parallel; it is only concerned with the overall distribution of blocks between SMs (whether executed serially or in parallel).