Block numbers related to the number of SMs: blocks in multiple SMs

Hi all,

While I was trying to write a CUDA program, I ran into a question about the number of blocks across multiple SMs.

From my understanding, each SM can have up to 8 blocks. The GeForce GTX 285 has 30 multiprocessors. The Programming Guide says that a device with more multiprocessors will automatically execute a kernel grid in less time than a device with fewer multiprocessors. Does this mean that I can assign more than 8 blocks when I decide the number of blocks? For example,

[codebox]dim3 dimBlock(256);

dim3 dimGrid(16);[/codebox]

then the number of threads is 256*16=4096 and the number of blocks is 16. Both exceed the per-SM constraints of 1024 threads and 8 blocks. Will the 30 SMs automatically take blocks while keeping within the constraints? I am so confused… Could anyone help me to understand this?

Thank you for reading. I appreciate your time!

Chulho

You can have as many blocks as you need, as long as you keep the grid size within 65535x65535. As long as the per-block resource requirements of your kernel and execution parameters don’t exceed the per-multiprocessor limits described in Appendix A of the programming guide, the GPU will just keep running blocks until all are executed. Not all blocks in a grid have to run at the same time.
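To make that concrete, here is a minimal sketch of a launch that uses far more blocks than the SMs can hold at once. The kernel `scale`, the array size, and the scale factor are made up for illustration. It also assumes a toolkit recent enough to have `cudaDeviceSynchronize` (older toolkits used `cudaThreadSynchronize`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: each thread scales one element.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the last block may be only partly used
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;                  // 1,048,576 elements (assumed size)
    const int threadsPerBlock = 256;
    // Round up so every element is covered: 4096 blocks here,
    // far more than can be resident on the SMs at one time.
    const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);      // query device 0
    printf("%d multiprocessors, launching %d blocks of %d threads\n",
           prop.multiProcessorCount, numBlocks, threadsPerBlock);

    float *d_data = NULL;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // The hardware hands blocks to SMs as resources free up;
    // the host code does not assign blocks to SMs itself.
    scale<<<numBlocks, threadsPerBlock>>>(d_data, n, 2.0f);

    cudaError_t err = cudaGetLastError();   // launch-configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // errors raised during execution
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```

The launch only needs the block size to fit the per-block limits and the grid to fit the grid-size limit. How many blocks end up resident on each SM at once is decided by the hardware.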

NVIDIA provides an occupancy calculator spreadsheet which lets you see what effect different kernel resource requirements and execution parameters will have on the occupancy of the GPU. There is a link in a sticky thread in the programming and development forum where you can download it, if you don’t already have a copy in the SDK.
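As a rough illustration of the kind of arithmetic the spreadsheet does, here is a host-only sketch that compiles as plain C++ or with nvcc. The helper names are hypothetical. It considers only the thread and block limits quoted in the question (1024 threads and 8 blocks per SM) and ignores registers and shared memory, which the real calculator also takes into account. The "waves" figure is a simplification too: the hardware starts new blocks as old ones finish rather than in lockstep rounds.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: how many blocks can be resident on one SM,
// considering only the thread-count and block-count limits.
int residentBlocksPerSM(int threadsPerBlock, int maxThreadsPerSM, int maxBlocksPerSM)
{
    assert(threadsPerBlock > 0);
    int byThreads = maxThreadsPerSM / threadsPerBlock;  // limit from threads
    return std::min(byThreads, maxBlocksPerSM);         // limit from blocks
}

// Hypothetical helper: rough number of rounds needed to run the whole grid.
// Returns -1 if no block fits on an SM at all.
int wavesNeeded(int gridBlocks, int residentPerSM, int numSMs)
{
    int concurrent = residentPerSM * numSMs;  // blocks in flight device-wide
    if (concurrent <= 0)
        return -1;
    return (gridBlocks + concurrent - 1) / concurrent;  // ceiling division
}
```

With the numbers from the question, `residentBlocksPerSM(256, 1024, 8)` gives 4, so 30 SMs can hold 120 blocks at once and a 16-block grid fits in a single round. A 4096-block grid would need about 35 rounds under the same assumptions.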