While writing a CUDA program, I ran into a question about how blocks are distributed across multiple SMs.
From my understanding, each SM can have up to 8 resident blocks. The GeForce GTX 285 has 30 multiprocessors. The Programming Guide says that a device with more multiprocessors will automatically execute a kernel grid in less time than a device with fewer multiprocessors. Does this mean that I can use more than 8 blocks when I choose the number of blocks for a launch? For example, suppose I launch 16 blocks of 256 threads each.
Then the total number of threads is 256*16 = 4096 and the number of blocks is 16. Both exceed the limits of 1024 threads per SM and 8 blocks per SM. Will the 30 SMs automatically pick up blocks while each one stays within those limits? I am so confused… Could anyone help me understand this?
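To make the arithmetic concrete, here is a minimal sketch of the kind of launch I mean. The kernel name `myKernel`, the helper functions, and the hard-coded limits (8 blocks and 1024 threads per SM, 30 SMs) are my own assumptions for illustration, and the helpers ignore register and shared-memory usage, which could lower the numbers further.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Per-SM limits I am assuming for the GTX 285; not queried from the device.
constexpr int kMaxBlocksPerSM  = 8;
constexpr int kMaxThreadsPerSM = 1024;
constexpr int kNumSMs          = 30;

// Blocks one SM could hold at once for a given block size, if the limits
// above apply per SM (ignores register and shared-memory pressure).
constexpr int residentBlocksPerSM(int threadsPerBlock) {
    return (kMaxThreadsPerSM / threadsPerBlock) < kMaxBlocksPerSM
               ? (kMaxThreadsPerSM / threadsPerBlock)
               : kMaxBlocksPerSM;
}

// Blocks the whole device could hold at once under the same assumption.
constexpr int residentBlocksOnDevice(int threadsPerBlock) {
    return kNumSMs * residentBlocksPerSM(threadsPerBlock);
}

// Hypothetical kernel: each thread writes its global index.
__global__ void myKernel(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}

int main() {
    const int numBlocks = 16;
    const int threadsPerBlock = 256;
    const int n = numBlocks * threadsPerBlock;  // total threads in the grid

    printf("blocks per SM (assumed limits): %d\n",
           residentBlocksPerSM(threadsPerBlock));
    printf("blocks on device (assumed limits): %d\n",
           residentBlocksOnDevice(threadsPerBlock));

    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // The grid size here is larger than the per-SM block limit.
    myKernel<<<numBlocks, threadsPerBlock>>>(d_out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

If the limits really are per SM rather than per grid, then by this count one SM could hold 4 such blocks at a time and the 16 blocks would simply be spread over the SMs, but that is exactly the part I am unsure about.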
Thank you for reading. I appreciate your time!