Mapping of Blocks to MPs / Threads to MPs

Hi everybody,

I am running my CUDA code on a GeForce 210 and I was wondering about the mapping of blocks to multiprocessors or, better, of threads to multiprocessors:

1.) Running deviceQuery revealed that the maximum number of blocks per grid (one-dimensional grid) is 65535. So if I start a kernel with the maximum number of blocks ( kernel <<<65535, X>>> (param1, param2, …, paramN); ), how are they mapped to the MPs? I read that at most 8 blocks might be processed concurrently by 1 MP.

2.) If I start a kernel with the maximum number of blocks and a block size of 1 ( kernel <<<65535, 1>>> (param1, param2, …, paramN); ), does that mean that internally the hardware will not care about the block size of 1, run 32 threads anyway, and just disregard the calculations of the other 31 threads?

I am pretty new to CUDA so sorry if the answers to my questions seem obvious.

1. Each MP can run more than one block at a time, but there is a cap on the total number of threads that can be active on it: 1536 for cc 2.0 and 2048 for cc 3.x. Blocks that do not fit wait and are scheduled onto an MP as earlier blocks finish. So on cc 2.0 one would get higher occupancy in some cases by using a block size of 512 (3 blocks active) as opposed to a block size of 1024 (only 1 block active per MP).

For my codes I just changed the block size and chose the one which was faster.
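
To make the 512 vs. 1024 comparison concrete, here is a rough sketch of the arithmetic in plain host-side C++. The struct MpLimits and the functions residentBlocks and threadOccupancy are names I made up for illustration, not part of the CUDA API. The 8-blocks-per-MP limit is taken from the question and is only an assumption here. Real occupancy also depends on registers and shared memory per block, which this sketch ignores.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for per-MP limits; fill it in from deviceQuery
// or the programming guide for your compute capability.
struct MpLimits {
    int maxThreadsPerMp;  // e.g. 1536 on cc 2.0, 2048 on cc 3.x
    int maxBlocksPerMp;   // assumed 8 here, as mentioned in the question
};

// Number of blocks that can be resident on one MP at the same time,
// considering only the thread limit and the block limit.
int residentBlocks(const MpLimits& lim, int blockSize) {
    assert(blockSize > 0);
    int byThreads = lim.maxThreadsPerMp / blockSize;  // integer division
    return std::min(byThreads, lim.maxBlocksPerMp);
}

// Fraction of the MP's thread slots that are filled by resident blocks.
double threadOccupancy(const MpLimits& lim, int blockSize) {
    int activeThreads = residentBlocks(lim, blockSize) * blockSize;
    return static_cast<double>(activeThreads) / lim.maxThreadsPerMp;
}
```

With the cc 2.0 numbers (1536 threads, 8 blocks), a block size of 512 gives 3 resident blocks and fills every thread slot, while a block size of 1024 gives 1 resident block and fills about two thirds of them. A block size of 1 is capped by the block limit, so only 8 threads are active per MP.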
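
2. A block with 1 thread will still behave the same as a block with 32. The hardware executes threads in groups of 32 (warps), so a one-thread block still occupies a full warp and the other 31 lanes simply stay idle. The thread object is only a representation of what happens at the hardware level. It is like an assembly line which has only 1 object on it.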
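
The 1-thread case can be worked out with a bit of arithmetic, again as a plain C++ sketch rather than anything from the CUDA API. The names kWarpSize, warpsForBlock and idleLanes are made up for illustration. The value 32 is the warp size the question already refers to.

```cpp
#include <cassert>

// Warp size of 32 threads, as referred to in the question.
constexpr int kWarpSize = 32;

// Number of warps the hardware needs for one block: the block size
// rounded up to a whole number of warps.
int warpsForBlock(int blockSize) {
    assert(blockSize > 0);
    return (blockSize + kWarpSize - 1) / kWarpSize;
}

// Lanes that are allocated to the block but carry no thread.
int idleLanes(int blockSize) {
    return warpsForBlock(blockSize) * kWarpSize - blockSize;
}
```

For a block size of 1 this gives 1 warp with 31 idle lanes, for 32 it gives 1 warp with none idle, and for 33 it gives 2 warps with 31 idle lanes again. That is why block sizes that are multiples of 32 are the usual choice.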