Per Block/Multiprocessor


I’m a little bit confused about the following:

The cudaDeviceProp documentation speaks of registers per BLOCK and shared memory per BLOCK,
but all other sources, e.g. the CUDA/OpenCL programming guides, speak of registers per MULTIPROCESSOR, etc.

What's the right definition?

An MP can execute 8 blocks at the same time, right?

Running multiple blocks per multiprocessor can only decrease the resources available to each block, because the resident blocks share the multiprocessor's registers and shared memory. So the maximum number of registers and the maximum shared memory size are the same per multiprocessor and per block. The per-block maxima are only achievable if just one block is running on the multiprocessor.

So the expression “per multiprocessor” is right.
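
To make the trade-off concrete, here is a minimal sketch in plain C++ of how per-block resource usage limits the number of blocks that fit on one multiprocessor. The struct `SmLimits`, the helper names, and the numbers (8192 registers, 16 KB shared memory, 8 resident blocks) are assumptions chosen for illustration, not values read from a device. Real hardware also allocates registers and shared memory in chunks, which this sketch ignores.

```cpp
// Per-multiprocessor limits. The numbers used below are assumed example
// values; query cudaDeviceProp for the real ones on your hardware.
struct SmLimits {
    int registers;     // registers available on one multiprocessor
    int shared_bytes;  // shared memory available on one multiprocessor
    int max_blocks;    // hardware cap on resident blocks per multiprocessor
};

constexpr int min_int(int a, int b) { return a < b ? a : b; }

// How many blocks one resource allows. A block that uses none of the
// resource is limited only by the hardware cap.
constexpr int limit_by(int available, int per_block, int cap) {
    return per_block <= 0 ? cap : min_int(available / per_block, cap);
}

// Resident blocks are bounded by whichever resource runs out first.
constexpr int resident_blocks(SmLimits sm, int regs_per_block,
                              int shared_per_block) {
    return min_int(limit_by(sm.registers, regs_per_block, sm.max_blocks),
                   limit_by(sm.shared_bytes, shared_per_block, sm.max_blocks));
}

// Hypothetical multiprocessor used for the checks below.
constexpr SmLimits kExampleSm = {8192, 16384, 8};

// A block that takes the full per-block maximum leaves room for no other block.
static_assert(resident_blocks(kExampleSm, 8192, 16384) == 1,
              "one block uses the whole multiprocessor");
// Half the registers per block: registers allow 2 blocks, shared memory 4.
static_assert(resident_blocks(kExampleSm, 4096, 4096) == 2,
              "registers are the limiting resource");
// Small blocks hit the hardware cap on resident blocks instead.
static_assert(resident_blocks(kExampleSm, 512, 1024) == 8,
              "block-count cap is the limit");
```

The checks run at compile time, so the file only builds if the arithmetic holds.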

When running only one block per MP, you can reach a maximum occupancy of only 66.7%, since an MP can hold 768 or 1536 resident threads (depending on the GPU generation), but a block can contain at most 512 or 1024 threads, respectively.
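
The 66.7% figure follows from dividing the resident threads by the multiprocessor's thread capacity. A minimal sketch of that arithmetic, using only the thread counts quoted above (the helper names `occupancy` and `near` are made up for this example):

```cpp
// Occupancy as the fraction of a multiprocessor's thread slots that are
// filled by resident blocks.
constexpr double occupancy(int blocks_per_sm, int threads_per_block,
                           int max_threads_per_sm) {
    return static_cast<double>(blocks_per_sm * threads_per_block) /
           max_threads_per_sm;
}

// Helper for comparing floating-point results with a small tolerance.
constexpr bool near(double x, double target) {
    return x > target - 0.001 && x < target + 0.001;
}

// One maximum-size block per multiprocessor: 512/768 and 1024/1536.
static_assert(near(occupancy(1, 512, 768), 0.667), "512 of 768 threads");
static_assert(near(occupancy(1, 1024, 1536), 0.667), "1024 of 1536 threads");

// Three smaller blocks of 256 threads fill all 768 slots.
static_assert(near(occupancy(3, 256, 768), 1.0), "3 x 256 of 768 threads");
```

So full occupancy needs more than one block per multiprocessor, which in turn means each block has to use less than the per-block maximum of registers and shared memory.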