I’m a little bit confused about the following:
The cudaDeviceProp documentation talks about registers per BLOCK and shared memory per BLOCK,
but all other sources, e.g. the CUDA/OpenCL programming guides, talk about registers per MULTIPROCESSOR, etc.
Which definition is right?
An MP can execute 8 blocks at the same time, right?