Registers, occupancy and number of threads


Each of my threads has to do some computations on one 4x4 float matrix (1 matrix per thread). So I need 64 B per thread, i.e. sixteen 32-bit registers, just for the matrices.

There are 8192 32-bit registers per multiprocessor, so I am limited to only 512 threads per MP! To optimize processor occupancy, I am supposed to split them across several blocks, at least 2. In the end, my number of threads per block seems ridiculously small, since I expect to need a few other variables in my kernel. Am I making a mistake somewhere?
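For reference, the arithmetic behind the 512 figure can be written out as a small host-side sketch. The function names are made up for illustration, the extra register counts (8 and 16) are hypothetical, and real hardware allocates registers with some granularity, so treat the results as upper bounds only.

```cuda
#include <cassert>
#include <cstdio>

// Figure quoted in the post: 8192 32-bit registers per multiprocessor (MP).
const int kRegistersPerMP = 8192;

// Number of 32-bit registers needed to hold `bytes` bytes of per-thread data.
inline int registersForBytes(int bytes) {
    assert(bytes > 0);
    return (bytes + 3) / 4;
}

// Upper bound on resident threads per MP imposed by the register file alone.
inline int maxThreadsByRegisters(int registersPerThread) {
    assert(registersPerThread > 0);
    return kRegistersPerMP / registersPerThread;
}

// Prints the bound for the matrix alone and with a few extra registers.
// The extra counts (0, 8, 16) are hypothetical, not measured values.
inline void printRegisterBudget() {
    const int matrixRegs = registersForBytes(4 * 4 * static_cast<int>(sizeof(float)));
    for (int extra = 0; extra <= 16; extra += 8) {
        const int regs = matrixRegs + extra;
        std::printf("%2d registers/thread -> at most %d threads per MP\n",
                    regs, maxThreadsByRegisters(regs));
    }
}
```

Calling `printRegisterBudget()` from `main` shows how quickly the bound drops once the kernel needs registers beyond the sixteen that hold the matrix.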



  1. Occupancy is not everything. At least 192 threads per MP is enough to hide the pipeline depth.
  2. You can stage the data in shared memory.
  3. Do you need to access each element more than once? If not, you probably don't need a register for every element.

By the way, the number of threads per MP is not the same as the number of threads per block. You seem to be mixing up the two.
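To illustrate point 2, here is a minimal sketch of staging one 4x4 matrix per thread in shared memory instead of registers. Everything named here is an assumption made up for the example: the kernel `squareMatrices`, the host helper `squareOnDevice`, the block size of 64, and the matrix-squaring computation, which simply stands in for whatever your kernel really does. Error checking is left out to keep it short.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

const int kThreadsPerBlock = 64;  // assumed block size: 64 threads * 64 B = 4 KB of shared memory

// Each thread squares its own 4x4 row-major matrix (a stand-in computation).
__global__ void squareMatrices(const float* in, float* out, int numMatrices) {
    // Element e of thread t is stored at tile[e][t], so the threads of a warp
    // touch consecutive words and avoid shared-memory bank conflicts.
    __shared__ float tile[16][kThreadsPerBlock];
    const int t = threadIdx.x;
    const int m = blockIdx.x * blockDim.x + t;
    if (m >= numMatrices) return;  // safe: no __syncthreads() is used below

    // Stage this thread's matrix. Each thread only ever reads its own column,
    // so no synchronization is needed. (These global loads are not coalesced;
    // a tuned version would load cooperatively.)
    for (int e = 0; e < 16; ++e) tile[e][t] = in[m * 16 + e];

    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            float acc = 0.0f;
            for (int k = 0; k < 4; ++k)
                acc += tile[r * 4 + k][t] * tile[k * 4 + c][t];
            out[m * 16 + r * 4 + c] = acc;
        }
    }
}

// Host helper: copies the matrices to the device, runs the kernel, returns the result.
std::vector<float> squareOnDevice(const std::vector<float>& host) {
    assert(host.size() % 16 == 0);
    const int numMatrices = static_cast<int>(host.size() / 16);
    const size_t bytes = host.size() * sizeof(float);
    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, host.data(), bytes, cudaMemcpyHostToDevice);
    const int blocks = (numMatrices + kThreadsPerBlock - 1) / kThreadsPerBlock;
    squareMatrices<<<blocks, kThreadsPerBlock>>>(dIn, dOut, numMatrices);
    std::vector<float> result(host.size());
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

The trade-off is that shared memory is also a per-MP resource, so the 64 B per thread now limits how many blocks fit by shared memory rather than by registers.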

1/- OK, could you explain this in more detail?

2/- Sure, I can mix registers and shared memory.

3/- Yes, I need to.

As for the number of threads per MP, of course it is not the same as the number of blocks per MP. That is one reason for my post: to optimize processor occupancy, it is highly recommended to run at least 2 blocks per multiprocessor, which reduces the number of threads per block. I am having difficulty choosing the best settings. I suppose it depends on the latency of my kernels and the number of divergent branches.



The programming guide states that you need at least 192 threads per multiprocessor to hide read-after-write dependencies. This has to do with the pipeline depth of the processing cores.
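As a rough way to check a candidate configuration against that 192-thread figure, the sketch below estimates how many blocks fit on one MP from the register budget alone. The function names and the sample register counts are hypothetical, and the estimate ignores shared memory usage, the hardware caps on blocks and threads per MP, and register allocation granularity.

```cuda
#include <cassert>

const int kRegistersPerMP = 8192;     // figure quoted earlier in the thread
const int kThreadsToHideLatency = 192; // figure from the programming guide

// Blocks that fit on one MP when only the register file is considered.
inline int blocksPerMP(int regsPerThread, int threadsPerBlock) {
    assert(regsPerThread > 0 && threadsPerBlock > 0);
    return kRegistersPerMP / (regsPerThread * threadsPerBlock);
}

// Threads resident on one MP: whole blocks only, so this can be less than
// kRegistersPerMP / regsPerThread.
inline int residentThreadsPerMP(int regsPerThread, int threadsPerBlock) {
    return blocksPerMP(regsPerThread, threadsPerBlock) * threadsPerBlock;
}

// True if the configuration keeps at least 192 threads resident per MP.
inline bool hidesPipelineDepth(int regsPerThread, int threadsPerBlock) {
    return residentThreadsPerMP(regsPerThread, threadsPerBlock) >= kThreadsToHideLatency;
}
```

For example, with a hypothetical 20 registers per thread, 128 threads per block gives 3 blocks and 384 resident threads per MP, while 256 threads per block gives only 1 block and 256 resident threads. Both are above 192, but only the first leaves more than one block per MP.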

For the rest there are some general guidelines that are good to live by if possible:

  • Launch lots of blocks.
  • Try to use template kernel functions that take the number of threads per block as a template parameter. That way you can benchmark which block size is optimal for your kernel (see the reduction example in the SDK).
  • Use shared memory when appropriate.
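The template suggestion above could look roughly like the following sketch. It is not the SDK reduction example: `scaleKernel`, `timeLaunch` and `benchmarkBlockSizes` are hypothetical names, the kernel is a trivial stand-in, and the block sizes tried (64, 128, 256) are arbitrary. Error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Stand-in kernel: doubles every element. THREADS is the block size and is
// known at compile time, so the compiler can specialize the code for it.
template <int THREADS>
__global__ void scaleKernel(float* data, int n) {
    const int i = blockIdx.x * THREADS + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Launches scaleKernel<THREADS> once and returns the elapsed time in ms.
template <int THREADS>
float timeLaunch(float* dData, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const int blocks = (n + THREADS - 1) / THREADS;
    cudaEventRecord(start);
    scaleKernel<THREADS><<<blocks, THREADS>>>(dData, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Times the kernel for a few block sizes and returns the first element of the
// data afterwards (1.0 doubled three times) as a simple sanity check.
float benchmarkBlockSizes(int n) {
    assert(n > 0);
    std::vector<float> host(n, 1.0f);
    float* dData = nullptr;
    cudaMalloc(&dData, n * sizeof(float));
    cudaMemcpy(dData, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    std::printf(" 64 threads/block: %f ms\n", timeLaunch<64>(dData, n));
    std::printf("128 threads/block: %f ms\n", timeLaunch<128>(dData, n));
    std::printf("256 threads/block: %f ms\n", timeLaunch<256>(dData, n));
    cudaMemcpy(host.data(), dData, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    return host[0];
}
```

A single launch is a noisy measurement, so in practice you would probably do a warm-up launch and average several runs per block size before comparing.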