How will blocks be distributed among SMs?

Say I have 30 blocks with 32 threads in each block, and the register and shared memory requirements of each thread are minimal, so 8 blocks may be placed on one SM.

Considering a GTX 280: will these 30 blocks be distributed among its 30 SMs, or will only four SMs be involved (1st: 8 blocks, 2nd: 8 blocks, 3rd: 8 blocks, 4th: 6 blocks)?

Is this controllable from the programmer’s side?

It’s not controllable from the programmer’s side, and to be honest I don’t know how the hardware schedules blocks.

This question has practical importance when choosing launch parameters for input data that is not big enough to fill all SMs with any combination of block/grid sizes. Which option should one choose: many blocks with a small number of threads, or fewer blocks with the maximum possible number of threads in each block?

My observations on my own tasks favor the second approach (maximum threads, fewer blocks), but I believe this is very task-dependent.
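For reference, a minimal sketch of how the two configurations can be timed against each other with CUDA events; the kernel myKernel, its workload, and the sizes here are purely illustrative:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;           // placeholder work
    }

    // Time one launch configuration with CUDA events.
    static float timeConfig(float *d_data, int n, int threadsPerBlock)
    {
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        myKernel<<<blocks, threadsPerBlock>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    int main()
    {
        const int n = 30 * 512;               // deliberately small problem
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        // Option 1: many small blocks; option 2: fewer, larger blocks.
        printf("32 threads/block:  %.3f ms\n", timeConfig(d_data, n, 32));
        printf("512 threads/block: %.3f ms\n", timeConfig(d_data, n, 512));

        cudaFree(d_data);
        return 0;
    }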

The total number of resident threads is what determines how well memory latency is hidden. To hide register read-after-write dependencies, you need 192 or more threads per multiprocessor, as suggested by the programming guide; with 32-thread blocks, that means at least six resident blocks per SM (6 × 32 = 192). So it really depends on how the time of your kernel is balanced between memory and arithmetic operations.

While thread block scheduling and assignment to multiprocessors are not defined (i.e. any order and assignment is correct), you can spread thread blocks across the multiprocessors with a simple “hack”:

  • a multiprocessor has 16 KB of shared memory (smem) available;
  • you can control how many thread blocks run per multiprocessor by forcing occupancy with smem requests (easily done with the third argument of the kernel launch configuration, <<<grid, block, smemBytes>>>). For example, if you request more than 8 KB, only one thread block will be assigned per multiprocessor.
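A minimal sketch of this smem trick; the kernel name and sizes are illustrative, and the larger-than-8 KB request is the only essential part:

    #include <cuda_runtime.h>

    // The kernel declares dynamic shared memory but need not use it;
    // the allocation alone limits how many blocks fit on an SM.
    __global__ void spreadKernel(float *out)
    {
        extern __shared__ float smem[];   // sized at launch time
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = (float)i;                // placeholder work
    }

    int main()
    {
        const int blocks = 30, threads = 32;
        float *d_out;
        cudaMalloc(&d_out, blocks * threads * sizeof(float));

        // Request more than 8 KB of dynamic smem per block: with 16 KB
        // per multiprocessor, two such blocks cannot coexist, so at most
        // one block is resident per SM and the 30 blocks spread across
        // the 30 SMs of a GTX 280.
        size_t smemBytes = 9 * 1024;
        spreadKernel<<<blocks, threads, smemBytes>>>(d_out);
        cudaDeviceSynchronize();

        cudaFree(d_out);
        return 0;
    }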

Paulius

Brilliant, I had not thought of such a trick :-) Thank you.