I would like to use cooperative groups and the grid-wide sync function (`grid.sync()`) with the maximum possible number of blocks.
As far as I understand from the documentation, every block must be resident on an SM at the same time; if some block cannot be resident, the kernel will not be able to execute.
The number of blocks that can be resident depends on the amount of shared memory and the number of warps each block requires, but also (I suppose) on other activity on the GPU.
Considering all three of these factors, how can I calculate (presumably using the occupancy API) the maximum number of blocks?
For example, I want blocks of 512 threads with 8000 bits (1000 bytes) of shared memory each, and I want to calculate at runtime how many blocks I can launch so that they all fit in the grid at once.
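
To make the question concrete, here is a minimal sketch of what I have in mind, assuming the occupancy API is the right tool. The kernel `coopKernel`, its argument, and the constants are hypothetical placeholders for my real code. As far as I can tell, `cudaOccupancyMaxActiveBlocksPerMultiprocessor` only accounts for the kernel's own resource use (registers, shared memory, block size), not for other work already running on the GPU, so I treat the result as an upper bound and check the launch for errors.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Hypothetical kernel: each block publishes a value, the whole grid
// synchronizes, then each block reads its neighbour's value.
__global__ void coopKernel(int *data) {
    extern __shared__ unsigned char smem[];   // dynamic shared memory
    cg::grid_group grid = cg::this_grid();

    if (threadIdx.x == 0) {
        smem[0] = 1;                          // touch shared memory
        data[blockIdx.x] = blockIdx.x + smem[0];
    }
    grid.sync();                              // requires all blocks co-resident
    if (threadIdx.x == 0) {
        int next = (blockIdx.x + 1) % gridDim.x;
        smem[0] = (unsigned char)(data[next] & 0xFF);
    }
}

int main() {
    const int    threadsPerBlock = 512;
    const size_t smemBytes       = 1000;      // 8000 bits, as in the example

    int device = 0;
    cudaSetDevice(device);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    if (!prop.cooperativeLaunch) {
        printf("Device does not support cooperative launch\n");
        return 1;
    }

    // Resident blocks per SM for this kernel, block size and shared memory.
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSm, coopKernel, threadsPerBlock, smemBytes);

    // Upper bound on the cooperative grid size (assumes an otherwise idle GPU).
    int maxBlocks = blocksPerSm * prop.multiProcessorCount;
    printf("blocks/SM = %d, SMs = %d, max grid = %d\n",
           blocksPerSm, prop.multiProcessorCount, maxBlocks);

    int *dData = nullptr;
    cudaMalloc(&dData, maxBlocks * sizeof(int));

    void *args[] = { &dData };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void *)coopKernel, dim3(maxBlocks), dim3(threadsPerBlock),
        args, smemBytes, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        // e.g. cudaErrorCooperativeLaunchTooLarge if the grid does not fit
        printf("Cooperative launch failed: %s\n", cudaGetErrorString(err));
    }

    cudaFree(dData);
    return err == cudaSuccess ? 0 : 1;
}
```

Depending on the CUDA version, this may need to be compiled with `-rdc=true` for `grid.sync()` to link. What I am unsure about is whether `maxBlocks` computed this way is really safe when something else is using the GPU at the same time.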