Great question. The answer is that it is not possible to synchronize all threads when the grid is larger than the maximum number of threads that can be resident on the device at once (full occupancy). Therefore, you must determine a grid size that fits using the CUDA occupancy API and then launch the grid using cudaLaunchCooperativeKernel() (or cudaLaunchCooperativeKernelMultiDevice()), which returns an error if your grid size fails the occupancy check.
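For concreteness, here is a minimal sketch of that pattern (the kernel name, block size, and payload are placeholders, not from the original post): size the grid with the occupancy API, then launch cooperatively. Grid-wide sync requires compiling with relocatable device code (-rdc=true) on a GPU of compute capability 6.0 or newer.

```cuda
#include <cooperative_groups.h>
#include <cstdio>
namespace cg = cooperative_groups;

__global__ void kernel(int *data) {
    cg::grid_group grid = cg::this_grid();
    // ... phase 1 work ...
    grid.sync();  // synchronize every thread in the entire grid
    // ... phase 2 work ...
}

int main() {
    int numThreads = 256;        // placeholder block size
    int numBlocksPerSm = 0;
    // Ask the occupancy API how many blocks of this kernel fit per SM
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, kernel,
                                                  numThreads, 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int numBlocks = numBlocksPerSm * prop.multiProcessorCount;

    int *data;
    cudaMalloc(&data, numBlocks * numThreads * sizeof(int));

    void *args[] = { &data };
    // Returns cudaErrorCooperativeLaunchTooLarge if the grid cannot be
    // co-resident on the device
    cudaError_t err = cudaLaunchCooperativeKernel((void *)kernel,
                                                  dim3(numBlocks),
                                                  dim3(numThreads),
                                                  args, 0, 0);
    printf("launch: %s\n", cudaGetErrorString(err));
    cudaFree(data);
    return 0;
}
```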
I am looking forward to more examples of cooperative thread groups. Hopefully there will be examples of how to deal with differences between GPU generations. I still have Kepler K40s, but I would like to understand the differences so I can run on Pascal or Volta cards once I can upgrade ($$$). Some good examples of how to take advantage of Tensor Cores directly would also be interesting. Possibly a dumb question: I have some sparse matrix projects that use cuSPARSE, etc. I wonder whether they would benefit from Tensor Cores, or do those really prefer dense matrices?
LU factorization on GPUs?
Is the ballot function applicable to a cooperative thread group?
Yes. See https://docs.nvidia.com/cud...
You can create a thread_block_tile of any power-of-two size up to 32 (via cg::tiled_partition), and it exposes a .ballot() method (as well as shfl routines, any/all, etc.).
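A short sketch of what that looks like (the kernel and the counting logic are illustrative, not from this thread):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void count_positive(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cg::thread_block block = cg::this_thread_block();
    // Partition the block into tiles of 16 threads
    // (any power-of-two size up to 32 works)
    cg::thread_block_tile<16> tile = cg::tiled_partition<16>(block);

    int pred = (i < n && in[i] > 0);
    // Bitmask with one bit per thread in this tile whose predicate is true
    unsigned mask = tile.ballot(pred);
    if (tile.thread_rank() == 0)
        atomicAdd(out, __popc(mask));  // tile leader accumulates the count
}
```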