Cooperative Groups (Too many blocks in cooperative launch)

I’m trying to use cooperative groups on a 1080 (Pascal). Everything seems to work well until I start asking for more and more threads. I’m getting the following error while running in debug mode.

CUDA error at ../src/Particle_BPF_GPU.cu:660 code=82(cudaErrorCooperativeLaunchTooLarge) "cudaLaunchCooperativeKernel ( ( void* ) computeResampleIndex, gridSize, blockSize, params, 0, NULL )" 
========= Program hit cudaErrorCooperativeLaunchTooLarge (error 82) due to "too many blocks in cooperative launch" on CUDA API call to cudaLaunchCooperativeKernel. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x32f753]
=========     Host Frame:./Filters [0xab9e9]
=========     Host Frame:./Filters [0x58513]
=========     Host Frame:./Filters [0x38daf]
=========     Host Frame:./Filters [0x2de8c]
=========     Host Frame:./Filters [0x2216a]
=========     Host Frame:./Filters [0x1b155]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
=========     Host Frame:./Filters [0xc589]
=========
========= ERROR SUMMARY: 1 error
nicelyma@wd-nicelyma-lnx:~/cuda-workspace/Filters/Debug$ cuda-memcheck ./Filters -s 1 --mcs=1 --gpu -b 30848

I understand I’m using too many ‘active blocks’ and have no argument with that. What I don’t understand is how to do the math to know how many blocks and threads I can call beforehand.

The kernel is only using 32 registers and the card has 20 SMs. I’m asking for 128 threads per block, and cudaOccupancyMaxActiveBlocksPerMultiprocessor says I have 12 ‘active blocks’ available per SM.

Also, it does work when I ask for 30720 threads. The math is just not clicking…
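For reference, the numbers quoted above appear to be consistent with each other if the cooperative launch limit is taken to be (active blocks per SM) × (number of SMs), which is how the CUDA programming guide describes the co-residency requirement. A minimal host-side sketch of that arithmetic follows; the helper names are made up for illustration and are not part of the CUDA API, and the inputs are simply the figures reported in this post.

```cpp
// Hypothetical helpers (not CUDA API calls): plain host-side arithmetic only.

// Largest grid, in blocks, that can be co-resident on the device,
// assuming the limit is activeBlocksPerSM * numSMs.
constexpr int maxCooperativeBlocks(int numSMs, int activeBlocksPerSM) {
    return numSMs * activeBlocksPerSM;
}

// Number of blocks needed to cover totalThreads, rounded up.
constexpr int blocksNeeded(int totalThreads, int threadsPerBlock) {
    return (totalThreads + threadsPerBlock - 1) / threadsPerBlock;
}

// True if a launch covering totalThreads stays within the co-residency limit.
constexpr bool fitsCooperativeLaunch(int totalThreads, int threadsPerBlock,
                                     int numSMs, int activeBlocksPerSM) {
    return blocksNeeded(totalThreads, threadsPerBlock)
           <= maxCooperativeBlocks(numSMs, activeBlocksPerSM);
}

// Figures reported in this post: 20 SMs, 12 active blocks per SM, 128 threads per block.
//   limit         = 20 * 12     = 240 blocks (240 * 128 = 30720 threads)
//   30720 threads = 240 blocks  -> fits
//   30848 threads = 241 blocks  -> one block over the limit
```

Under that assumption, 30720 threads is exactly the largest launch that fits, and `-b 30848` asks for one block too many, which would explain the `cudaErrorCooperativeLaunchTooLarge` result.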

There are multiple possible occupancy limiters. Register usage is not the only thing to consider (shared memory usage is another example). I would suggest you start by studying and using the cudaOccupancy… method that is given in the reduction…CG sample code. If that does not resolve it, a complete example demonstrating the issue will probably be needed.
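As a rough illustration of that suggestion, the sketch below queries the occupancy for the exact kernel and block size, multiplies by the SM count, and clamps the grid to that limit before the cooperative launch. It is not the sample code itself: `gridSyncDemo` is a placeholder kernel standing in for the real one, and the thread count of 30848 is just the value from the failing run. The query should be made against the same build (debug or release) that is launched, since the limiters can differ between them. Older toolkits may also need `-rdc=true` for `grid.sync()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Placeholder kernel (an assumption for illustration, not the poster's kernel).
__global__ void gridSyncDemo(int *out, int n) {
    cg::grid_group grid = cg::this_grid();
    int stride = gridDim.x * blockDim.x;
    // Grid-stride loop, so a clamped grid still covers all n elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = i;
    grid.sync();  // grid-wide barrier: requires every block to be co-resident
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.cooperativeLaunch) {
        std::printf("device does not support cooperative launch\n");
        return 1;
    }

    const int threadsPerBlock = 128;
    const size_t dynamicSmemBytes = 0;

    // Ask the runtime how many blocks of this kernel fit on one SM.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, gridSyncDemo, threadsPerBlock, dynamicSmemBytes);
    const int maxBlocks = blocksPerSM * prop.multiProcessorCount;

    int n = 30848;  // value from the failing run
    const int wantedBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    const int gridBlocks = wantedBlocks < maxBlocks ? wantedBlocks : maxBlocks;

    int *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));

    void *params[] = { &d_out, &n };
    cudaError_t err = cudaLaunchCooperativeKernel(
        (void *)gridSyncDemo, dim3(gridBlocks), dim3(threadsPerBlock),
        params, dynamicSmemBytes, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    std::printf("blocks/SM=%d SMs=%d limit=%d launched=%d status=%s\n",
                blocksPerSM, prop.multiProcessorCount, maxBlocks, gridBlocks,
                cudaGetErrorString(err));
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Clamping the grid and using a grid-stride loop is one design choice. The alternative is to reject any request above the limit up front and report the maximum thread count to the user.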