I’m trying to use cooperative groups on a GTX 1080 (Pascal). Everything works until I start asking for more and more threads. Then I get the following error when running in debug mode:
CUDA error at ../src/Particle_BPF_GPU.cu:660 code=82(cudaErrorCooperativeLaunchTooLarge) "cudaLaunchCooperativeKernel ( ( void* ) computeResampleIndex, gridSize, blockSize, params, 0, NULL )"
========= Program hit cudaErrorCooperativeLaunchTooLarge (error 82) due to "too many blocks in cooperative launch" on CUDA API call to cudaLaunchCooperativeKernel.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x32f753]
========= Host Frame:./Filters [0xab9e9]
========= Host Frame:./Filters [0x58513]
========= Host Frame:./Filters [0x38daf]
========= Host Frame:./Filters [0x2de8c]
========= Host Frame:./Filters [0x2216a]
========= Host Frame:./Filters [0x1b155]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
========= Host Frame:./Filters [0xc589]
=========
========= ERROR SUMMARY: 1 error
nicelyma@wd-nicelyma-lnx:~/cuda-workspace/Filters/Debug$ cuda-memcheck ./Filters -s 1 --mcs=1 --gpu -b 30848
I understand I’m using too many ‘active blocks’ and have no argument with that. What I don’t understand is how to do the math to know how many blocks and threads I can call beforehand.
The kernel uses only 32 registers per thread, and the card has 20 SMs. I’m asking for 128 threads per block, and cudaOccupancyMaxActiveBlocksPerMultiprocessor reports 12 ‘active blocks’ per SM.
Also, it does work when I ask for 30720 threads. The math is just not clicking…
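
One way to sanity-check these numbers is the sketch below. It assumes a cooperative launch needs every block resident on the device at once, so the block limit is the SM count times the active blocks per SM. That is how the CUDA programming guide sizes cooperative grids, but treat it as an assumption about this kernel rather than a confirmed diagnosis. The helper names are hypothetical and the inputs are hard-coded from the figures quoted above. In a real program they would come from cudaGetDeviceProperties (multiProcessorCount) and cudaOccupancyMaxActiveBlocksPerMultiprocessor.

```cpp
#include <cassert>

// Hypothetical helper (not from the original code): upper bound on the grid
// size of a cooperative launch, ASSUMING all blocks must be co-resident.
int maxCooperativeBlocks(int numSMs, int activeBlocksPerSM) {
    return numSMs * activeBlocksPerSM;
}

// Hypothetical helper: number of blocks needed to cover `threads` threads,
// rounding up to a whole block.
int blocksNeeded(int threads, int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (threads + threadsPerBlock - 1) / threadsPerBlock;
}

// True if a grid covering `threads` threads stays within the assumed limit.
bool fitsCooperativeLaunch(int threads, int threadsPerBlock,
                           int numSMs, int activeBlocksPerSM) {
    return blocksNeeded(threads, threadsPerBlock)
           <= maxCooperativeBlocks(numSMs, activeBlocksPerSM);
}
```

With the figures from this post (20 SMs, 12 active blocks per SM, 128 threads per block), the assumed limit is 240 blocks, or 30720 threads. A request for 30848 threads needs 241 blocks, one more than the limit, which would be consistent with 30720 working and 30848 failing.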