Trade-off between occupancy and registers per thread in cuFFTDx

I am using the cuFFTDx library. To get higher occupancy, I reduced the number of registers used per thread (with __launch_bounds__).
This did raise occupancy, but performance did not improve, probably because using fewer registers makes the FFT less efficient.

Is there a cuFFTDx trick that avoids this trade-off?


By reducing the number of registers available, you are forcing the kernel to spill to local memory (which resides in global memory). FFTs are memory bound, so it’s best to keep everything in registers. Where cuFFTDx excels is in allowing users to do more work in on-chip resources without having to return to global memory or launch additional kernels.