Hi!
I am using the cuFFTDx library. To get higher occupancy, I reduced the number of registers used per thread (with __launch_bounds__).
It did work for getting higher occupancy, but I did not get better performance, probably because using fewer registers results in a less efficient FFT.
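For context, here is a minimal sketch of the kind of setup I mean. The kernel is a hypothetical stand-in (not my actual cuFFTDx kernel), and the bounds of 128 threads per block and 8 blocks per SM are illustrative values only. It prints the register count, the local memory used per thread (a non-zero value may indicate register spilling), and the occupancy the runtime reports:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative values only, not tuned for any particular GPU or FFT size.
constexpr int kThreadsPerBlock = 128;
constexpr int kMinBlocksPerSM  = 8;

// Hypothetical stand-in kernel. __launch_bounds__ asks the compiler to cap
// register usage so that kMinBlocksPerSM blocks of kThreadsPerBlock threads
// can be resident on one SM.
__global__ void __launch_bounds__(kThreadsPerBlock, kMinBlocksPerSM)
bounded_kernel(float2* data) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i].x *= 2.0f;
}

int main() {
    // Query what the compiler actually produced for this kernel.
    cudaFuncAttributes attr{};
    cudaError_t err = cudaFuncGetAttributes(&attr, bounded_kernel);
    if (err != cudaSuccess) {
        std::printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Ask the runtime how many blocks of this kernel fit on one SM.
    int blocksPerSM = 0;
    err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, bounded_kernel, kThreadsPerBlock, 0 /* dynamic smem bytes */);
    if (err != cudaSuccess) {
        std::printf("occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("registers per thread:    %d\n", attr.numRegs);
    std::printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    std::printf("resident blocks per SM:  %d\n", blocksPerSM);
    return 0;
}
```

Comparing these numbers with and without the launch bounds shows whether the extra occupancy comes at the cost of local memory traffic.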
Is there some cuFFTDx trick that can help avoid this trade-off?
Thanks