cufft performance

Hello! I’m using cuFFT to compute the FFT of each row of a 2D image, and (as the profiler shows) occupancy is very low: 0.25. Is there a way to tune cuFFT, or to get its source code so I can optimize it?

Occupancy is perhaps the least important facet of CUDA, and 0.25 is a perfectly good amount of it.

It’s a pet peeve of mine how the documentation and tools seem to emphasize it without really explaining how and when it matters.

But it’s not so good when an image takes 720 1-D cuFFT calls and the profiler shows this for each FFT:

GPU time: 18.5

CPU time: 32

Occupancy: 0.25

Looks like bad block size in cufft.

What size FFT? Are you doing the FFTs in a batch?

1-D C2C FFT, size 1024, without batches.

With batches (2, 4 or 10) the profiler shows the same numbers, except that with a batch of 10 the resulting image is incorrect.

OK, thank you!
Just noticed that with a big batch size there are coalesced operations for every FFT, while without batching only 1 in 10 FFTs shows coalesced gld/gst. So performance gets better despite the occupancy of 0.25 :)
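
For anyone finding this thread later, here is a minimal sketch of what running all rows as one batch could look like. It assumes the image is stored row-major and contiguous in device memory as complex values, so each row is one 1-D transform. The helper name fftRowsBatched and its parameters are made up for illustration; only the cuFFT and CUDA runtime calls are real API.

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

// Hypothetical helper (not part of cuFFT): forward C2C FFT of every row of a
// row-major image that already lives in device memory, transformed in place.
// Assumes d_image holds width * rows contiguous cufftComplex elements.
// Returns true on success.
bool fftRowsBatched(cufftComplex* d_image, int width, int rows)
{
    cufftHandle plan;

    // One plan for the whole image: 'rows' back-to-back 1-D transforms,
    // each of length 'width' (e.g. width = 1024, rows = 720).
    if (cufftPlan1d(&plan, width, CUFFT_C2C, rows) != CUFFT_SUCCESS)
        return false;

    // A single call executes the entire batch instead of one call per row.
    cufftResult fftStatus = cufftExecC2C(plan, d_image, d_image, CUFFT_FORWARD);

    // Wait for the GPU to finish so that execution errors are reported here.
    cudaError_t cudaStatus = cudaDeviceSynchronize();

    cufftDestroy(plan);
    return fftStatus == CUFFT_SUCCESS && cudaStatus == cudaSuccess;
}
```

With a 1024 x 720 image this would be called as fftRowsBatched(d_image, 1024, 720). Note that the batch parameter of cufftPlan1d expects the transforms to sit one after another in the same buffer, so the buffer has to hold batch * width elements; in real code the plan would normally be created once and reused across images rather than rebuilt on every call.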