cuFFT callbacks slow

I have a basic overlap-save filter that I’ve implemented using cuFFT. My first implementation did a forward FFT on each new block of input data, then a simple vector multiply of the transformed coefficients and the transformed input data, followed by an inverse FFT. It produces the expected output samples.

I timed this using CUDA events. I accumulated the time for the frequency-domain multiply and the inverse FFT, and averaged over many runs (about 300). One block of input data took about 62 us to process.

In an attempt to improve the throughput of the filter, I decided to try the cuFFT callback feature. Now I call the inverse FFT with the transformed coefs as input, and in the input callback I multiply by the transformed input samples. The answer produced is mathematically correct, but it takes considerably longer: about 97 us per block.
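To make the two variants concrete, here is a minimal host-side C++ sketch of the arithmetic only. It is not cuFFT code: `multiply_pass` and `load_callback` are hypothetical names, and the callback signature is a simplification I made up for illustration rather than the real cuFFT callback type. It just shows that a separate multiply pass and a multiply done inside the load step feed the same values to the inverse FFT.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Variant 1: separate multiply pass. Writes an intermediate buffer that the
// inverse FFT would then read. Assumes both vectors have the same length.
std::vector<cfloat> multiply_pass(const std::vector<cfloat>& coefs,
                                  const std::vector<cfloat>& input) {
    std::vector<cfloat> out(coefs.size());
    for (std::size_t i = 0; i < coefs.size(); ++i) {
        out[i] = coefs[i] * input[i];
    }
    return out;
}

// Variant 2: hypothetical load callback. The inverse FFT is pointed at the
// transformed coefficients (data_in); the transformed input samples arrive
// through a caller-supplied pointer, and the product is formed as each
// element is loaded, so no intermediate buffer is written.
cfloat load_callback(const cfloat* data_in, std::size_t offset,
                     const cfloat* caller_info) {
    return data_in[offset] * caller_info[offset];
}
```

The motivation for the second variant is that it avoids one full write and one full read of the intermediate buffer, which is why I expected it to be faster rather than slower.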

Initially I blamed my callback code, so I created a callback that just returns the input data. Obviously this produces the wrong output samples, but I was looking to establish a performance bound. It was no faster with this no-op callback. This suggests that the callback mechanism itself, rather than the work done inside my callback, is the source of the extra execution time.

I’ve tried this with one FFT frame at a time (cufftPlan1d), and I’ve tried it with multiple sets of filter coefs (cufftPlanMany with 100+ 1D FFTs) with similar results: mathematically correct output, but approximately 50% increase in processing time.

While talking with one of my colleagues recently, he mentioned a regression in cuFFT callback performance, but he didn’t recall when it occurred.

Setup:
CUDA 9.2
K5200
RHEL 6.7

Questions:
Any suggestions for improving the throughput of the cuFFT callbacks?
Is there a known issue/regression in the cuFFT callback performance?

Thanks,
Brett

Since I suspected the problem was with my own code, I decided to follow the example located here:
https://devblogs.nvidia.com/cuda-pro-tip-use-cufft-callbacks-custom-data-processing/

With this example code, the callback case is faster when the FFT size is kept at 1024. When I increase the FFT size to 64k, however, the callback case becomes slower than the non-callback case.

Here’s some data I’ve collected:

1k FFT, non-callback, 0.520 ms
1k FFT, callback, 0.300 ms

64k FFT, non-callback, 30 ms
64k FFT, callback, 40 ms

Why are the callbacks slower for the larger FFT size?

I managed to get my hands on an M4000 card, and tried my experiment on that card with similar results.

1k FFT, non-callback, 0.400 ms
1k FFT, callback, 0.300 ms

64k FFT, non-callback, 26 ms
64k FFT, callback, 42 ms

Correction to my setup: CUDA 9.1

I tried on an M6000 with CUDA 9.1 and CUDA 7.5 with similar results again.

1k FFT, non-callback, 0.24 ms
1k FFT, callback, 0.17 ms

64k FFT, non-callback, 13.6 ms
64k FFT, callback, 20.1 ms

I have found the problem. The example code is transposing the outputs, which means the output writes aren’t coalesced at all.
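For anyone who hits the same thing, here is a small host-side C++ sketch of the index arithmetic behind that. The function names are hypothetical, and it assumes a row-major batch of transforms in which neighbouring threads handle neighbouring bins of the same transform; that is my assumption about the kernel, not something I have confirmed from the cuFFT internals.

```cpp
#include <cstddef>

// Untransposed layout: bin k of transform b is written to b * n + k,
// where n is the FFT length.
std::size_t linear_index(std::size_t b, std::size_t k, std::size_t n) {
    return b * n + k;
}

// Transposed layout: the same element is written to k * batch + b,
// where batch is the number of transforms.
std::size_t transposed_index(std::size_t b, std::size_t k, std::size_t batch) {
    return k * batch + b;
}

// Distance, in elements, between the writes for bins k and k + 1 of one
// transform. A stride of 1 means neighbouring writes touch adjacent memory.
std::size_t stride_between_adjacent_bins(bool transposed, std::size_t n,
                                         std::size_t batch) {
    if (transposed) {
        return transposed_index(0, 1, batch) - transposed_index(0, 0, batch);
    }
    return linear_index(0, 1, n) - linear_index(0, 0, n);
}
```

With illustrative values of n = 1024 and batch = 1000, the untransposed stride is 1 element while the transposed stride is 1000 elements, so each write in the transposed store lands far from its neighbour instead of next to it.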