I have a basic overlap-save filter that I’ve implemented using cuFFT. My first implementation performs a forward FFT on each new block of input data, then a simple pointwise multiply of the transformed coefficients with the transformed input data, followed by an inverse FFT. It produces the expected output samples.
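For reference, the baseline pipeline looks roughly like this (a sketch, not my exact code; the kernel name `pointwiseMultiply`, the buffer names, and the 256-thread launch configuration are placeholders):

```cuda
#include <cufft.h>
#include <cuComplex.h>

// Frequency-domain pointwise multiply: freq[i] = freq[i] * coefs[i]
__global__ void pointwiseMultiply(cufftComplex *freq,
                                  const cufftComplex *coefs, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        freq[i] = cuCmulf(freq[i], coefs[i]);
}

// d_in, d_freq, d_out are device buffers of N cufftComplex each;
// d_coefsFreq already holds the transformed filter coefficients.
void filterBlock(cufftHandle plan, cufftComplex *d_in, cufftComplex *d_freq,
                 const cufftComplex *d_coefsFreq, cufftComplex *d_out, int N)
{
    cufftExecC2C(plan, d_in, d_freq, CUFFT_FORWARD);            // forward FFT
    pointwiseMultiply<<<(N + 255) / 256, 256>>>(d_freq, d_coefsFreq, N);
    cufftExecC2C(plan, d_freq, d_out, CUFFT_INVERSE);           // inverse FFT
}
```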
I performed some timing using CUDA events, accumulating the time for the frequency-domain multiply and the inverse FFT and averaging many runs (around 300). One block of input data takes about 62 us to process.
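The timing was done with CUDA events along these lines (a sketch; the ~300-run averaging loop is elided):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
// ... frequency-domain multiply + inverse FFT for one block ...
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
```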
In an attempt to improve the throughput of the filter, I decided to try the cuFFT callback feature. Now I call the inverse FFT with the transformed coefficients as its input, and in the load callback I multiply each element by the corresponding transformed input sample. The output is mathematically correct, but it takes considerably longer: about 97 us per block.
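The callback version attaches a load callback to the inverse plan, roughly like this (a sketch; passing the transformed input block as `callerInfo` is my setup). Note that callbacks require the static cuFFT library and relocatable device code (`nvcc -dc ... -lcufft_static -lculibos`):

```cuda
#include <cufft.h>
#include <cufftXt.h>
#include <cuComplex.h>

// Load callback: each element read by the inverse FFT is the product of the
// transformed coefficient (dataIn) and the transformed input sample.
__device__ cufftComplex loadMultiply(void *dataIn, size_t offset,
                                     void *callerInfo, void *sharedPtr)
{
    cufftComplex coef = ((cufftComplex *)dataIn)[offset];
    cufftComplex x    = ((cufftComplex *)callerInfo)[offset];
    return cuCmulf(coef, x);
}

__device__ cufftCallbackLoadC d_loadPtr = loadMultiply;

// d_inFreq is the device buffer holding the transformed input block.
void attachCallback(cufftHandle inversePlan, cufftComplex *d_inFreq)
{
    cufftCallbackLoadC h_loadPtr;
    cudaMemcpyFromSymbol(&h_loadPtr, d_loadPtr, sizeof(h_loadPtr));
    cufftXtSetCallback(inversePlan, (void **)&h_loadPtr,
                       CUFFT_CB_LD_COMPLEX, (void **)&d_inFreq);
}
```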
Initially I suspected my callback code, so I created a callback that simply returns its input element. Obviously that produces the wrong output samples, but I wanted to establish a performance bound. It was no faster with this no-op callback, which suggests the callback mechanism itself, not my arithmetic, is the source of the extra execution time.
I’ve tried this with one FFT frame at a time (cufftPlan1d), and with multiple sets of filter coefficients (cufftPlanMany with 100+ 1-D FFTs), with similar results: mathematically correct output, but roughly a 50% increase in processing time.
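For the batched case, the plan is created with cufftPlanMany along these lines (a sketch; `BATCH` as the number of coefficient sets and the contiguous layout are assumptions about my setup):

```cuda
#include <cufft.h>

// N = FFT length per transform, BATCH = number of coefficient sets (100+)
void makeBatchedPlan(cufftHandle *plan, int N, int BATCH)
{
    int n[1] = { N };
    cufftPlanMany(plan, 1, n,
                  NULL, 1, N,   // input: contiguous, stride 1, distance N
                  NULL, 1, N,   // output: same layout
                  CUFFT_C2C, BATCH);
}
```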
While talking with one of my colleagues recently, he mentioned a regression in cuFFT callback performance, but he didn’t recall when it occurred.
Any suggestions for improving the throughput of the cuFFT callbacks?
Is there a known issue/regression in the cuFFT callback performance?