cuFFT release 11.8 added a new known issue:
‣ Performance of cuFFT callback functionality was changed across all plan types and FFT sizes. Performance of a small set of cases regressed up to 0.5x, while most of the cases didn’t change performance significantly, or improved up to 2x. In addition to these performance changes, using cuFFT callbacks for loading data in out-of-place transforms might exhibit performance and memory footprint overhead for all cuFFT plan types and FFT sizes. An upcoming release will update the cuFFT callback implementation, removing the overheads and performance drops. cuFFT deprecated callback functionality based on separate compiled device code in cuFFT 11.4.
Whatever change this issue refers to seems to cause extra device allocations on every call into cuFFT, which in turn makes the library's runtime unpredictable. My application streams data through GPU operations, and random runtime spikes can force us to drop data to keep up.
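A minimal sketch of how per-call spikes like this can be surfaced, assuming a host-side timing harness wrapped around each transform. The names `LatencyStats`, `summarize`, and `measure` are hypothetical helpers for illustration, not part of cuFFT or the application described here, and the 3x-median spike threshold is an arbitrary choice.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Summary of per-call wall-clock latencies, in microseconds.
struct LatencyStats {
    double median_us;    // upper-middle element for even sample counts
    double max_us;
    std::size_t spikes;  // calls slower than spike_factor * median
};

// Summarise already-collected latencies (hypothetical helper).
inline LatencyStats summarize(std::vector<double> samples_us, double spike_factor) {
    LatencyStats s{0.0, 0.0, 0};
    if (samples_us.empty()) return s;
    std::sort(samples_us.begin(), samples_us.end());
    s.median_us = samples_us[samples_us.size() / 2];
    s.max_us = samples_us.back();
    for (double v : samples_us) {
        if (v > spike_factor * s.median_us) ++s.spikes;
    }
    return s;
}

// Time each of `iterations` calls individually.
// `call` must block until the GPU work has finished (for example
// cufftExecC2C followed by cudaStreamSynchronize on the plan's stream);
// otherwise only the launch overhead is measured.
template <typename F>
LatencyStats measure(F&& call, std::size_t iterations, double spike_factor = 3.0) {
    std::vector<double> samples_us;
    samples_us.reserve(iterations);
    for (std::size_t i = 0; i < iterations; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        call();
        const auto t1 = std::chrono::steady_clock::now();
        samples_us.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    return summarize(std::move(samples_us), spike_factor);
}
```

Running the same plan through `measure` once with the load callback attached and once without would show whether the outliers follow the callback path. Comparing the median against the maximum matters here because an average hides the rare slow calls that cause dropped data.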
Should new development avoid the cuFFT callback functionality? Is there a plan or timeline for remedying this issue?
The software runs on a 6000 Ada, which is only supported from CUDA 11.8 onward, so I can't just fall back to an older version of CUDA.