I’m working on an image processing project where I need to take the forward FFT and inverse FFT (IFFT) of large images (>2 MP), with some pre- and post-processing steps around and between the two transforms. I know that cuFFT load/store callbacks can be used to process the data as it is read or written by a cuFFT execution call, which reduces memory round trips (important here because I’m bandwidth-bound). My current pipeline looks like:
img -> processing (fwd cuFFT load cb) -> fwd cuFFT -> processing (fwd cuFFT store cb) -> inv cuFFT -> processing (inv cuFFT store cb) -> output
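To make the pipeline above concrete, here is a minimal sketch of how it could be wired up with the legacy cuFFT callback API. It assumes single-precision C2C transforms and a 1920x1080 image; the callback bodies (pass-through, and a scale by 1/(H*W)) and all names (`loadPre`, `storeMid`, `storePost`, `d_freq`, `check`) are placeholders for illustration, not the actual processing. Legacy callbacks need relocatable device code and the static cuFFT library, so a build line would look roughly like `nvcc -rdc=true pipeline.cu -lcufft_static -lculibos`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>

// Placeholder pre-processing (forward load callback): pass the pixel through.
__device__ cufftComplex loadPre(void *dataIn, size_t offset,
                                void *callerInfo, void *sharedPtr) {
    return static_cast<cufftComplex *>(dataIn)[offset];
}

// Placeholder frequency-domain step (forward store callback):
// scale each coefficient by the factor that callerInfo points to.
__device__ void storeMid(void *dataOut, size_t offset, cufftComplex element,
                         void *callerInfo, void *sharedPtr) {
    float s = *static_cast<float *>(callerInfo);
    element.x *= s;
    element.y *= s;
    static_cast<cufftComplex *>(dataOut)[offset] = element;
}

// Placeholder post-processing (inverse store callback): write unchanged.
__device__ void storePost(void *dataOut, size_t offset, cufftComplex element,
                          void *callerInfo, void *sharedPtr) {
    static_cast<cufftComplex *>(dataOut)[offset] = element;
}

// Device-side function pointers; their values are copied to the host below.
__device__ cufftCallbackLoadC  d_loadPre   = loadPre;
__device__ cufftCallbackStoreC d_storeMid  = storeMid;
__device__ cufftCallbackStoreC d_storePost = storePost;

// Hypothetical helper: abort on any cuFFT error.
static void check(cufftResult r, const char *what) {
    if (r != CUFFT_SUCCESS) {
        std::fprintf(stderr, "%s failed (cufftResult %d)\n", what, (int)r);
        std::exit(1);
    }
}

int main() {
    const int H = 1080, W = 1920;  // assumed image size, about 2 MP
    const size_t bytes = sizeof(cufftComplex) * H * W;

    cufftComplex *d_img, *d_freq, *d_out;
    cudaMalloc(&d_img, bytes);
    cudaMalloc(&d_freq, bytes);   // intermediate buffer between the two FFTs
    cudaMalloc(&d_out, bytes);
    cudaMemset(d_img, 0, bytes);  // stand-in for real image data

    float scale = 1.0f / (H * W);
    float *d_scale;
    cudaMalloc(&d_scale, sizeof(float));
    cudaMemcpy(d_scale, &scale, sizeof(float), cudaMemcpyHostToDevice);

    cufftHandle fwd, inv;
    check(cufftPlan2d(&fwd, H, W, CUFFT_C2C), "plan fwd");
    check(cufftPlan2d(&inv, H, W, CUFFT_C2C), "plan inv");

    cufftCallbackLoadC h_loadPre;
    cufftCallbackStoreC h_storeMid, h_storePost;
    cudaMemcpyFromSymbol(&h_loadPre, d_loadPre, sizeof(h_loadPre));
    cudaMemcpyFromSymbol(&h_storeMid, d_storeMid, sizeof(h_storeMid));
    cudaMemcpyFromSymbol(&h_storePost, d_storePost, sizeof(h_storePost));

    // callerInfo is only read by storeMid; the other callbacks ignore it.
    check(cufftXtSetCallback(fwd, (void **)&h_loadPre, CUFFT_CB_LD_COMPLEX,
                             (void **)&d_scale), "set fwd load cb");
    check(cufftXtSetCallback(fwd, (void **)&h_storeMid, CUFFT_CB_ST_COMPLEX,
                             (void **)&d_scale), "set fwd store cb");
    check(cufftXtSetCallback(inv, (void **)&h_storePost, CUFFT_CB_ST_COMPLEX,
                             (void **)&d_scale), "set inv store cb");

    // Two separate launches: d_freq is written to global memory by the
    // forward transform and read back by the inverse transform.
    check(cufftExecC2C(fwd, d_img, d_freq, CUFFT_FORWARD), "exec fwd");
    check(cufftExecC2C(inv, d_freq, d_out, CUFFT_INVERSE), "exec inv");
    cudaDeviceSynchronize();

    cufftDestroy(fwd);
    cufftDestroy(inv);
    cudaFree(d_img); cudaFree(d_freq); cudaFree(d_out); cudaFree(d_scale);
    return 0;
}
```

In this sketch the write and subsequent read of `d_freq` between the two `cufftExecC2C` calls is the round trip in question.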
Is there a way to link the store callback of the forward cuFFT with the load callback of the inverse cuFFT so that the extra memory round trip between the two cuFFT invocations disappears? In other words, a single cuFFT invocation that starts AND ends in the spatial domain: it would contain the whole pipeline, take img as input, and produce the output, with the respective kernels assigned to the callbacks. If not, is a feature like this planned (given that this pipeline is a pretty standard one in image processing, e.g. for filtering)?