FFT in kernel on shared memory


My application has to process a four-dimensional complex data structure of dimensions KxMxNxR, 7.5 MB in size, in approximately 4.2 ms. The computation consists of several sequences of rearrangement, windowing, and FFTs. Unfortunately, my current code takes 15 ms to execute, partly because cuFFT is called from the host, so all data has to stay in global memory between steps, which makes the rearrangements costly in memory accesses.

I would much prefer to rearrange my input data inside a kernel into R=60 smaller “buckets” in shared memory, and then execute a “local” FFT from within the kernel on the data held in shared memory. Could this be achieved with Vasily’s FFT, given that each bucket would be 128 kB in size? Or is there a way to retain shared memory contents between a kernel, a cuFFT call, and another kernel?
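For reference, here is a rough back-of-the-envelope check of the bucket size against the shared memory available to one thread block. The total size and bucket count come from the numbers above; the 8 bytes per sample (single-precision complex) and the 48 KB shared memory limit are assumptions on my part, and the real limit depends on the GPU generation.

```python
import math

# Figures taken from the post
TOTAL_BYTES = int(7.5 * 1024 * 1024)  # 7.5 MB data set
NUM_BUCKETS = 60                      # R = 60 buckets

# ASSUMPTIONS (not stated in the post): adjust for your data type and GPU
BYTES_PER_SAMPLE = 8                  # single-precision complex, 2 x float32
SHARED_BYTES_PER_BLOCK = 48 * 1024    # per-block shared memory; varies by GPU

# Size of one bucket and the number of complex samples it holds
bucket_bytes = TOTAL_BYTES // NUM_BUCKETS
samples_per_bucket = bucket_bytes // BYTES_PER_SAMPLE

# Does a whole bucket fit in one block's shared memory, and if not,
# how many shared-memory-sized tiles would it have to be split into?
fits_in_shared = bucket_bytes <= SHARED_BYTES_PER_BLOCK
tiles_needed = math.ceil(bucket_bytes / SHARED_BYTES_PER_BLOCK)

print(f"bucket size: {bucket_bytes} bytes ({bucket_bytes / 1024:.0f} KiB)")
print(f"samples per bucket: {samples_per_bucket}")
print(f"fits in shared memory: {fits_in_shared}, tiles needed: {tiles_needed}")
```

Under these assumptions a bucket does not fit into the shared memory of a single block, so each bucket would have to be processed in several tiles, with intermediate results going through global memory.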

Thanks for any hints,