My application has to process a 4-dimensional complex data structure of dimensions KxMxNxR, 7.5 MB in size, in approximately 4.2 ms. The computation consists of several sequences of rearrangement, windowing and FFTs. Unfortunately, my current code takes 15 ms to execute, partly because cufft is a host function, which means all data has to remain in global memory, so the rearrangements incur costly memory accesses.
I would very much prefer if I could rearrange my input data inside a kernel into smaller R=60 "buckets" using shared memory, and then execute a "local" FFT from the kernel on data in shared memory. Could this be achieved with Vasily's FFT, given that a bucket would be 128 kB big? Or is there a way to retain shared memory between a kernel, a cufft call and another kernel?
Thanks for hints,