FFT sub-areas of an image in parallel


I’m new to CUDA and working on an application that needs to process sub-areas of a large image. The image is 4k x 4k, and FFT and matrix multiplication routines have to be applied to every 256 x 256 region repeatedly. I first tried using the CUDA FFT library (CUFFT) to process each region in sequence. Because of the per-call overhead and the limited parallelism available in such a small area, this does not get the most out of the GPU.
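To make the layout concrete, here is a small host-side sketch (plain C++, no CUDA calls) of how I think about the regions. It assumes "4k" means 4096 pixels per side and that the image is stored row-major; the names `kImageSide`, `kRegionSide` and `regionOrigin` are just my own for illustration.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kImageSide  = 4096;  // assumption: "4k" means 4096 pixels per side
const std::size_t kRegionSide = 256;   // side of one sub-area

// Number of regions along one side of the image.
inline std::size_t regionsPerSide() {
    return kImageSide / kRegionSide;
}

// Total number of non-overlapping regions in the image.
inline std::size_t regionCount() {
    return regionsPerSide() * regionsPerSide();
}

// Linear index (row-major) of the top-left element of region (row, col).
inline std::size_t regionOrigin(std::size_t row, std::size_t col) {
    assert(row < regionsPerSide() && col < regionsPerSide());
    return (row * kRegionSide) * kImageSide + col * kRegionSide;
}
```

So the sequential approach means one FFT call per region, `regionCount()` calls in total for each pass over the image.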

Next, I tried to implement kernel functions that process the sub-areas in parallel, so that a single high-level CUDA call handles all regions. Ideally, each block would hold its entire region in shared memory. I set BLOCK_SIZE to 16 x 16 once, and when the kernel is called, each block should load its region into shared memory. But I discovered that shared memory is limited to 16 kB per block, so it can hold at most a 32 x 32 square tile of float2 values (8 kB), while a full 256 x 256 region of float2 values needs 512 kB.
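The shared memory arithmetic can be checked with a few lines of host code. This is only a sketch under my own assumptions: `Complex32` stands in for CUDA's `float2` (two 32-bit floats), the whole 16 kB is treated as usable, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for CUDA's float2: two 32-bit floats per complex sample.
struct Complex32 { float x, y; };

const std::size_t kSharedBytes = 16 * 1024;  // per-block shared memory limit quoted above

// Bytes needed to hold a square side x side tile of complex samples.
inline std::size_t tileBytes(std::size_t side) {
    return side * side * sizeof(Complex32);
}

// Largest power-of-two side whose square tile still fits in the byte budget.
inline std::size_t maxPow2Side(std::size_t budget) {
    std::size_t side = 1;
    while (tileBytes(side * 2) <= budget) {
        side *= 2;
    }
    return side;
}
```

With these numbers a 64 x 64 tile already needs 32 kB, so 32 x 32 is the largest square power-of-two tile that fits, and a 256 x 256 region is 32 times larger than the whole shared memory budget.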

After that, I looked at “cufft_c2c_radix2.cu” for hints, hoping to use it to FFT a signal that is spread across multiple blocks. It seems to me that the code in “cufft_c2c_radix2.cu” is meant for 1D FFTs, and that its operation depends on both the block size and the signal size. Is that right? For my case, where I want to FFT a 2D signal with a fixed block size, is there a way to parallelize this using CUDA?

Or am I on the wrong path and should try something else? Any suggestions would be greatly appreciated!

Here’s something I wrote that might help: