I have a question about CUDA in general (but I would also like to know whether CUDA 4.0 offers anything that helps).
I want to compute a very large 3D FFT using multiple GPUs (say, 4 GPUs). My idea is to create 4 CPU OpenMP threads, split the data and send it to the 4 GPUs, compute 2D FFTs of the slices, and copy the data back to the CPU. I would then transpose it, send it to the 4 GPUs again to compute 1D FFTs, copy it back to the CPU, and do the final transposition.
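To make the plan concrete, here is a minimal CPU-only sketch of the decomposition I have in mind; it is not the GPU code itself. It assumes a row-major layout with x varying fastest, uses a naive DFT as a stand-in for the library FFT calls, and replaces the explicit transpositions with strided access. The names dft1d, chunk_begin, fft3d_slabs and dft3d_direct are hypothetical helpers for illustration only, and each of the 4 OpenMP workers stands in for one GPU.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive O(n^2) in-place DFT of n elements spaced `stride` apart.
// Stand-in for a real FFT library call, used only to keep the sketch self-contained.
void dft1d(cplx* p, std::size_t n, std::size_t stride) {
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double ang = -two_pi * double((k * j) % n) / double(n);
            sum += p[j * stride] * std::polar(1.0, ang);
        }
        out[k] = sum;
    }
    for (std::size_t k = 0; k < n; ++k) p[k * stride] = out[k];
}

// First index of worker w's share when `count` items are split over `workers`.
std::size_t chunk_begin(std::size_t count, int w, int workers) {
    return count * std::size_t(w) / std::size_t(workers);
}

// 3D DFT of a[(z*ny + y)*nx + x] in two phases, each split over `workers`.
// Compile with -fopenmp to run the workers in parallel; without it the
// pragmas are ignored and the same loops run serially.
void fft3d_slabs(std::vector<cplx>& a, std::size_t nx, std::size_t ny,
                 std::size_t nz, int workers = 4) {
    assert(workers > 0 && a.size() == nx * ny * nz);
    const std::size_t plane = nx * ny;
    cplx* data = a.data();

    // Phase 1: each worker takes a slab of z-slices and does a 2D DFT on each.
    #pragma omp parallel for num_threads(workers) schedule(static)
    for (int w = 0; w < workers; ++w) {
        const std::size_t z_end = chunk_begin(nz, w + 1, workers);
        for (std::size_t z = chunk_begin(nz, w, workers); z < z_end; ++z) {
            cplx* slice = data + z * plane;
            for (std::size_t y = 0; y < ny; ++y) dft1d(slice + y * nx, nx, 1);  // rows
            for (std::size_t x = 0; x < nx; ++x) dft1d(slice + x, ny, nx);      // columns
        }
    }

    // Phase 2: each worker takes a share of the (y, x) positions and does the
    // 1D DFT along z. Strided access here replaces the explicit transposition.
    #pragma omp parallel for num_threads(workers) schedule(static)
    for (int w = 0; w < workers; ++w) {
        const std::size_t p_end = chunk_begin(plane, w + 1, workers);
        for (std::size_t p = chunk_begin(plane, w, workers); p < p_end; ++p)
            dft1d(data + p, nz, plane);
    }
}

// Direct triple-sum 3D DFT, only for checking small inputs.
std::vector<cplx> dft3d_direct(const std::vector<cplx>& a, std::size_t nx,
                               std::size_t ny, std::size_t nz) {
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cplx> out(a.size());
    for (std::size_t kz = 0; kz < nz; ++kz)
    for (std::size_t ky = 0; ky < ny; ++ky)
    for (std::size_t kx = 0; kx < nx; ++kx) {
        cplx sum(0.0, 0.0);
        for (std::size_t z = 0; z < nz; ++z)
        for (std::size_t y = 0; y < ny; ++y)
        for (std::size_t x = 0; x < nx; ++x) {
            const double ang = -two_pi * (double(kx * x) / double(nx) +
                                          double(ky * y) / double(ny) +
                                          double(kz * z) / double(nz));
            sum += a[(z * ny + y) * nx + x] * std::polar(1.0, ang);
        }
        out[(kz * ny + ky) * nx + kx] = sum;
    }
    return out;
}
```

For a small array, fft3d_slabs(a, nx, ny, nz) should agree with dft3d_direct(a, nx, ny, nz) up to rounding, which is how I would check the slicing before moving the two phases onto the GPUs.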
Under the plan above, I want to launch threads on the GPUs that compute the 2D FFTs and 1D FFTs. The problem is that CUFFT cannot be called from inside a kernel function (CUFFT functions are callable only from host functions).
So, any suggestions?
Thanks in advance for the reply.