Hi All,
I have a question about CUDA in general (but I would also like to know whether CUDA 4.0 offers any help here).
I want to compute a very large 3D FFT on multiple GPUs (let's say 4). My idea is to first create 4 OpenMP threads on the CPU, divide the data and send it to the 4 GPUs, compute 2D FFTs of the slices, bring the data back to the CPU, transpose it, send it to the 4 GPUs again and compute the 1D FFTs, then bring the data back to the CPU and do the final transposition.
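Roughly, the host-side structure I have in mind looks like this (a sketch only; process_slab is just a placeholder name for the per-GPU copy + FFT + copy-back work):

```c
#include <cuda_runtime.h>
#include <omp.h>
#include <stddef.h>

/* Placeholder for the per-GPU work: copy a slab over, run the batched
 * FFTs on it, copy it back. Hypothetical helper, not a cuFFT call. */
static void process_slab(int gpu, float *host_slab, size_t slab_floats)
{
    (void)gpu; (void)host_slab; (void)slab_floats;
    /* cudaMalloc / cudaMemcpy to device / FFT / cudaMemcpy back */
}

/* One OpenMP thread per GPU; each thread binds itself to one device. */
static void fft_pass(float *host_data, size_t slab_floats, int num_gpus)
{
    #pragma omp parallel num_threads(num_gpus)
    {
        int gpu = omp_get_thread_num();
        cudaSetDevice(gpu);   /* bind this CPU thread to its own GPU */
        process_slab(gpu, host_data + (size_t)gpu * slab_floats, slab_floats);
    }
}
```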
According to this plan, I want to fork threads on the GPUs that would compute the 2D and 1D FFTs. But the problem is that one cannot call cuFFT inside a kernel function (cuFFT functions are callable only from host code).
So, any suggestions?
Thanks in advance for the reply.
I've been working on this topic for a year. I have a working code, but with a concurrent copy-and-execute implementation, which is a bit tricky.
Your plan is correct, but you don't have to call cuFFT functions from a kernel. As a matter of fact, you don't have to write any CUDA kernels at all! The cuFFT functions can only be called from the host, via cufftExecXYZ(cufft_plan, …), because such a call itself launches a parallel kernel on the device. For the forward FFT you only have to define a batched 2D real-to-complex plan and a batched 1D complex-to-complex plan with cufftPlanMany. Copy the slices to the GPUs, run the batched 2D FFT, copy back, rearrange (transpose), copy to the GPUs again, run the batched 1D FFT, then copy back to the host once more. Note that the x and z directions will end up exchanged. That's all.
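As a minimal sketch of the two plans for one GPU's share of an NX x NY x NZ grid (nz_local, pencils_local and the device buffer names are my own placeholders; allocation and the host-side transposes are assumed to happen in the caller):

```c
#include <cufft.h>
#include <cuda_runtime.h>

/* Forward FFT work for one GPU: batched 2D R2C over the local z-slices,
 * then (after a host-side transpose) batched 1D C2C along z. */
void forward_fft_on_one_gpu(int NX, int NY, int NZ,
                            int nz_local, int pencils_local,
                            cufftReal *d_slices, cufftComplex *d_slices_out,
                            cufftComplex *d_pencils)
{
    cufftHandle plan2d, plan1d;

    /* Batched 2D real-to-complex FFT: one transform per z-slice. */
    int n2d[2] = { NY, NX };
    cufftPlanMany(&plan2d, 2, n2d,
                  NULL, 1, NX * NY,            /* packed real input slices   */
                  NULL, 1, NY * (NX/2 + 1),    /* packed complex output      */
                  CUFFT_R2C, nz_local);
    cufftExecR2C(plan2d, d_slices, d_slices_out);

    /* ...meanwhile on the host: copy back, transpose so z becomes
       contiguous, copy the resulting pencils to the device again... */

    /* Batched 1D complex-to-complex FFT along z: one transform per pencil. */
    int n1d[1] = { NZ };
    cufftPlanMany(&plan1d, 1, n1d,
                  NULL, 1, NZ,
                  NULL, 1, NZ,
                  CUFFT_C2C, pencils_local);
    cufftExecC2C(plan1d, d_pencils, d_pencils, CUFFT_FORWARD);

    cufftDestroy(plan2d);
    cufftDestroy(plan1d);
}
```

In a real code you would create the plans once and reuse them every step, and check the cufftResult return values; they are omitted here to keep the sketch short.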