I am trying to run CUFFT v4.1 in parallel over 4 GPUs (M2050s), and I have some questions about it:
I am dividing the data as N x (N/p), where p is the number of GPUs, and executing CUFFT on these chunks. As I understand it, I need to perform the following steps to make the FFT parallel:
1.1 Run 1D CUFFT on each row (on N*N/p chunks on each GPU)
1.2 Copy the data back to the host from the p GPUs, do a global transpose of the entire array, then send N*N/p chunks back to the p GPUs
1.3 Run 1D CUFFT on each row (on N*N/p chunks on each GPU); rows again, because after the transpose the data is already contiguous
Am I correct in my understanding? Can I make use of 2d CUFFT for my purpose?
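A minimal CPU-only sketch of steps 1.1 to 1.3, to check the decomposition rather than the performance. A naive O(n^2) DFT stands in for each 1D CUFFT call, and the names `dft_row`, `transform_rows_in_chunks`, `fft2d_chunked` and `dft2d_direct` are made up for this illustration. It assumes a square N x N complex array with p dividing N. Note that the result of step 1.3 holds the 2D transform in transposed layout.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;
using Matrix = std::vector<cd>;  // row-major n x n array

// Naive O(n^2) DFT of one contiguous row; stands in for a 1D CUFFT call.
void dft_row(cd* row, std::size_t n) {
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cd sum = 0;
        for (std::size_t j = 0; j < n; ++j)
            sum += row[j] * std::polar(1.0, -2.0 * pi * double((j * k) % n) / double(n));
        out[k] = sum;
    }
    for (std::size_t k = 0; k < n; ++k) row[k] = out[k];
}

// Steps 1.1 and 1.3: each of the p workers transforms its own n/p rows.
// Assumes p divides n.
void transform_rows_in_chunks(Matrix& a, std::size_t n, std::size_t p) {
    const std::size_t rows_per_chunk = n / p;
    for (std::size_t g = 0; g < p; ++g)  // g plays the role of a GPU index
        for (std::size_t r = g * rows_per_chunk; r < (g + 1) * rows_per_chunk; ++r)
            dft_row(&a[r * n], n);
}

// Step 1.2: global transpose of the whole array.
Matrix transpose(const Matrix& a, std::size_t n) {
    Matrix t(n * n);
    for (std::size_t r = 0; r < n; ++r)
        for (std::size_t c = 0; c < n; ++c)
            t[c * n + r] = a[r * n + c];
    return t;
}

// Rows, transpose, rows. The returned array is the TRANSPOSE of the 2D DFT.
Matrix fft2d_chunked(Matrix a, std::size_t n, std::size_t p) {
    transform_rows_in_chunks(a, n, p);
    a = transpose(a, n);
    transform_rows_in_chunks(a, n, p);
    return a;
}

// 2D DFT straight from the definition, used only as a reference.
Matrix dft2d_direct(const Matrix& a, std::size_t n) {
    const double pi = std::acos(-1.0);
    Matrix f(n * n);
    for (std::size_t k1 = 0; k1 < n; ++k1)
        for (std::size_t k2 = 0; k2 < n; ++k2) {
            cd sum = 0;
            for (std::size_t r = 0; r < n; ++r)
                for (std::size_t c = 0; c < n; ++c)
                    sum += a[r * n + c] *
                           std::polar(1.0, -2.0 * pi * double((k1 * r + k2 * c) % n) / double(n));
            f[k1 * n + k2] = sum;
        }
    return f;
}
```

For a small array, `fft2d_chunked(a, n, p)[c * n + r]` should agree with `dft2d_direct(a, n)[r * n + c]` up to rounding, so a final transpose (or reading the output with swapped indices) recovers the usual layout.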
When I do the above, the performance is quite poor: I get around 10-12 GFlops per GPU on a 4096x4096 image, whereas the single-GPU version (one GPU running a 2D CUFFT, with its implicit transpose) reaches some 80 GFlops. I would like to know an efficient way to make CUFFT parallel, something that lets me transfer a large chunk of data to the GPU and then just run 1D CUFFTs on it, avoiding 2*4096 memcpy calls.
You can issue the transforms of several arrays in one call, and you can transfer more than one line at a time using cudaMemcpy2D.
The transpose is the hard part. It would help if you could do the transposing without the intermediate steps; that way you would have only GPU <-> GPU transfers.
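A rough sketch of what that could look like for one GPU's chunk, assuming single-precision complex data, in-place transforms, and rows stored contiguously on the device. The function name `fft_rows_on_device` and its parameters are hypothetical; the CUDA runtime and CUFFT calls are the standard ones. It needs the CUDA toolkit to build and has not been tuned.

```cpp
#include <cuda_runtime.h>
#include <cufft.h>
#include <cstddef>

// Transform `rows` rows of length n on device `dev` using one batched plan.
// host_chunk points at the first row of this GPU's chunk, and host_pitch_bytes
// is the byte distance between consecutive rows in host memory.
// Returns 0 on success, -1 on any CUDA or CUFFT error.
int fft_rows_on_device(int dev, cufftComplex* host_chunk,
                       std::size_t host_pitch_bytes, int n, int rows) {
    if (cudaSetDevice(dev) != cudaSuccess) return -1;

    const std::size_t row_bytes = std::size_t(n) * sizeof(cufftComplex);
    cufftComplex* d_data = nullptr;
    if (cudaMalloc(reinterpret_cast<void**>(&d_data), row_bytes * rows) != cudaSuccess)
        return -1;

    int status = -1;
    cufftHandle plan;
    int dims[1] = {n};

    // One 2D copy moves all rows of the chunk instead of one memcpy per row.
    if (cudaMemcpy2D(d_data, row_bytes, host_chunk, host_pitch_bytes,
                     row_bytes, rows, cudaMemcpyHostToDevice) == cudaSuccess &&
        // One plan covering `rows` 1D transforms of length n, packed back to back
        // (NULL inembed/onembed selects the default contiguous layout).
        cufftPlanMany(&plan, 1, dims, NULL, 1, n, NULL, 1, n,
                      CUFFT_C2C, rows) == CUFFT_SUCCESS) {
        if (cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD) == CUFFT_SUCCESS &&
            // The blocking copy back also waits for the transform to finish.
            cudaMemcpy2D(host_chunk, host_pitch_bytes, d_data, row_bytes,
                         row_bytes, rows, cudaMemcpyDeviceToHost) == cudaSuccess)
            status = 0;
        cufftDestroy(plan);
    }
    cudaFree(d_data);
    return status;
}
```

With p GPUs this would be called once per device with `rows = N/p`, ideally from separate host threads or with asynchronous copies so the devices overlap; that part is left out here.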
Do you mean using batches of 1D FFTs via cufftPlanMany(…)? I’ve noticed that GPUDirect doesn’t work if the GPUs are not attached to the same chipset; at least it doesn’t on the GPU node we have. But I still think it could be used for 2 GPUs, and perhaps it could save a GPU -> CPU transfer if the data is transposed locally. Thank you.
In the context of 2D CUFFT, how could it be done? Running p 2D FFTs on (N*N/p) chunks wouldn’t suffice, so I probably need to think of another way. I wasn’t able to access the URL you posted. Thank you.
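For reference, a small sketch of how one might check which GPU pairs on a node report direct peer access, and copy between two devices either way. The helper names `report_peer_access` and `copy_between_gpus` are made up for illustration; the runtime calls themselves (available from CUDA 4.0 on) are standard. As far as I understand, `cudaMemcpyPeer` still works when direct access is unavailable, but the data is then staged through host memory.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>

// Print, for every ordered pair of GPUs, whether direct peer access is reported.
void report_peer_access() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return;
    for (int a = 0; a < count; ++a) {
        for (int b = 0; b < count; ++b) {
            if (a == b) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, a, b);
            std::printf("GPU %d -> GPU %d: %s\n", a, b, ok ? "peer access" : "no peer access");
        }
    }
}

// Copy `bytes` from src (on src_dev) to dst (on dst_dev).
// Enables direct access first when the pair supports it.
cudaError_t copy_between_gpus(void* dst, int dst_dev,
                              const void* src, int src_dev, std::size_t bytes) {
    int ok = 0;
    cudaDeviceCanAccessPeer(&ok, dst_dev, src_dev);
    if (ok) {
        cudaSetDevice(dst_dev);
        cudaError_t e = cudaDeviceEnablePeerAccess(src_dev, 0);
        if (e == cudaErrorPeerAccessAlreadyEnabled)
            cudaGetLastError();  // clear the benign "already enabled" error
        else if (e != cudaSuccess)
            return e;
    }
    return cudaMemcpyPeer(dst, dst_dev, src, src_dev, bytes);
}
```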
Thank you for your advice… I see a large difference in my flop count now: my naive version, which had 2*N memcpys, gave around 20 GFlops, whereas with (p batches * N cols) it comes in at around 180 GFlops! More performance could be squeezed out if a local transpose is done.