I would like to compute FFTs on a 2^14 x 2^14 2D array of cuDoubleComplex, which takes 4 GB of memory. I have two Quadro M5000 cards with 8 GB each that can communicate with each other.
I set the GPUs with
cufftXtSetGPUs(plan_multi, nGPUs, whichGPUs);
I create a plan with
cufftMakePlan2d(plan_multi, 16384, 16384, CUFFT_Z2Z, worksize);
This creates a work area of ~6 GB on EACH of the devices, and when I try to allocate my variable with
cufftXtMalloc(plan_multi, &d_u, CUFFT_XT_FORMAT_INPLACE);
there is no space.
Can the situation be improved, and can cufftXt be set to use a smaller work area? The documentation points to cufftXtSetWorkArea(), but it only says that I have to make sure enough memory is allocated for the work area.
Further, can I split my input array across the GPUs by setting the data pointers in the descriptor to locations on the different devices (as far as I understand, this is how you tile in cuBLAS)?
I couldn’t find anything illuminating in the documentation, so any reference is welcome.
What I have managed to do so far is to move the input array onto one of the devices, set up a single-device plan on the other device (which takes a bit more than 4 GB), and compute through P2P. This is not entirely satisfactory.
Thanks a lot