Memory allocation control for cufftXt

Hello,

I would like to compute FFTs of a 2^14x2^14 2D cuDoubleComplex array, which takes 4 GB of memory. I have two Quadro M5000s with 8 GB each that can communicate with each other over P2P.

I set the GPUs with

cufftXtSetGPUs(plan_multi, nGPUs, whichGPUs);

I create a plan with

cufftMakePlan2d(plan_multi, 16384, 16384, CUFFT_Z2Z, worksize);

This creates a work area of ~6 GB on EACH of the devices, and when I try to allocate my variable with

cudaLibXtDesc *d_u;
cufftXtMalloc(plan_multi, &d_u, CUFFT_XT_FORMAT_INPLACE);

there is no space left.
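For completeness, the full sequence I am running is roughly the following (device IDs 0 and 1 are assumed here; error checking omitted). It prints the work-area size cuFFT requests on each device, which is where I see the ~6 GB figure:

```cpp
#include <cstdio>
#include <cufft.h>
#include <cufftXt.h>

int main() {
    int nGPUs = 2, whichGPUs[2] = {0, 1};  // assumed device IDs
    cufftHandle plan_multi;
    size_t worksize[2];                    // one entry per GPU

    cufftCreate(&plan_multi);
    cufftXtSetGPUs(plan_multi, nGPUs, whichGPUs);
    cufftMakePlan2d(plan_multi, 16384, 16384, CUFFT_Z2Z, worksize);

    // Report the work area cuFFT wants on each device.
    for (int i = 0; i < nGPUs; ++i)
        printf("GPU %d work area: %.2f GB\n",
               whichGPUs[i], worksize[i] / 1073741824.0);

    cufftDestroy(plan_multi);
    return 0;
}
```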

Can this be improved, i.e. can cufftXt be told to use a smaller work area? The documentation points to cufftXtSetWorkArea(), but that only lets me supply the work area myself; I still have to make sure enough memory is allocated for it.

Further, can I split my input array across the GPUs myself by setting the data pointers in the descriptor to locations on the different devices (as far as I understand, this is how tiling works in cuBLAS)?
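For context, the pattern I would expect to work, with cufftXt distributing the data itself rather than me setting descriptor pointers by hand, is something like this (h_u is assumed to be my 16384x16384 host array):

```cpp
// Sketch: cuFFT splits the array across the GPUs of the plan.
cudaLibXtDesc *d_u;
cufftXtMalloc(plan_multi, &d_u, CUFFT_XT_FORMAT_INPLACE);

// Scatter host data to the devices in the plan's layout.
cufftXtMemcpy(plan_multi, d_u, h_u, CUFFT_COPY_HOST_TO_DEVICE);

// In-place Z2Z transform on the distributed descriptor.
cufftXtExecDescriptorZ2Z(plan_multi, d_u, d_u, CUFFT_FORWARD);

// Gather the result back to the host.
cufftXtMemcpy(plan_multi, h_u, d_u, CUFFT_COPY_DEVICE_TO_HOST);

cufftXtFree(d_u);
```

But it is exactly the cufftXtMalloc step that fails for lack of memory.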

I couldn’t find anything illuminating in the documentation, so any reference is welcome.

What I have managed to do so far is to move the input array onto one of the devices, set up a single-device plan on the other device (which takes a bit more than 4 GB), and compute via P2P. This is not entirely satisfactory.
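In case it helps, this workaround looks roughly as follows (sizes and device IDs as above; error checking omitted):

```cpp
#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const size_t N = 16384;

    // Input array lives on device 0 (4 GB of cuDoubleComplex).
    cudaSetDevice(0);
    cuDoubleComplex *d_u0;
    cudaMalloc(&d_u0, sizeof(cuDoubleComplex) * N * N);

    // Plan (and its >4 GB work area) lives on device 1,
    // which is allowed to dereference device-0 memory via P2P.
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);

    cufftHandle plan_single;
    cufftPlan2d(&plan_single, N, N, CUFFT_Z2Z);

    // The transform reads and writes d_u0 across the P2P link.
    cufftExecZ2Z(plan_single, d_u0, d_u0, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cufftDestroy(plan_single);
    cudaFree(d_u0);
    return 0;
}
```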

Thanks a lot

Update:

So, I have tried to compute in single precision instead of double. I have a PDE solver running on one of the Quadros that takes 6 GB while running: 2 GB for the variable, 2 GB for an auxiliary variable (the PDE has a nonlocal term), and 2 GB of workspace for the cuFFT.

I changed the code to use two GPUs via cufftXt, and suddenly I am out of memory: each of the above entities now takes 3 GB on EACH of the devices.
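These numbers come from watching per-device memory with the standard runtime query, e.g.:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Print free/total memory on each visible device.
int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int dev = 0; dev < n; ++dev) {
        cudaSetDevice(dev);
        size_t freeB = 0, totalB = 0;
        cudaMemGetInfo(&freeB, &totalB);
        printf("GPU %d: %.2f GB free of %.2f GB\n",
               dev, freeB / 1073741824.0, totalB / 1073741824.0);
    }
    return 0;
}
```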

What am I doing wrong?