CUDA maximum size of matrices on a 3GB RAM GPU


I am trying to run a program on Tesla M2050/2070 accelerators. The program performs many in-place cuFFT calls. In my code I allocate one cufftHandle, four double matrices, and two cufftDoubleComplex matrices, each of size "totsize". The program crashes when totsize exceeds 884736. The GPUs on the cluster have either 3 or 6 GB of RAM, but my matrices combined are far below that, between 50 MB and 500 MB, plus the cufftHandle. The largest 3D case I could run was 96x96x96, which takes only about 20 MB. That is far too small: I need at least 512x512x512 to make it worth buying our own Tesla card for our research.

Is this normal when the cuFFT library is used, or is it a problem with nvcc and the Linux environment?

Late edit: I had other errors in the code. The maximum 2D size for my problem is 7000x7000, while for 3D it is 384x384x384 on the 3 GB RAM GPU.

Perhaps you have already solved your problem, but I have just some thoughts.

If your 2D problem size is 7000x7000, then you are allocating 7000 * 7000 * 8 bytes (double) * 8 (4 double + 2 double complex = 8 doubles per element) = 3.136 GB.
If your 3D problem size is 384x384x384, then you are allocating 3.624 GB.
In both cases it seems you are trying to allocate more space than is available on your cards.

Nevertheless, the cuFFT Library User's Guide says, with reference to the cufftEstimate2d() function, that

During plan execution, CUFFT requires a work area for temporary storage of
intermediate results. This call returns an estimate for the size of the work area required,
given the specified parameters, and assuming default plan settings.

Furthermore, with reference to the cufftGetSize2d() function, the guide says that

This call gives a more accurate estimate of the work area size required for a plan than
cufftEstimate2d(), given the specified parameters, and taking into account any plan
settings that may have been made.

Those two functions give you an idea of the extra memory cuFFT requires to store intermediate results.
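A minimal sketch of how those two queries can be used for the 7000x7000 double-precision complex case (compile with nvcc and link against -lcufft; error handling is reduced to a bare minimum):

```cuda
#include <cstdio>
#include <cufft.h>

int main(void)
{
    size_t estimate = 0, accurate = 0;

    // Rough work-area estimate, assuming default plan settings.
    if (cufftEstimate2d(7000, 7000, CUFFT_Z2Z, &estimate) != CUFFT_SUCCESS)
        return 1;

    // More accurate figure: create a plan handle first, then query it.
    cufftHandle plan;
    if (cufftCreate(&plan) != CUFFT_SUCCESS)
        return 1;
    if (cufftGetSize2d(plan, 7000, 7000, CUFFT_Z2Z, &accurate) != CUFFT_SUCCESS)
        return 1;

    printf("estimated work area: %zu bytes\n", estimate);
    printf("accurate  work area: %zu bytes\n", accurate);

    cufftDestroy(plan);
    return 0;
}
```

Adding the reported work-area size to the 3.136 GB of data arrays shows whether the total fits in the 3 GB (or 6 GB) on the card.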