Large data size for cuFFT

Hello,

Our research group has recently acquired a TITAN Xp GPU, and I have been trying to set it up for signal processing for a research project, which involves performing FFTs on very large 1-D arrays of input data (typically N = 10^7 to 10^8 elements, or even larger). We are using double precision here because single-precision floats do not provide enough accuracy for the application (even though FP64 is considerably slower than FP32 on GeForce GPUs).

Still, the performance is good and offers a satisfactory speedup over the CPU implementation for data sizes N <= 10^7. However, whenever my 1-D data size exceeds a certain amount (~10^8 for double-precision floating point), the FFT kernel fails to launch. I checked the cuFFT documentation, and it seems that cuFFT has a size limit on 1-D transforms, namely 64 million elements for single precision and 128 million for double precision.

I’m pretty new to cuFFT and still learning the library (so please forgive me if this sounds like a silly question). But I wonder if there is any way around the size limit of cuFFT? If not, is it possible to perform FFTs in parts? (or, would it be possible to map it to, say, some 2-D FFTs - which are batches of smaller 1-D FFTs - to get around the dimension limit?)

(Also, just another general cuFFT question: when benchmarking, I always see that my first call to CUDA is taking up more time than later calls. Is this normal due to initialization of the CUDA pipeline? If so, is there some way to pre-initialize, i.e. warm up, the card, so that each cuFFT call takes the same time? I’ve tried cudaSetDevice and cudaFree but unfortunately neither makes any difference.)

Thank you very much for your help!

Yours Sincerely,
Wenyuan

Can you point out in the CUFFT documentation where you are drawing this conclusion from:

I don’t think those limits are correct even for “ordinary” cufft calls, and cufft should support “large” transforms using the cufftMakePlanMany64() API:

http://docs.nvidia.com/cuda/cufft/index.html#unique_1744355423

If your GPU has enough memory, and depending on transform type and data type, this should handle transforms involving perhaps billions of elements. For a GPU with 8GB, it should be possible to handle a Z2Z transform of perhaps on the order of 250M elements (although I haven’t tried it - so maybe your 128M number is correct for a 8GB GPU and Z2Z). The principal limiter here will be memory size, not anything specific to the CUFFT API.
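A minimal sketch of that API, assuming a recent CUDA toolkit and a card with enough memory (the 250M element count is purely illustrative, not a verified limit):

```cpp
#include <cufft.h>
#include <cstdio>

int main(void)
{
    // Illustrative size only; scale to your GPU's memory.
    long long n = 250LL * 1000 * 1000;
    long long dims[1] = { n };
    cufftHandle plan;
    size_t workSize = 0;

    if (cufftCreate(&plan) != CUFFT_SUCCESS) return 1;

    // 64-bit plan interface: rank 1, one batch, packed input/output layout.
    cufftResult r = cufftMakePlanMany64(plan, 1, dims,
                                        NULL, 1, n,   /* input layout  */
                                        NULL, 1, n,   /* output layout */
                                        CUFFT_Z2Z, 1, &workSize);
    if (r != CUFFT_SUCCESS) { printf("plan failed: %d\n", (int)r); return 1; }
    printf("plan OK, work area = %zu bytes\n", workSize);
    cufftDestroy(plan);
    return 0;
}
```

If plan creation succeeds, the data itself (n elements of 16 bytes each for Z2Z) plus the reported work area gives the total memory footprint.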

Yes, the first CUFFT calls involve library initialization. Neither cudaSetDevice nor cudaFree has anything to do with CUFFT library initialization. If you want to fully initialize the library, run a small "warm-up" transform; that calling sequence should incur most of the one-time overhead.
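A warm-up along those lines might look like this sketch (the 256-point Z2Z transform is an arbitrary choice; any small transform should trigger initialization):

```cpp
#include <cufft.h>
#include <cuda_runtime.h>

// Run one tiny throwaway transform at startup so that later, timed
// cuFFT calls do not pay the one-time library initialization cost.
static void cufftWarmup(void)
{
    cufftHandle plan;
    cufftDoubleComplex *d = NULL;
    cudaMalloc((void**)&d, 256 * sizeof(cufftDoubleComplex));
    cufftPlan1d(&plan, 256, CUFFT_Z2Z, 1);
    cufftExecZ2Z(plan, d, d, CUFFT_FORWARD);
    cudaDeviceSynchronize();
    cufftDestroy(plan);
    cudaFree(d);
}
```

Call it once after selecting the device, before any timed region.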

Sanity check: You are running on a 64-bit operating system and are building a 64-bit CUDA application, correct?

When a CUFFT API call fails, it should give you an error code. What is that error code in your case? If you don’t have proper error checking on all CUDA and GPU library API calls, the observed CUFFT failure could also be the result of a previous API call that failed.
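One possible checking pattern, as a sketch: wrap every runtime and cuFFT call so the first failure reports its own file and line, rather than surfacing later as a confusing CUFFT error:

```cpp
#include <cufft.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with location info on the first failing call.
#define CHECK_CUDA(call) do {                                   \
    cudaError_t err_ = (call);                                  \
    if (err_ != cudaSuccess) {                                  \
        fprintf(stderr, "%s:%d CUDA error: %s\n",               \
                __FILE__, __LINE__, cudaGetErrorString(err_));  \
        exit(EXIT_FAILURE);                                     \
    } } while (0)

#define CHECK_CUFFT(call) do {                                  \
    cufftResult st_ = (call);                                   \
    if (st_ != CUFFT_SUCCESS) {                                 \
        fprintf(stderr, "%s:%d CUFFT error code %d\n",          \
                __FILE__, __LINE__, (int)st_);                  \
        exit(EXIT_FAILURE);                                     \
    } } while (0)
```

Used as `CHECK_CUFFT(cufftPlan1d(&plan, n, CUFFT_Z2Z, 1));` and so on for every call.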

Like txbob, I cannot find the claimed maximum size limits in the CUFFT documentation. Note that CUFFT needs a work area in addition to the input and output data and the cufftGetSize*() family of functions can tell you how much storage is required for the work area.

Depending on how the GPU is being used, it could also be the case that there are other memory allocations on the GPU (other than the ones used for the FFT) or that memory space has become fragmented. Note that if you are on Windows 10 with the default WDDM driver model, you will not be able to use more than about 81% of the total GPU memory (that’s a limitation of WDDM 2.0, best anyone can tell).

Are your FFT dimensions composed solely of powers of 2, 3, 5, and 7? That would be conducive to high performance; whether it also minimizes memory usage during the FFT I cannot say, but it seems like a plausible assumption worth testing.

Hello, I have the same problem. I want to perform a 3-D FFT on a large data set. Even though device memory is sufficient, I failed to create a cufft plan. Is there any way to solve this? Thank you.

Which API call did you use? What error code or error message was returned?

How was it established there is enough device memory? What is your GPU, what is your operating system, and how big is the data set you are applying the FFT to?

CUFFT requires a work area in addition to storage for the data being transformed. Have you tried an appropriate cufftGetSize* call to get an accurate estimate of how much space is needed? See:

https://docs.nvidia.com/cuda/cufft/index.html#work-estimate-refined
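For what it's worth, a sketch of querying both estimates for a 3-D C2C transform. Note that cufftGetSize3d needs only a handle from cufftCreate, not a fully initialized plan; the dimensions here are placeholders:

```cpp
#include <cufft.h>
#include <cstdio>

int main(void)
{
    int nx = 800, ny = 300, nz = 2048;   /* placeholder dimensions */
    size_t rough = 0, refined = 0;
    cufftHandle h;

    cufftEstimate3d(nx, ny, nz, CUFFT_C2C, &rough);     /* quick estimate */
    cufftCreate(&h);
    cufftGetSize3d(h, nx, ny, nz, CUFFT_C2C, &refined); /* more accurate  */
    printf("estimate: %zu bytes, refined: %zu bytes\n", rough, refined);
    cufftDestroy(h);
    return 0;
}
```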

njuffa, thank you for your help!
1. Which API call did you use? What error code or error message was returned?
I use cufftPlan1d to initialize a cufftPlan object. It returns the error CUFFT_ALLOC_FAILED.
2. Device memory should be enough:
GPU: GTX 1050 Ti. OS: Windows 10, 64-bit.
The length of the data set is 200*1024*1024.
After allocating device memory ((200*1024*1024 * sizeof(cufftComplex)) * 2, for input and output), about 3.1 GB of device memory is consumed; I call cudaMemGetInfo to get the free memory. So I thought the remaining device memory was enough to perform the transform.
That may be wrong, as I see from your reply, because I had ignored the work area size.
Before I posted this problem, I had used cufftEstimate*, not cufftGetSize*, because cufftGetSize* needs a cufftPlan object, which cannot be initialized in my code.
In addition, "work area size" confuses me. Why does CUFFT require a work area when the data set is already stored in global memory?
Could you give me some advice on applying cuFFT to this data set? I have two GTX 1050 Ti GPUs. Is there a better device that could run this transform on a single device?

The GTX 1050 Ti comes with 4 GB of onboard memory. Since you use Windows 10 with a WDDM driver, about 81% of that, or 3.24 GB, are available to CUDA applications. If your inputs and outputs consume 3.1 GB as you state, that leaves very little memory, and it is not hard to imagine that it would be insufficient for the CUFFT work area.

I don’t have a good idea how the size of the CUFFT work area corresponds to the size of the data passed in. What I would do is start with smaller FFTs and use cufftGetSize on that. Then I would slowly increase the data set size in steps, while noting the work area size returned by cufftGetSize after each increase. Keep increasing the data set size until plan generation fails with an allocation failure. Presumably plotting work-area size vs input size should provide an interesting graph with reasonably predictive powers.
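One way to sketch that sweep (untested here) is to step one dimension and log the work-area size cuFFT reports at each step; a failing size would show up as a non-success return code. The 800x300 base is just an example:

```cpp
#include <cufft.h>
#include <cstdio>

int main(void)
{
    // Grow one dimension in steps and record the reported work-area
    // size, to see how it scales with the data set.
    for (int nt = 256; nt <= 4096; nt *= 2) {
        cufftHandle h;
        size_t workSize = 0;
        cufftCreate(&h);
        cufftResult r = cufftGetSize3d(h, 800, 300, nt, CUFFT_C2C, &workSize);
        if (r == CUFFT_SUCCESS)
            printf("nt=%4d  work area = %zu bytes\n", nt, workSize);
        else
            printf("nt=%4d  failed with error %d\n", nt, (int)r);
        cufftDestroy(h);
    }
    return 0;
}
```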

Thank you very much, njuffa! Now I understand.

njuffa, there is another problem I would like to seek your advice on. In my tests, I find that the cuFFT work area can become larger than the total memory of the device.
Fortunately, there seems to be a solution for this: CUDA Unified Virtual Addressing. From page 22 of the cuFFT Library User's Guide: "In addition to the regular memory acquired with cudaMalloc, usage of CUDA Unified Virtual Addressing enables cuFFT to use the following types of memory as work area memory: pinned host memory, managed memory, memory on GPU other than the one performing the calculations."
So I guessed that I could allocate a large block of managed memory or pinned host memory to use as the work area, but it failed.
To reduce the memory requirement, the data set is also stored in managed memory. I have run the following code on a GTX 1050 Ti and a Tesla P100; the host has 28 GB of memory.
I do not know why it fails. Thank you!

cufftComplex* data;        /* input/output, stored in managed memory */
cufftComplex* w;           /* work area for cuFFT */
/* local variables */
size_t freeMem, totalMem;  /* device memory information */
size_t datasize;
size_t work_size;
/* get free memory before any operation */
cudaMemGetInfo(&freeMem, &totalMem);
printf("free= %f\n", freeMem / (1024.0 * 1024));
/* allocate managed memory for the input */
size_t Nx = 800, Ny = 300, Nt = 2048;
datasize = Nx * Ny * Nt * sizeof(cufftComplex);
Check(cudaMallocManaged((void**)&data, datasize));
/* i must not be int: with ~491M elements, i*i would overflow int */
for (size_t i = 0; i < datasize / sizeof(cufftComplex); i++) {
	data[i] = make_cuComplex((float)i, (float)i * i);
}
/* check memory */
cudaMemGetInfo(&freeMem, &totalMem);
printf("free= %f\n", freeMem / (1024.0 * 1024));
/* estimate and allocate managed memory for the cuFFT work area */
cufftEstimate3d(Nx, Ny, Nt, CUFFT_C2C, &work_size);
printf("work_size %f\n", work_size / (1024.0 * 1024));
Check(cudaMallocManaged((void**)&w, work_size));
//Check(cudaHostAlloc((void**)&w, work_size, cudaHostAllocMapped));
/* check memory */
cudaMemGetInfo(&freeMem, &totalMem);
printf("free= %f\n", freeMem / (1024.0 * 1024));
/* create plan for cuFFT & set work area */
cufftHandle plan;
cufftCreate(&plan);
cufftSetAutoAllocation(plan, false);
cufftSetWorkArea(plan, w);
/* check memory */
cudaMemGetInfo(&freeMem, &totalMem);
printf("free= %f\n", freeMem / (1024.0 * 1024));
/* initialize the plan */
cufftMakePlan3d(plan, Nx, Ny, Nt, CUFFT_C2C, &work_size);
/* check memory */
cudaMemGetInfo(&freeMem, &totalMem);
printf("free= %f\n", freeMem / (1024.0 * 1024));
/* execute the FFT transforms */
cufftResult_t fftret = cufftExecC2C(plan, data, data, CUFFT_FORWARD);
cudaDeviceSynchronize();
fftret = cufftExecC2C(plan, data, data, CUFFT_INVERSE);
cudaDeviceSynchronize();
/* check memory */
cudaMemGetInfo(&freeMem, &totalMem);
printf("free= %f\n", freeMem / (1024.0 * 1024));
printf("%f\n", data[2].x);
printf("%f\n", data[2].y);