Large data size for cuFFT

wenyuan.wang · November 21, 2017, 9:14am

Hello,

Our research group has recently acquired a TITAN Xp GPU, and I have been trying to set it up for signal processing for a research project, which involves performing FFTs on very large 1-D arrays of input data (typically of size N=10^7-10^8, or even larger). We’re using double precision here, as single-precision floats don’t provide enough accuracy for the application (even though float_64 is considerably slower than float_32 on GeForce GPUs).

Still, the performance is good and offers a satisfactory speedup over the CPU implementation for data sizes N<=10^7. However, whenever my 1-D data size exceeds a certain amount (~10^8, for double-precision floating point), the FFT kernel fails to launch. I checked the cuFFT documentation, and it seems that cuFFT has a size limit on 1-D transforms, namely 64 million elements for single precision and 128 million for double precision.

I’m pretty new to cuFFT and still learning the library (so please forgive me if this sounds like a silly question). But I wonder if there is any way around the size limit of cuFFT? If not, is it possible to perform FFTs in parts? (or, would it be possible to map it to, say, some 2-D FFTs - which are batches of smaller 1-D FFTs - to get around the dimension limit?)
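
The mapping to smaller batched transforms asked about here corresponds to the standard Cooley-Tukey factorization N = N1*N2, sometimes called the four-step FFT. The following is only a CPU sketch in C++ to show the index arithmetic; it uses a naive O(n^2) DFT for the sub-transforms so that it stays self-contained. The names dft and fft_four_step are illustrative and are not cuFFT functions. On a GPU, steps 1 and 3 would each be one batched 1-D transform. Nothing in this thread establishes that the decomposition is needed to get past the failure reported above.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Naive O(n^2) DFT of n elements read from `in` with the given stride.
std::vector<cd> dft(const cd* in, std::size_t n, std::size_t stride) {
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cd acc(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            // Reduce j*k modulo n to keep the angle small and accurate.
            const double ang = -2.0 * pi * double((j * k) % n) / double(n);
            acc += in[j * stride] * std::polar(1.0, ang);
        }
        out[k] = acc;
    }
    return out;
}

// Length n1*n2 DFT built from n2 transforms of length n1, a twiddle
// multiplication, and n1 transforms of length n2.
// Input index:  j = n2*j1 + j2.  Output index: k = k1 + n1*k2.
std::vector<cd> fft_four_step(const std::vector<cd>& x,
                              std::size_t n1, std::size_t n2) {
    const std::size_t n = n1 * n2;
    assert(x.size() == n);
    const double pi = std::acos(-1.0);
    std::vector<cd> a(n);  // a[k1 * n2 + j2]
    for (std::size_t j2 = 0; j2 < n2; ++j2) {
        // Step 1: transform column j2 (n1 elements, stride n2).
        const std::vector<cd> col = dft(&x[j2], n1, n2);
        for (std::size_t k1 = 0; k1 < n1; ++k1) {
            // Step 2: twiddle factor exp(-2*pi*i * j2*k1 / n).
            const double ang = -2.0 * pi * double(j2 * k1) / double(n);
            a[k1 * n2 + j2] = col[k1] * std::polar(1.0, ang);
        }
    }
    std::vector<cd> out(n);
    for (std::size_t k1 = 0; k1 < n1; ++k1) {
        // Step 3: transform row k1 (n2 contiguous elements).
        const std::vector<cd> row = dft(&a[k1 * n2], n2, 1);
        for (std::size_t k2 = 0; k2 < n2; ++k2) {
            out[k1 + n1 * k2] = row[k2];
        }
    }
    return out;
}
```

Calling fft_four_step(x, 4, 6) on a 24-element input should agree with dft(x.data(), 24, 1) up to rounding error. Note the transposed output ordering (k = k1 + n1*k2), which a GPU version would also have to handle.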

(Also, just another general cuFFT question: when benchmarking, I always see that my first call to CUDA is taking up more time than later calls. Is this normal due to initialization of the CUDA pipeline? If so, is there some way to pre-initialize, i.e. warm up, the card, so that each cuFFT call takes the same time? I’ve tried cudaSetDevice and cudaFree but unfortunately neither makes any difference.)

Thank you very much for your help!

Yours Sincerely,
Wenyuan

Robert_Crovella · November 21, 2017, 1:15pm

Can you point out where in the CUFFT documentation you are drawing this conclusion (the 64 million / 128 million element limits on 1-D transforms) from?

I don’t think those limits are correct even for “ordinary” cufft calls, and cufft should support “large” transforms using the cufftMakePlanMany64() API:

http://docs.nvidia.com/cuda/cufft/index.html#unique_1744355423

If your GPU has enough memory, and depending on transform type and data type, this should handle transforms involving perhaps billions of elements. For a GPU with 8GB, it should be possible to handle a Z2Z transform of perhaps on the order of 250M elements (although I haven’t tried it, so maybe your 128M number is correct for an 8GB GPU and Z2Z). The principal limiter here will be memory size, not anything specific to the CUFFT API.
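
To make the memory arithmetic behind these numbers concrete, here is a small host-side C++ sketch. It counts only the data buffers, assuming 16 bytes per Z2Z element (cufftDoubleComplex is two doubles); the cuFFT work area comes on top and is not modeled. The helper name z2z_data_bytes is made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// cufftDoubleComplex (the Z2Z element type) holds two doubles: 16 bytes.
const std::uint64_t kZ2ZElementBytes = 16;

// Bytes occupied by the data buffers of an n-element Z2Z transform.
// In-place uses one buffer, out-of-place uses two.
// The cuFFT work area is NOT included.
std::uint64_t z2z_data_bytes(std::uint64_t n, bool in_place) {
    const std::uint64_t buffers = in_place ? 1 : 2;
    assert(n <= UINT64_MAX / (kZ2ZElementBytes * buffers));  // no overflow
    return n * kZ2ZElementBytes * buffers;
}
```

Under these assumptions, 250M elements need 4 GB in place and 8 GB out of place, while 128M elements (2^27) need 2 GiB in place.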

Yes, the first CUFFT calls involve library initialization. Neither cudaSetDevice nor cudaFree has anything to do with CUFFT library initialization. If you want to fully initialize the library, run a small “warm-up” transform. The API calling sequence there should incur most of the overhead.
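
For benchmarking, the warm-up idea can be sketched as a generic timing helper in host-side C++: run the operation a few times untimed, then time the remaining calls. The helper name mean_seconds_after_warmup is hypothetical. For GPU work, the callable passed in is assumed to contain the transform call followed by a synchronization such as cudaDeviceSynchronize(), since otherwise only the launch would be timed.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Calls `op` `warmups` times without timing, then returns the mean
// wall-clock seconds per call over `iters` timed calls.
// For GPU work, `op` must synchronize with the device before returning.
template <typename Op>
double mean_seconds_after_warmup(Op&& op, std::size_t warmups,
                                 std::size_t iters) {
    assert(iters > 0);
    for (std::size_t i = 0; i < warmups; ++i) {
        op();  // absorbs one-time initialization cost
    }
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i) {
        op();
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / double(iters);
}
```

One warm-up call is usually the interesting case here, because the overhead in question is paid on first use.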

njuffa · November 21, 2017, 7:28pm

Sanity check: You are running on a 64-bit operating system and are building a 64-bit CUDA application, correct?

When a CUFFT API call fails, it should give you an error code. What is that error code in your case? If you don’t have proper error checking on all CUDA and GPU library API calls, the observed CUFFT failure could also be the result of a previous API call that failed.

Like txbob, I cannot find the claimed maximum size limits in the CUFFT documentation. Note that CUFFT needs a work area in addition to the input and output data and the cufftGetSize*() family of functions can tell you how much storage is required for the work area.

Depending on how the GPU is being used, it could also be the case that there are other memory allocations on the GPU (other than the ones used for the FFT) or that memory space has become fragmented. Note that if you are on Windows 10 with the default WDDM driver model, you will not be able to use more than about 81% of the total GPU memory (that’s a limitation of WDDM 2.0, best anyone can tell).

Are your FFT dimensions composed solely of powers of 2, 3, 5, and 7? That would be conducive to high performance; whether it also minimizes memory usage during the FFT I cannot say, but it seems like a plausible assumption worth testing.
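
A quick way to check this property on the host, and to find the next size that has it (for zero-padding, if the application allows that), is sketched below in C++. The function names are made up for this illustration; the search in next_7_smooth is a plain linear scan, which is adequate for a one-off check.

```cpp
#include <cassert>
#include <cstdint>

// True if n > 0 and n has no prime factors other than 2, 3, 5 and 7.
bool is_7_smooth(std::uint64_t n) {
    if (n == 0) return false;
    const std::uint64_t primes[] = {2, 3, 5, 7};
    for (std::uint64_t p : primes) {
        while (n % p == 0) n /= p;
    }
    return n == 1;
}

// Smallest size >= n composed solely of the factors 2, 3, 5 and 7.
std::uint64_t next_7_smooth(std::uint64_t n) {
    assert(n <= (1ULL << 62));  // 2^62 is 7-smooth, so the scan ends in range
    while (!is_7_smooth(n)) ++n;
    return n;
}
```

For example, 10^8 = 2^8 * 5^8 passes the check, while 97 does not; the next qualifying size after 97 is 98 = 2 * 7^2.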

wusihai18 · August 28, 2018, 2:29pm

Hello, I have the same problem. I want to perform a 3-D FFT on a large data set. Even though device memory is sufficient, I failed to create the cuFFT plan. Is there any way to solve it? Thank you.

njuffa · August 28, 2018, 7:01pm

Which API call did you use? What error code or error message was returned?

How was it established there is enough device memory? What is your GPU, what is your operating system, and how big is the data set you are applying the FFT to?

CUFFT requires a work area in addition to storage for the data being transformed. Have you tried an appropriate cufftGetSize* call to get an accurate estimate of how much space is needed? See:

https://docs.nvidia.com/cuda/cufft/index.html#work-estimate-refined

wusihai18 · August 29, 2018, 5:29am

njuffa, thank you for your help!
1. Which API call did you use? What error code or error message was returned?
I use cufftPlan1d to initialize a cuFFT plan object. It returns the error CUFFT_ALLOC_FAILED.
2. Device memory is enough
GPU: GTX 1050 Ti. OS: Windows 10, 64-bit.
The length of the data set is 200*1024*1024.
After allocating device memory ((200*1024*1024*sizeof(cufftComplex))*2 for input and output), which consumes about 3.1 GB of device memory, I call cudaMemGetInfo to get the free memory. So I thought the remaining device memory was enough to apply the transform.
Having read your reply, I see this may be wrong, because I ignored the work area size.
Before I posted this problem, I used cufftEstimate*, not cufftGetSize*, because cufftGetSize* needs a cuFFT plan object, which cannot be initialized in my code.
In addition, the “work area size” confuses me. Why does CUFFT require a work area in addition, when the data set has already been stored in global memory?
Could you give me some advice on applying cuFFT to this data set? I have two GTX 1050 Ti GPUs. Is there a better device that can run this transform on a single device?

njuffa · August 29, 2018, 7:13am

The GTX 1050 Ti comes with 4 GB of onboard memory. Since you use Windows 10 with a WDDM driver, about 81% of that, or 3.24 GB, is available to CUDA applications. If your inputs and outputs consume 3.1 GB as you state, that leaves very little memory, and it is not hard to imagine that it would be insufficient for the CUFFT work area.
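
The budget described here can be written out as a short host-side C++ sketch. The 81% figure is the empirical estimate quoted in this thread, not a documented constant, and the helper name work_area_headroom is invented for this example.

```cpp
#include <cassert>
#include <cstdint>

// Bytes left for the cuFFT work area after the caller's own buffers,
// given that only `usable_percent` of `total_bytes` can be allocated.
// A negative result means the buffers alone already exceed the budget.
std::int64_t work_area_headroom(std::int64_t total_bytes, int usable_percent,
                                std::int64_t buffer_bytes) {
    assert(total_bytes >= 0 && usable_percent >= 0 && usable_percent <= 100);
    const std::int64_t usable = total_bytes * usable_percent / 100;
    return usable - buffer_bytes;
}
```

With the round numbers above (4 GB total, 3.1 GB of buffers) this leaves 0.14 GB. Assuming instead a 4 GiB card and two buffers of 200*1024*1024 elements at 8 bytes per cufftComplex, the result is about 118 MiB, which the work area would have to fit into.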

I don’t have a good idea how the size of the CUFFT work area corresponds to the size of the data passed in. What I would do is start with smaller FFTs and use cufftGetSize on that. Then I would slowly increase the data set size in steps, while noting the work area size returned by cufftGetSize after each increase. Keep increasing the data set size until plan generation fails with an allocation failure. Presumably plotting work-area size vs input size should provide an interesting graph with reasonably predictive powers.

wusihai18 · September 1, 2018, 1:08pm

Thank you very much njuffa! Now I understand.

wusihai18 · September 8, 2018, 3:10pm

njuffa, there is a problem I would like your advice on. In my tests, I find that the cuFFT work area becomes larger than the total memory of the device.
Fortunately there appears to be a solution for it: Unified Virtual Addressing. From page 22 of the cuFFT Library User’s Guide: “In addition to the regular memory acquired with cudaMalloc, usage of CUDA Unified Virtual Addressing enables cuFFT to use the following types of memory as work area memory: pinned host memory, managed memory, memory on GPU other than the one performing the calculations.”
So I guessed that I could allocate a large block of managed memory or pinned host memory to use as the work area. But it failed.
To reduce the device memory the problem needs, the data set is also stored in managed memory. I have run the following code on a GTX 1050 Ti and a Tesla P100; the host has 28 GB of memory.
I do not know why it fails. Thank you!

cufftComplex* data; /* input */
cufftComplex* w;    /* work area for cuFFT */
/* local variables */
size_t freeMem, totalMem; /* device memory information */
size_t datasize;
size_t work_size;
/* get free memory before any operation */
cudaMemGetInfo(&freeMem, &totalMem);
printf("free= %f\n", freeMem / (1024.0 * 1024));
/* allocate managed memory for input */
size_t Nx = 800, Ny = 300, Nt = 2048;
datasize = Nx * Ny * Nt * sizeof(cufftComplex);
Check(cudaMallocManaged((void**)&data, datasize));
/* size_t index and float arithmetic: an int i*i would overflow for large i */
for (size_t i = 0; i < datasize / sizeof(cufftComplex); i++) {
	data[i] = make_cuComplex((float)i, (float)i * (float)i);
}
/* check memory */
cudaMemGetInfo(&freeMem, &totalMem);
printf("free= %f\n", freeMem / (1024.0 * 1024));
/* allocate managed memory for the cuFFT work area */
cufftEstimate3d(Nx, Ny, Nt, CUFFT_C2C, &work_size);
printf("work_size %f\n", work_size / (1024.0 * 1024));
Check(cudaMallocManaged((void**)&w, work_size));
//Check(cudaHostAlloc((void**)&w, work_size, cudaHostAllocMapped));
/* check memory */
cudaMemGetInfo(&freeMem, &totalMem);
printf("free= %f\n", freeMem / (1024.0 * 1024));
/* create plan for cuFFT and set work area */
cufftHandle plan;
cufftCreate(&plan);
cufftSetAutoAllocation(plan, false);
cufftSetWorkArea(plan, w);
/* check memory */
cudaMemGetInfo(&freeMem, &totalMem);
printf("free= %f\n", freeMem / (1024.0 * 1024));
/* initialize plan */
cufftMakePlan3d(plan, Nx, Ny, Nt, CUFFT_C2C, &work_size);
/* check memory */
cudaMemGetInfo(&freeMem, &totalMem);
printf("free= %f\n", freeMem / (1024.0 * 1024));
/* execute forward and inverse FFT */
cufftResult_t fftret = cufftExecC2C(plan, data, data, CUFFT_FORWARD);
cudaDeviceSynchronize();
fftret = cufftExecC2C(plan, data, data, CUFFT_INVERSE);
cudaDeviceSynchronize();
/* check memory */
cudaMemGetInfo(&freeMem, &totalMem);
printf("free= %f\n", freeMem / (1024.0 * 1024));
printf("%f\n", data[2].x);
printf("%f\n", data[2].y);
