CUDA enabled cuFFT slower than general purpose FFTW


Today I ported my code to use nVidia’s cuFFT libraries, using the FFTW interface API (include cufft.h instead, keep same function call names etc.)

What I found is that it’s much slower than before:

  • 30hz using CPU-based FFTW
  • 1hz using GPU-based cuFFTW

I have already tried enabling all cores to max, using:

  • nvpmodel -m 0

The code flow is the same between the two variants. Could the speed difference be due to data alignment for SIMD?

  • FFTW memory allocation was done using fft_malloc_complex
  • cuFFTW memory allocation was done using regular C++ array initialization of Complex datatypes


Could you lock frequency to the max and try it again?

sudo ~/


Thanks for the suggestion.
I’m now up to 2hz, instead of 1.

However, I was at 30 before with the regular CPU-based FFTW.

I have reworked my code to use the native cufft function calls thusly:

cudaMalloc((void**)&FFTout, sizeof(cufftComplex)*(BINSIZE/2 +1));        
cudaMalloc((void**)&FFTin, sizeof(cufftReal)*(BINSIZE));        

cufftPlan1d(&planFFT,BINSIZE, CUFFT_R2C, 1);

cudaMemcpy(FFTin, data, BINSIZE*sizeof(cufftReal), cudaMemcpyHostToDevice);
cufftExecR2C(planFFT, FFTin, FFTout);				
cudaMemcpy(&output, FFTout, (BINSIZE/2+1)*sizeof(cufftComplex), cudaMemcpyDeviceToHost);

I am now observing 3hz total speed.
My best guess is that it’s due to the extra memory read/writes, due to the way that GPUs usually work.

However since the TX2 is sharing memory, is there a way to speed it up here without the extra cudaMemcpy?

For reference, my FFT size is 512, looped across many sets.


Have you checked our cuFFT sample?

Complex *h_signal = (Complex *)malloc(sizeof(Complex) * SIGNAL_SIZE);

You can find this example at ‘/usr/local/cuda-8.0/samples/7_CUDALibraries/simpleCUFFT/’.

I tracked down the issue to a combination of:

  1. small FFT size which doesn’t parallelize that well on cuFFT
  2. initial approach of looping a 1D fft plan.

I got some performance gains by:

  1. Setting cuFFT to a batch mode, which reduced some initialization overheads.
  2. allocating the host-side memory using cudaMallocHost, which pegs the CPU-side memory and sped up transfers to GPU device space.

As a result, I’m now up to 20hz using cuFFT, versus 30hz using CPU-based FFTW.
Still seeking methods of speeding things up.


To give a further suggestion, do you mind to share the source of both GPU/CPU-based FFTW?


	int FFTSIZE = 512;
	int BATCHSIZE = 4096;
	int FFTSIZE_OUT = (FFTSIZE/2 + 1);	//real-to-complex fft
	//host side to hold R2C results
	cudaMallocHost((void**)&FFT_data_host, FFTSIZE_OUT * BATCHSIZE * sizeof(Complex));	

	//device side
	cudaMalloc((void**)&cudaFFTin, sizeof(cufftReal) * FFTSIZE * BATCHSIZE);			
	cudaMalloc((void**)&cudaFFTout, sizeof(cufftComplex) * FFTSIZE_OUT * BATCHSIZE);
	cufftHandle planFFT
	    copy data into FFT_data_host
	//copy to device
	cudaMemcpy(cudaFFTin, &FFT_data_host[0], FFTSIZE * BATCHSIZE * sizeof(float), cudaMemcpyHostToDevice);
	cufftExecR2C(planFFT, cudaFFTin, cudaFFTout);
	//wait for sync
	if (cudaDeviceSynchronize() != cudaSuccess){cout << "CUDA: RangeFFT failed to synchronize" << endl;}
	//copy back to host 
	cudaMemcpy(&FFT_data_host[0], cudaFFTout, BATCHSIZE * FFTSIZE_OUT * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

    //use the planmany interface to batch jobs together    
    int rank = 1;
    const int n = FFTSIZE;
    int howmany = BATCHSIZE;
    int idist = FFTSIZE; 
    int odist = FFTSIZE/2+1; 
    int istride = 1; 
    int ostride = 1; 	
    const int *inembed = NULL;
    const int *onembed = NULL;
    FFTW_plan = fftwf_plan_many_dft_r2c(rank, &n, howmany, FFTW_data_in, inembed, istride, idist, reinterpret_cast<fftwf_complex*>(&FFTW_data_out[0]), onembed, ostride, odist, FFTW_MEASURE);
	copy data into FFTW_data_in
    fftwf_execute_dft_r2c(fft_planRange,  reinterpret_cast<float*>(&FFTW_data_in[0]), reinterpret_cast<fftwf_complex*>(&FFTW_data_out[0]));

Is it possible to leverage CUDA unified memory to remove the need for those cudamemCopies?

Let’s say I initialize the buffers using cudaMallocManaged(); under the hood would it still be the same speed, or is my CPU/GPU able to share the same physical ram without transfer overheads?


For sure. You can get more information here: