CUDA enabled cuFFT slower than general purpose FFTW

xtracrispy · September 21, 2017, 12:22am

Hello,

Today I ported my code to use nVidia’s cuFFT libraries, using the FFTW interface API (include cufft.h instead, keep same function call names etc.)

What I found is that it’s much slower than before:

30hz using CPU-based FFTW
1hz using GPU-based cuFFTW

I have already tried enabling all cores to max, using:

nvpmodel -m 0

The code flow is the same between the two variants. Could the speed difference be due to data alignment for SIMD?

FFTW memory allocation was done using fft_malloc_complex
cuFFTW memory allocation was done using regular C++ array initialization of Complex datatypes

AastaLLL · September 21, 2017, 1:58am

Hi,

Could you lock frequency to the max and try it again?

sudo ~/jetson_clocks.sh

Thanks.

xtracrispy · September 21, 2017, 3:58am

Thanks for the suggestion.
I’m now up to 2hz, instead of 1.

However, I was at 30 before with the regular CPU-based FFTW.

xtracrispy · September 21, 2017, 4:37am

I have reworked my code to use the native cufft function calls thusly:

cudaMalloc((void**)&FFTout, sizeof(cufftComplex)*(BINSIZE/2 +1));        
cudaMalloc((void**)&FFTin, sizeof(cufftReal)*(BINSIZE));        

cufftPlan1d(&planFFT,BINSIZE, CUFFT_R2C, 1);

cudaMemcpy(FFTin, data, BINSIZE*sizeof(cufftReal), cudaMemcpyHostToDevice);
cufftExecR2C(planFFT, FFTin, FFTout);				
cudaMemcpy(&output, FFTout, (BINSIZE/2+1)*sizeof(cufftComplex), cudaMemcpyDeviceToHost);

I am now observing 3hz total speed.
My best guess is that it’s due to the extra memory read/writes, due to the way that GPUs usually work.

However since the TX2 is sharing memory, is there a way to speed it up here without the extra cudaMemcpy?

For reference, my FFT size is 512, looped across many sets.

AastaLLL · September 22, 2017, 6:21am

Hi,

Have you checked our cuFFT sample?

Complex *h_signal = (Complex *)malloc(sizeof(Complex) * SIGNAL_SIZE);

You can find this example at ‘/usr/local/cuda-8.0/samples/7_CUDALibraries/simpleCUFFT/’.
Thanks.

xtracrispy · September 23, 2017, 1:18am

I tracked down the issue to a combination of:

small FFT size which doesn’t parallelize that well on cuFFT
initial approach of looping a 1D fft plan.

I got some performance gains by:

Setting cuFFT to a batch mode, which reduced some initialization overheads.
allocating the host-side memory using cudaMallocHost, which pegs the CPU-side memory and sped up transfers to GPU device space.

As a result, I’m now up to 20hz using cuFFT, versus 30hz using CPU-based FFTW.
Still seeking methods of speeding things up.

AastaLLL · September 25, 2017, 7:25am

Hi,

To give a further suggestion, do you mind to share the source of both GPU/CPU-based FFTW?
Thanks.

xtracrispy · September 25, 2017, 4:56pm

Sizes:

	int FFTSIZE = 512;
	int BATCHSIZE = 4096;
	int FFTSIZE_OUT = (FFTSIZE/2 + 1);	//real-to-complex fft

CUDA:
		
	//host side to hold R2C results
	cudaMallocHost((void**)&FFT_data_host, FFTSIZE_OUT * BATCHSIZE * sizeof(Complex));	

	//device side
	cudaMalloc((void**)&cudaFFTin, sizeof(cufftReal) * FFTSIZE * BATCHSIZE);			
	cudaMalloc((void**)&cudaFFTout, sizeof(cufftComplex) * FFTSIZE_OUT * BATCHSIZE);
	
	//plan
	cufftHandle planFFT
	cufftPlan1d(&planFFT,FFTSIZE, CUFFT_R2C, BATCHSIZE);
	
	    ... 
	    copy data into FFT_data_host
	    ...
	 		
	//copy to device
	cudaMemcpy(cudaFFTin, &FFT_data_host[0], FFTSIZE * BATCHSIZE * sizeof(float), cudaMemcpyHostToDevice);
		
	//execute
	cufftExecR2C(planFFT, cudaFFTin, cudaFFTout);
		
	//wait for sync
	if (cudaDeviceSynchronize() != cudaSuccess){cout << "CUDA: RangeFFT failed to synchronize" << endl;}
	
	//copy back to host 
	cudaMemcpy(&FFT_data_host[0], cudaFFTout, BATCHSIZE * FFTSIZE_OUT * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

FFTW:


    //use the planmany interface to batch jobs together    
    int rank = 1;
    const int n = FFTSIZE;
    int howmany = BATCHSIZE;
    int idist = FFTSIZE; 
    int odist = FFTSIZE/2+1; 
    int istride = 1; 
    int ostride = 1; 	
    const int *inembed = NULL;
    const int *onembed = NULL;
    
    FFTW_plan = fftwf_plan_many_dft_r2c(rank, &n, howmany, FFTW_data_in, inembed, istride, idist, reinterpret_cast<fftwf_complex*>(&FFTW_data_out[0]), onembed, ostride, odist, FFTW_MEASURE);
	
	...
	copy data into FFTW_data_in
	...
	
    fftwf_execute_dft_r2c(fft_planRange,  reinterpret_cast<float*>(&FFTW_data_in[0]), reinterpret_cast<fftwf_complex*>(&FFTW_data_out[0]));

xtracrispy · September 25, 2017, 5:07pm

Is it possible to leverage CUDA unified memory to remove the need for those cudamemCopies?

Let’s say I initialize the buffers using cudaMallocManaged(); under the hood would it still be the same speed, or is my CPU/GPU able to share the same physical ram without transfer overheads?

Thanks

AastaLLL · September 26, 2017, 1:37am

For sure. You can get more information here:
[url]Programming Guide :: CUDA Toolkit Documentation

Topic		Replies	Views
FFT Speed vs. x86 CUDA Programming and Performance	14	25050	July 27, 2008
Multiple batches of 1D FFT using cuFFT GPU-Accelerated Libraries	10	5316	October 29, 2019
Poor CUFFT Performance? Am I doing something wrong? CUDA Programming and Performance	15	15663	May 4, 2010
cuFFT Timing Jetson TX2	14	2558	October 18, 2021
my speedy FFT 3x faster than CUFFT CUDA Programming and Performance	139	241838	November 16, 2011
optimizing FFT calculation? CUDA Programming and Performance	8	6601	May 26, 2008
cufft doubt comparing r2c and c2c 2D FFTs CUDA Programming and Performance	28	13735	October 27, 2010
cufft1D and MemcpyAsync = gain? optimization of Host/Device transfert CUDA Programming and Performance	2	799	March 13, 2011
Bandwidth: A big problem when using cuFFT cudamemcpy between device and Host maybe bottleneck,etc CUDA Programming and Performance	4	2749	December 23, 2008
CUFFT DP performance inconsistent CUDA Programming and Performance	3	1379	August 14, 2009

CUDA enabled cuFFT slower than general purpose FFTW

Related topics