CUFFT performance not good: how to correctly find the execution time on CPU and GPU

gpuguy · May 3, 2010, 9:59am

Hello

I wrote the following code using the CUFFT library to compute the FFT of 256 numbers in 10 batches. I have the following questions:

1- Am I calculating the elapsed time for cufftExecC2C correctly? (Basically, I am using CUDA events.)

2- To calculate the execution time on the CPU, I am simply running the program in emulation mode (-deviceemu). With this timing method, I see no benefit from performing the FFT on the GPU. I checked my results by varying the number of batches from 10 to 6000, and the benefit I observe is almost negligible. Am I doing something wrong in the timing?

Thanks in advance.

#include <stdio.h>
#include <math.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cufft.h>

#define NX      256
#define BATCH   10

int main()
{
    cufftHandle plan;
    cufftComplex *devPtr;
    cufftComplex data[NX*BATCH];
    int i;

    /* source data creation */
    for (i = 0; i < NX*BATCH; i++) {
        data[i].x = 1.0f;
        data[i].y = 1.0f;
    }

    cudaEvent_t start, stop;
    float time;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* GPU memory allocation */
    cudaMalloc((void**)&devPtr, sizeof(cufftComplex)*NX*BATCH);

    /* transfer to GPU memory */
    cudaMemcpy(devPtr, data, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyHostToDevice);

    /* creates 1D FFT plan */
    cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);

    /* Timing Calculations */
    cudaEventRecord(start, 0);

    /* executes FFT processes */
    cufftExecC2C(plan, devPtr, devPtr, CUFFT_FORWARD);
    cudaThreadSynchronize();

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("Processing time=%f(ms)\n", elapsedTime);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    /* transfer results from GPU memory */
    cudaMemcpy(data, devPtr, sizeof(cufftComplex)*NX*BATCH, cudaMemcpyDeviceToHost);

    /* deletes CUFFT plan */
    cufftDestroy(plan);

    /* frees GPU memory */
    cudaFree(devPtr);

    /*for(i = 0; i < NX*BATCH; i++){
        printf("data[%d] %f %f\n", i, data[i].x, data[i].y);
    }*/

    return 0;
}

mborgerd · May 4, 2010, 7:12pm

10 x 256-point FFTs is a very small amount of data. I think that to get impressive performance you will need to increase your test size by two or three orders of magnitude. That will require you to get the data off the stack and onto the heap. Oh, and you may want to use pinned memory.
