cuFFT Timing

Hello,

we are new to the Nvidia Tx2 platform and want to evaluate the cuFFT Performance.

We modified the simpleCUFFT example and measure the timing as follows.

#define FFT_LENGTH      512
#define NR_OF_FFT	98304

void runTest(int argc, char **argv)
{
    float elapsedTimeInMs = 0.0f;
    StopWatchInterface *timer = NULL;
    sdkCreateTimer(&timer);

	
    printf("[simpleCUFFT] is starting...\n");

    findCudaDevice(argc, (const char **)argv);

    // Allocate host memory for the signal
    int memSizeIn = FFT_LENGTH * NR_OF_FFT * sizeof(float);
    int memSizeOut = FFT_LENGTH * NR_OF_FFT  * sizeof(Complex);
    float *pDataIn = (float *)malloc(memSizeIn);
    Complex *pDataOut = (Complex *)malloc(memSizeOut);
    memset(pDataOut,0,memSizeOut);

#if 0
    FILE *pFileRead;
    pFileRead = fopen("rawData.bin","rb");
    if(pFileRead == NULL){
	printf("no valid file");
	return;
    }
    fread(pDataIn,sizeof(float),FFT_LENGTH * NR_OF_FFT,pFileRead);
    fclose(pFileRead);
#endif

    // Allocate device memory for signal
    float *pCudaFftIn;
    Complex *pCudaFftOut;
    checkCudaErrors(cudaMalloc((void **)&pCudaFftIn, memSizeIn));
    checkCudaErrors(cudaMalloc((void **)&pCudaFftOut, memSizeOut));


    // CUFFT plan simple API
    cufftHandle plan;
    checkCudaErrors(cufftPlan1d(&plan, FFT_LENGTH, CUFFT_R2C, NR_OF_FFT));

    // Copy host memory to device
    sdkResetTimer(&timer);
    sdkStartTimer(&timer);
    checkCudaErrors(cudaMemcpy(pCudaFftIn, pDataIn, memSizeIn,cudaMemcpyHostToDevice));
    sdkStopTimer(&timer);
    elapsedTimeInMs = sdkGetTimerValue(&timer);
    printf("copy data from CPU to GPU: %f\n", elapsedTimeInMs);
	
    while(1){
	usleep(60000);
	
	
	// calculate fft 
	sdkResetTimer(&timer);
	sdkStartTimer(&timer);
	checkCudaErrors(cufftExecR2C(plan, (cufftReal *)pCudaFftIn, (cufftComplex *)pCudaFftOut));
	checkCudaErrors(cudaDeviceSynchronize());
	sdkStopTimer(&timer);	
	elapsedTimeInMs = sdkGetTimerValue(&timer);
	printf("execute fft: %f\n", elapsedTimeInMs);	
		
    }
	

    // Copy device memory to host
    sdkResetTimer(&timer);
    sdkStartTimer(&timer);
    checkCudaErrors(cudaMemcpy(pDataOut, pCudaFftOut, memSizeOut,cudaMemcpyDeviceToHost));
    sdkStopTimer(&timer);
    elapsedTimeInMs = sdkGetTimerValue(&timer);
    printf("copy data from GPU to CPU: %f\n", elapsedTimeInMs);

	
    //Destroy CUFFT context
    checkCudaErrors(cufftDestroy(plan));

#if 0
    FILE *pFile;
    pFile = fopen("FftOut.bin","wb"); 
    fwrite(pDataOut,sizeof(Complex),(FFT_LENGTH/2+1) * NR_OF_FFT,pFile);
    fclose(pFile);
#endif
	
    // cleanup memory
    free(pDataIn);
    free(pDataOut);
    checkCudaErrors(cudaFree(pCudaFftIn));
    checkCudaErrors(cudaFree(pCudaFftOut));
}

The fft (512length, 98304 times) execution time is:
execute fft: 129.251999
execute fft: 117.674004
execute fft: 117.697998
execute fft: 117.774002
execute fft: 83.541000
execute fft: 27.283001
execute fft: 27.288000

Which value represent the true execution time? Why differs? Are we using the timing SDK in a correct way?

The transfer of 192MByte from Host to Device needs about 80ms. The transfer of 384MByte from Device to Host needs about 200ms. Is that ok?

Are we doing anything wrong with timing or using the fft library?

Best Regards
Philipp

Hi,

  1. Please remember to maximize CPU/GPU clock first:
sudo ./jetson_clocks.sh
  1. FFT time should be the evert around cufftExecR2C.

  2. It may take longer from the device -> host. Memory copy will start when all the FFT jobs are finished.

Thanks.

Thanks for your answer.

To Point 1:
The fft time reduce to 23ms. The transfer of the data is the same.

To Point 2:
The timing of cufftExecR2C needs 0.045ms. But cufftExecR2C only start the job on GPU. We execute “cudaDeviceSynchronize()” after cufftExecR2C and evert the timing around. Because we think that cudaDeviceSynchronize() wait till all jobs on GPU are done. So the timing is the real value of exucte fft, isn’t it?

To Point 3:
After cudaDeviceSynchronize() all FFT jobs are done. Memory copy can start immediately.

Are my ideas correct?

Hi,

  1. Job execution time will decrease when CPU/GPU clock goes higher.
    Memory copy remains the same since the memory bandwidth is fixed.

  2. Sure. You need to insert cudaDeviceSynchronize before stopping the timer or only launch time is calculated.

  3. Sorry for the wrong information. Longer time should come from the more extensive memory buffer

Thanks

Hello,

Can we figure out, why the execution time differs? Have the operating sytem a big influence to the GPU?

Use cufftPlan1d() the maximum performance of GPU? How can we check the usage of the GPU? The Nvidia Visual Profiler is for beginners very complex.

Thanks

Hi all,

first of all thanks for the interesting post.

My company is also thinking about using the Jetson TX2 platform. We also need to calculate many FFT’s.

My question is:

Is it possible to calculate the FFT’s faster or is 23 ms really the optimum limit which can be reached for calculating a 512 point FFT 98304 times (real to complex).

Best Regards
Dirk

Hi all,

first of all thanks for the interesting post.

My company is also thinking about using the Jetson TX2 platform. We also need to calculate many FFT’s.

My question is:

Is it possible to calculate the FFT’s faster or is 23 ms really the optimum limit which can be reached for calculating a 512 point FFT 98304 times (real to complex).

Best Regards
Dirk

Hi philipp.goetz,

The script fixes the GPU clock to the maximal.
So it will be faster than the default dynamic clock rate.

Thanks

Hi dirk.gehring,

We are talking a customized FFT code here.
It’s recommended to test our standard CUDA sample for evaluation.

Our cuFFT sample can be found here:
/usr/local/cuda-8.0/samples/7_CUDALibraries/simpleCUFFT

Thanks

AastaLLL: “The script fixes the GPU clock to the maximal.”

Do you mean the jetson_clocks.sh script, which we talked above?

sudo ./jetson_clocks.sh

You may refer to below links:


http://docs.nvidia.com/cuda/cufft/index.html

Thanks a lot for your answer AastaLLL.

I played a little bit with FFT example.

My usecase is to calculate two FFT. The output from the first FFT is the input for the second FFT.

The fft1 (512length) is executed 98304 times.
The fft2 (1024length) is executed 24576 times.

Before calculating fft2 i have to reorder the data. I want to use the advanced data layout provided by cufftPlanMany.

Here is the timing, that look really good to me:

execute fft1 : 28.778000
execute fft2 : 12.420000

But,as soon as i change the input handling of fft2 i get this timing:
execute fft1 : 27.136999
execute fft2 : 686.557007

fft2 increased by a factor of 55.

It seems i’m doing something wrong. Attached you find the code i use. The good timing behaviour can be enabled by replacing #if 0 by #if 1.

Can someone please let me know what i’m doing wrong.

Many thanks and have a nice weekend.

Dirk

/*
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

/*
 * Copyright 1993-2014 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

/* Example showing the use of CUFFT for fast 1D-convolution using FFT. */

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

// includes, project
#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>
#include <helper_functions.h>
#include <helper_cuda.h>
#include <unistd.h>

typedef float2 Complex;

//#define ZERO_COPY	

////////////////////////////////////////////////////////////////////////////////
// declaration, forward
void runTest(int argc, char **argv);

// The filter size is assumed to be a number smaller than the signal size


////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
  runTest(argc, argv);
}
////////////////////////////////////////////////////////////////////////////////
//! Run a simple test for CUDA
////////////////////////////////////////////////////////////////////////////////
void runTest(int argc, char **argv)
{
  /* declarations */
  float elapsedTimeInMs;
  StopWatchInterface *timer;
  int memSizeInFFT1;
  int memSizeOutFFT1;
  int memSizeOutFFT2;
  float *pDataInFFT1;
  Complex *pDataOutFFT1;
  Complex *pDataOutFFT2;
  FILE *pFileRead;
  float *pFileData;
  cufftHandle planFFT1;
  cufftHandle planFFT2;
  float *pCudaFftInFFT1;
  Complex *pCudaFftOutFFT1;
  Complex *pCudaFftOutFFT2;
  /* variables for planMany range */
  int istrideFFT1, ostrideFFT1;
  int idistFFT1, odistFFT1;
  int rankFFT1;
  int rankFFT1Array[3];
  int inembedArrayFFT1[3];
  int onembedArrayFFT1[3];
  /* variables for planMany speed */
  int istrideFFT2, ostrideFFT2;
  int idistFFT2, odistFFT2;
  int rankFFT2;
  int rankFFT2Array[3];
  int inembedArrayFFT2[3];
  int onembedArrayFFT2[3];

  /* initialisations */
  elapsedTimeInMs = 0.0f;
  sdkCreateTimer(&timer);
  memSizeInFFT1 =  512 * (1024*32*3) * sizeof(float);
  memSizeOutFFT1 = 512 * (1024*32*3)   * sizeof(Complex);
  memSizeOutFFT2 = 1024 * (256*32*3)* sizeof(Complex)*2;
  pDataInFFT1 = NULL;
  pDataOutFFT1 = NULL;

  /* configure planeMany for range */
#if 0
  rankFFT1 = 1;
  rankFFT1Array[0] = 512;
  inembedArrayFFT1[0] = 512;
  onembedArrayFFT1[0] = 512;
  istrideFFT1 = 1;
  idistFFT1 = 512;
  ostrideFFT1 = 1;
  odistFFT1 = 512;

  /* configure planeMany for speed */
  rankFFT2 = 1;
  rankFFT2Array[0] = 1024;
  inembedArrayFFT2[0] = 1024;
  onembedArrayFFT2[0] = 1024;
  istrideFFT2 = 1; 
  idistFFT2 = 1024;
  ostrideFFT2 = 1;
  odistFFT2 = 1024;
#else
  rankFFT1 = 1;
  rankFFT1Array[0] = 512;
  inembedArrayFFT1[0] = 512;
  onembedArrayFFT1[0] = 512;
  istrideFFT1 = 1;
  idistFFT1 = 512;
  ostrideFFT1 = 1;
  odistFFT1 = 512;

  /* configure planeMany for speed */
  rankFFT2 = 1;
  rankFFT2Array[0] = 1024;
  inembedArrayFFT2[0] = 1024;
  onembedArrayFFT2[0] = 1024;
  istrideFFT2 = (256*32*3); 
  idistFFT2 = 1;
  ostrideFFT2 = 1;
  odistFFT2 = 1024;

#endif
  /* default call */
  findCudaDevice(argc, (const char **)argv);

  printf("[simpleCUFFT] is starting...\n");

  /* memory allocation */

  printf("memsite FFT1_In:%i\n",memSizeInFFT1);
  printf("memsite FFT1_Out:%i\n",memSizeOutFFT1);
  printf("memsite FFT2_Out:%i\n",memSizeOutFFT2);

  cudaSetDeviceFlags(cudaDeviceMapHost);
  checkCudaErrors(cudaHostAlloc((void **)&pDataInFFT1, memSizeInFFT1, cudaHostAllocMapped));
  checkCudaErrors(cudaHostAlloc((void **)&pDataOutFFT1, memSizeOutFFT1, cudaHostAllocMapped));
  checkCudaErrors(cudaHostAlloc((void **)&pDataOutFFT2, memSizeOutFFT2, cudaHostAllocMapped));
  memset(pDataOutFFT1,0,memSizeOutFFT1);
  memset(pDataOutFFT2,0,memSizeOutFFT2);

  /* read input from file */
  pFileData = (float *)malloc(memSizeInFFT1);
  if(!pFileData){
    printf("memory allocation error\n");
    return;
  }
  pFileRead = fopen("rawData.bin","rb");
  if(pFileRead == NULL){
    printf("no valid file");
    return;
  }
  fread(pFileData,sizeof(float),512* (1024*32*3),pFileRead);
  fclose(pFileRead);

  /* create FFT planFFT1 */
  //checkCudaErrors(cufftPlan1d(&planFFT1, FFT_LENGTH_RANGE, CUFFT_R2C, NR_OF_FFT_RANGE));

  /* range fft */
  checkCudaErrors(cufftPlanMany( &planFFT1, rankFFT1, rankFFT1Array, \
		  &inembedArrayFFT1[0],istrideFFT1, idistFFT1, \
		  &onembedArrayFFT1[0],ostrideFFT1, odistFFT1, \
		  CUFFT_R2C, (1024*32*3)));

  /* speed fft */
  checkCudaErrors(cufftPlanMany( &planFFT2, rankFFT2, rankFFT2Array, \
		  &inembedArrayFFT2[0],istrideFFT2, idistFFT2, \
		  &onembedArrayFFT2[0],ostrideFFT2, odistFFT2, \
		  CUFFT_C2C, (256*32*3)));

  /* convert pointer */
  checkCudaErrors(cudaHostGetDevicePointer((void **)&pCudaFftInFFT1, (void *)pDataInFFT1, 0));
  checkCudaErrors(cudaHostGetDevicePointer((void **)&pCudaFftOutFFT1, (void *)pDataOutFFT1, 0));
  checkCudaErrors(cudaHostGetDevicePointer((void **)&pCudaFftOutFFT2, (void *)pDataOutFFT2, 0));

  for(int i=0;i<10;i++){

    /* time memcpy */
    sdkResetTimer(&timer);
    sdkStartTimer(&timer);
    memcpy(pDataInFFT1,pFileData,memSizeInFFT1);
    sdkStopTimer(&timer);	
    elapsedTimeInMs = sdkGetTimerValue(&timer);
    printf("copy rawdata from CPU RAM to GPU: %f\n", elapsedTimeInMs);	

    /* time first fft */
    sdkResetTimer(&timer);
    sdkStartTimer(&timer);
    checkCudaErrors(cufftExecR2C(planFFT1, (cufftReal *)pCudaFftInFFT1, (cufftComplex *)pCudaFftOutFFT1));
    checkCudaErrors(cudaDeviceSynchronize());
    sdkStopTimer(&timer);	
    elapsedTimeInMs = sdkGetTimerValue(&timer);
    printf("execute fft1 : %f\n", elapsedTimeInMs);	

    /* time second fft */
    sdkResetTimer(&timer);
    sdkStartTimer(&timer);
    checkCudaErrors(  cufftExecC2C(planFFT2, \
		      (cufftComplex *)pCudaFftOutFFT2, \
		      (cufftComplex *)pCudaFftOutFFT2, \
		      CUFFT_FORWARD));
    checkCudaErrors(cudaDeviceSynchronize());
    sdkStopTimer(&timer);	
    elapsedTimeInMs = sdkGetTimerValue(&timer);
    printf("execute fft2 : %f\n", elapsedTimeInMs);	

  }

  /* free planFFT1 memory */
  checkCudaErrors(cufftDestroy(planFFT1));
  checkCudaErrors(cufftDestroy(planFFT2));

  /* write result to file */
  FILE *pFile;
  pFile = fopen("FftOut.bin","wb");
  fwrite(pDataOutFFT1,sizeof(Complex),(512/2+1) * (1024*32*3),pFile);
  fclose(pFile);

  /*  free memory */
  cudaFreeHost(pDataInFFT1);
  cudaFreeHost(pDataOutFFT1);
  cudaFreeHost(pDataOutFFT2);
  free(pFileData);
  //free(pDataInFFT1);
  //free(pDataOutFFT1);

}

Hi,

You can check here for the cufftPlanMany document:
http://docs.nvidia.com/cuda/cufft/index.html#function-cufftmakeplanmany

Thanks.

Hello everybody,

Do I understand correctly that Jetson TX2 calculates 512 point FFT 98304 times in 28.778ms? I need to calculate 8192 point FFT 200000 times per second. It means that Jetson TX2 should be able to calculate it is I right? Thank you for support.

Martin