Profiling using cuFFT

Hi all,
I am using cuFFT library to find the FFT in TeslaK80 GPU. I did profiling using nvprof for a cuFFT plan for 1D complex to complex Fourier transform for batch size = 1.

FFT Size | Time Taken (micro sec), Batch = 1
64 | 5.696
128 | 9.152
256 | 13.184
512 | 10.208
1024 | 17.056
2048 | 18.656
4096 | 21.64

As seen in the table, the time taken for the 512-point FFT is less than for a smaller FFT (256 points). Can someone explain the reason for this behaviour?

Measurement or benchmarking practice error is one possibility.


  • Filename -

  • Author - Sajani Shajan

  • @brief Uses the cuFFT library to compute the 1D Fourier transform of complex input. Input and output lengths remain the same.

  • @details A cuFFT plan is created specifying the dimension of the transform, the size of the transform, the type of the transform, the number of batches in which the transform must be performed, and the input and output strides in case of batched processing. The cufftHandle (plan) is then executed, specifying the input and output pointers and the direction of the transform. The input sample is generated within the code.

    #include <cuda.h>
    #include <cuda_runtime.h>
    #include <cufft.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define DATASIZE 4096 // Input size
    #define BATCH 1 // Number of batches of transforms done by a plan

    #define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
    inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
    {
        if (code != cudaSuccess) {
            fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
            if (abort) exit(code);
        }
    }

    int main()
    {
        // Declaring iterators
        int batchIter, dataLengthIter;
        // Declaring parameters to create the plan
        cufftHandle plan;
        // Rank denotes the dimension of the transform
        int rank = 1;
        // Size of the Fourier transform
        int n[] = { DATASIZE };
        // Distance between two successive input/output elements
        int iStride = 1, oStride = 1;
        // Distance between batches
        int iDist = DATASIZE, oDist = DATASIZE;
        // Number of batched executions
        int batch = BATCH;
        // Input/output sizes with pitch (ignored for 1D transforms; NULL selects the basic layout)
        int *inEmbed = NULL;
        int *onEmbed = NULL;

        // Declaring events to record kernel execution time
        cudaEvent_t start, stop;
        float milliseconds = 0;
        gpuErrchk( cudaEventCreate(&start) );
        gpuErrchk( cudaEventCreate(&stop) );

        // Declaring device variables for input and output
        cufftComplex *deviceInputData;
        cufftComplex *deviceOutputData;

        // Declaring and allocating the host variable for input
        cufftComplex *hostInputData = (cufftComplex *)malloc( DATASIZE * BATCH * sizeof( cufftComplex ) );
        // Declaring and allocating the host variable for output
        cufftComplex *hostOutputData = (cufftComplex *)malloc( DATASIZE * BATCH * sizeof( cufftComplex ) );

        // Allocating memory on the device for input and output
        gpuErrchk( cudaMalloc( (void**)&deviceInputData, DATASIZE * BATCH * sizeof( cufftComplex ) ) );
        gpuErrchk( cudaMalloc( (void**)&deviceOutputData, DATASIZE * BATCH * sizeof( cufftComplex ) ) );

        // Creating a plan for a batched 1D complex-to-complex n-point Fourier transform
        cufftPlanMany( &plan, rank, n, inEmbed, iStride, iDist,
                       onEmbed, oStride, oDist, CUFFT_C2C, batch );

        // Loop to generate the real and imaginary parts of the input
        for ( batchIter = 0; batchIter < BATCH; batchIter++ ) {
            for ( dataLengthIter = 0; dataLengthIter < DATASIZE; dataLengthIter++ ) {
                hostInputData[batchIter * DATASIZE + dataLengthIter].x = dataLengthIter + 1;
                hostInputData[batchIter * DATASIZE + dataLengthIter].y = dataLengthIter + 1;
                /* printf( "%f + %f I\n", hostInputData[batchIter * DATASIZE + dataLengthIter].x,
                          hostInputData[batchIter * DATASIZE + dataLengthIter].y ); */
            }
        }

        // Copying input from the host to the device variable
        gpuErrchk( cudaMemcpy( deviceInputData, hostInputData, DATASIZE * BATCH * sizeof( cufftComplex ), cudaMemcpyHostToDevice ) );

        // Executing the plan. The input and output pointers and the direction of the transform are specified
        gpuErrchk( cudaEventRecord(start) );
        cufftExecC2C( plan, deviceInputData, deviceOutputData, CUFFT_FORWARD );
        gpuErrchk( cudaEventRecord(stop) );
        gpuErrchk( cudaEventSynchronize(stop) );

        // Device->Host copy of the results
        gpuErrchk( cudaMemcpy( hostOutputData, deviceOutputData, DATASIZE * BATCH * sizeof( cufftComplex ), cudaMemcpyDeviceToHost ) );

        // Displaying the output
        for ( batchIter = 0; batchIter < BATCH; batchIter++ ) {
            for ( dataLengthIter = 0; dataLengthIter < DATASIZE; dataLengthIter++ ) {
                printf( "%i %i %f %f\n", batchIter, dataLengthIter,
                        hostOutputData[batchIter * DATASIZE + dataLengthIter].x,
                        hostOutputData[batchIter * DATASIZE + dataLengthIter].y );
            }
        }

        gpuErrchk( cudaEventElapsedTime(&milliseconds, start, stop) );
        printf( "Execution Time in milliseconds: %f\n", milliseconds );

        // Destroying the plan
        cufftDestroy( plan );
        // Freeing memory
        gpuErrchk( cudaFree(deviceOutputData) );
        gpuErrchk( cudaFree(deviceInputData) );
        free( hostOutputData );
        free( hostInputData );
        return 0;
    }

This is the code I used.

I used nvprof to do the profiling; I don't understand how that can give the wrong time.

When posting code, please use the code posting button: select your code, then in the edit window's top toolbar, click the </> button.

CUFFT performance can vary with the GPU, the transform size, and the prime factorization of the transform size.

Kepler GPUs in particular may have different performance patterns than newer GPUs in CUFFT.

For instance if you try a transform size of 768 I think you may also see another non-monotonicity in performance.

If you switch to a newer GPU I think you’ll find that even at these very small transform sizes, the performance varies almost monotonically with transform sizes that are a power of 2.

This is largely a function of the specific kernel design used by the CUFFT library. These transform sizes are quite small, so they are not in any way saturating the GPU. If you do large enough transform sizes to hit the memory bandwidth limit of the GPU, I think you will find monotonically increasing time (and approximately linearly increasing) for larger transform sizes that are a power-of-2.

If these non-monotonic differences in performance of very small transform sizes on the oldest generation of GPUs currently supported is a matter of concern, I would suggest filing a bug.

I am observing the same break in monotonicity in execution time when using a Volta V100 GPU as well.
Profiler output for the FFT using the cuFFT library on a Volta V100 GPU:
FFT Size | Batch = 1 (micro sec) | Batch = 10000 (micro sec)
64 | 2.56 | 13.18
128 | 5.68 | 36.415
256 | 7.26 | 63.07
512 | 4.32 | 107.36
1024 | 5.408 | 212.89
2048 | 10.43 | 407.51
4096 | 11.99 | 824.04

I am just wondering how the kernel is designed for a particular FFT. How are the grid and block sizes calculated, and can the user modify them?
It would be of great help if you could direct me to resources that answer these questions, other than the cuFFT guide (I have gone through it several times).

Using good benchmarking practice (batch = 10000) there is no break in monotonicity in your results:

13.18 < 36.415 < 63.07 < 107.36 < 212.89 < 407.51 < 824.04

I wouldn’t be able to address the batch = 1 case. Accurately benchmarking things that take 5 microseconds on the GPU device side is very difficult.

I won’t be able to help with that. This is a closed source library. I don’t have permission to release material non-public details.

You can inspect the actual grid and block size associated with any kernel launch using the profiler. Yes, I acknowledge this doesn’t answer the question “how is it calculated”, but I thought I would point it out anyway. If it’s not useful, please disregard that info and see my previous statement.

The user has no direct control over grid and block sizes for most CUDA libraries (e.g. CUBLAS, CUFFT, NPP, etc.) There may be some indirect effects because the grid and block sizes will vary according to your total problem size.

Sorry, I won’t be able to do that. The only public resource I’m aware of is the CUFFT manual. There may be writeups on the web or microbenchmarking technical papers that have been written about CUFFT specifically, but I don’t have a catalog or list of those. You may possibly discover some with some searching.

I can see that my answers are frustrating you now, so I’ll refrain from answering any further questions. Perhaps someone else in the community will be able to address your questions.

Thanks, Robert.
If someone in this forum has some insight regarding this, please help out.

Are the tensor cores engaged while using the cuFFT library on a Volta V100 GPU? Is there a way to find out whether the tensor cores are used in any application we run?