Understanding profiling (nvprof) output of cuFFT

When I run a single double-precision, 3D (128^3), real-to-complex transform under nvprof, the output is:

Time(%)      Time     Calls       Avg       Min       Max  Name
58.15%  5.9890ms         1  5.9890ms  5.9890ms  5.9890ms __nv_static_45__32_dpRealComplex_compute_70_cpp1_ii_f01f2bc9__ZN13dpRealComplex24postprocessC2C_kernelMemIjdL9fftAxii_t7EEEvP7ComplexIT0_EPKS4_T_15coordDivisors_tIS8_E7coord_tIS8_ESC_S8_S3_10callback_t

41.12%  4.2347ms         2  2.1174ms  78.079us  4.1567ms  void regular_fft<unsigned int=128, unsigned int=1, unsigned int=8, unsigned int=16, unsigned int=1, unsigned int=0, unsigned int=2, unsigned int=1, unsigned int=1, unsigned int=1, unsigned int, double>(kernel_arguments_t<unsigned int>)

0.73%  75.167us         1  75.167us  75.167us  75.167us  void vector_fft<unsigned int=64, unsigned int=1, unsigned int=8, unsigned int=8, unsigned int=0, unsigned int=0, unsigned int=2, unsigned int=1, unsigned int=1, unsigned int=0, unsigned int, double>(kernel_arguments_t<unsigned int>)

I am wondering if anyone knows how to interpret this: why are there two calls to “regular_fft”, and why are there four kernel launches in total (the smaller ones each taking roughly 75us) when I only ran one FFT? (I am trying to scale an application that calls millions of FFTs, and I would like to understand what each of these is.)

When you make a cufft call, that is a C library function call. Within that function, any number of CUDA activities may transpire: kernel launches, CUDA API calls, and so on. In providing a single FFT, CUFFT may choose to perform multiple kernel launches, and possibly other activity as well.

That’s a general description applicable to most CUDA library calls.
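
One way to see this directly is nvprof’s --print-api-trace option, which lists the CUDA runtime API calls (such as cudaLaunchKernel) that the library issues on your behalf for a single cufftExec call. Here ./your_app is just a placeholder for whatever binary you are profiling:

$ nvprof --print-api-trace ./your_app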

If you’re looking for a precise description of the purpose of each of these 4 kernel launches, I doubt you’re going to get one.

Sometimes it’s possible to infer something about the activity from the kernel names. For example, when I do a double-precision real-to-complex transform (D2Z) on a data set of size 65536, in-place, I get the following output:

$ cat t81.cu
#include <cufft.h>
#include <stdio.h>
int main(){

#define NX (256*256)
#define BATCH 1
  cufftHandle plan;
  cufftDoubleComplex *data;
  /* in-place D2Z: a buffer of NX/2+1 complex elements holds both the padded real input and the complex output */
  cudaMalloc((void**)&data, sizeof(cufftDoubleComplex)*(NX/2+1)*BATCH);
  if (cudaGetLastError() != cudaSuccess){ fprintf(stderr, "Cuda error: Failed to allocate\n"); return -1; }
  if (cufftPlan1d(&plan, NX, CUFFT_D2Z, BATCH) != CUFFT_SUCCESS){ fprintf(stderr, "CUFFT error: Plan creation failed\n"); return -1; }
  /* Use the CUFFT plan to transform the signal in place. */
  if (cufftExecD2Z(plan, (cufftDoubleReal*)data, data) != CUFFT_SUCCESS){ fprintf(stderr, "CUFFT error: ExecD2Z failed\n"); return -1; }
  if (cudaDeviceSynchronize() != cudaSuccess){ fprintf(stderr, "Cuda error: Failed to synchronize\n"); return -1; }
  cufftDestroy(plan); cudaFree(data);
  return 0;
}
$ nvcc -arch=sm_60 -o t81 t81.cu -lcufft
$ nvprof ./t81
==26624== NVPROF is profiling process 26624, command: ./t81
==26624== Profiling application: ./t81
==26624== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   40.19%  12.320us         1  12.320us  12.320us  12.320us  void dpRadix0256C::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=2, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>)
                   29.96%  9.1840us         1  9.1840us  9.1840us  9.1840us  void dpRadix0128C::kernel3Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, double>)
                   15.87%  4.8640us         1  4.8640us  4.8640us  4.8640us  __nv_static_45__32_dpRealComplex_compute_70_cpp1_ii_f01f2bc9__ZN13dpRealComplex24postprocessC2C_kernelMemIjdL9fftAxii_t1EEEvP7ComplexIT0_EPKS4_T_15coordDivisors_tIS8_E7coord_tIS8_ESC_S8_S3_10callback_t
                   13.99%  4.2880us         1  4.2880us  4.2880us  4.2880us  __nv_static_45__32_spRealComplex_compute_70_cpp1_ii_1f28721c__ZN13spRealComplex16repackC2C_kernelIjdEEvNS_19spRealComplexC2C_stIT_T0_EE
      API calls:   53.64%  475.65ms         3  158.55ms  16.120us  475.32ms  cudaFree
                   45.01%  399.09ms         2  199.54ms  37.262us  399.05ms  cudaMalloc
                    0.86%  7.6176ms       740  10.294us     184ns  471.01us  cuDeviceGetAttribute
                    0.30%  2.6620ms         8  332.75us  164.02us  663.66us  cuDeviceTotalMem
                    0.09%  776.17us         1  776.17us  776.17us  776.17us  cudaGetDeviceProperties
                    0.08%  673.41us         8  84.176us  67.968us  105.31us  cuDeviceGetName
                    0.02%  134.74us         4  33.685us  18.079us  73.410us  cudaLaunchKernel
                    0.01%  58.643us        66     888ns     488ns  14.882us  cudaFuncSetCacheConfig
                    0.00%  16.714us         1  16.714us  16.714us  16.714us  cudaDeviceSynchronize
                    0.00%  12.232us         7  1.7470us     553ns  4.9300us  cudaGetDevice
                    0.00%  10.524us        12     877ns     211ns  2.8290us  cuDeviceGet
                    0.00%  5.2310us         4  1.3070us     418ns  2.6750us  cuDeviceGetCount
                    0.00%  2.5030us         5     500ns     281ns  1.1900us  cudaGetLastError
                    0.00%  2.1100us         4     527ns     178ns  1.4280us  cudaPeekAtLastError
                    0.00%  1.1770us         1  1.1770us  1.1770us  1.1770us  cuInit
                    0.00%     973ns         1     973ns     973ns     973ns  cuDriverGetVersion
$

Note that the kernel sequence and signatures are different from yours. These differences may arise from:

  1. GPU you are running on
  2. CUDA version
  3. Specific transform config parameters (e.g. size, use of callbacks, in-place or not, etc.; see the illustration after this list)
  4. phase of the moon (just kidding)
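
As an illustration of item 3 (the sizes here are hypothetical, chosen only for the example; the exact kernels you see will depend on your CUDA version and GPU), simply changing NX in the code above is usually enough to change the kernel sequence:

// hypothetical variations on t81.cu above; each typically produces a
// different set of cuFFT kernels under nvprof
#define NX (256*256)   // power of two: radix-style kernels, as shown above
// #define NX (100000) // 2^5 * 5^5: a mixed-radix kernel sequence
// #define NX (65537)  // large prime: cuFFT falls back to a Bluestein-style algorithm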

It seems that in my case there are sub-transforms of different sizes occurring (note the dpRadix0256C and dpRadix0128C namespaces), plus a “postprocess” operation and a “repack” operation.

In your case, it looks to me like you have 3 separate FFT “pieces”, plus a “postprocess” operation. That’s really just guesswork based on kernel names, however. Sometimes it’s easier to “connect the dots” if you provide the source code used to generate the FFT.
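
For reference, here is a minimal sketch of what your case might look like. This is an assumption on my part (a single in-place 128^3 D2Z transform created with cufftPlan3d); your actual code may well differ, e.g. if you use cufftPlanMany, batching, or callbacks:

#include <cufft.h>
#include <stdio.h>
int main(){
  const int N = 128;  // 128^3 transform, matching the question
  cufftHandle plan;
  cufftDoubleComplex *data;
  // a 3D D2Z transform produces N*N*(N/2+1) complex outputs; for an
  // in-place transform the padded real input lives in the same buffer
  cudaMalloc((void**)&data, sizeof(cufftDoubleComplex)*N*N*(N/2+1));
  if (cufftPlan3d(&plan, N, N, N, CUFFT_D2Z) != CUFFT_SUCCESS){ fprintf(stderr, "CUFFT error: Plan creation failed\n"); return -1; }
  if (cufftExecD2Z(plan, (cufftDoubleReal*)data, data) != CUFFT_SUCCESS){ fprintf(stderr, "CUFFT error: ExecD2Z failed\n"); return -1; }
  cudaDeviceSynchronize();
  cufftDestroy(plan); cudaFree(data);
  return 0;
}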

If you use the nvprof --print-gpu-trace option, you will see these operations in the order in which they occurred, which may also give you some information about what is happening under the hood. You can also demangle the kernel names yourself (e.g. with c++filt; see the example at the end). When I use --print-gpu-trace, I get output like this:

$ nvprof --print-gpu-trace ./t81
==27715== NVPROF is profiling process 27715, command: ./t81
==27715== Profiling application: ./t81
==27715== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*           Device   Context    Stream  Name
1.05479s  4.1920us            (128 1 1)       (256 1 1)        23        0B        0B  Tesla P100-PCIE         1         7  __nv_static_45__32_spRealComplex_compute_70_cpp1_ii_1f28721c__ZN13spRealComplex16repackC2C_kernelIjdEEvNS_19spRealComplexC2C_stIT_T0_EE [852]
1.05483s  8.1280us             (16 1 1)         (8 8 4)        88  18.000KB        0B  Tesla P100-PCIE         1         7  void dpRadix0256C::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=2, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [855]
1.05487s  8.8960us             (32 1 1)         (8 8 2)        80  9.0000KB        0B  Tesla P100-PCIE         1         7  void dpRadix0128C::kernel3Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, double>) [858]
1.05490s  4.8640us             (64 1 1)       (256 1 1)        28        0B        0B  Tesla P100-PCIE         1         7  __nv_static_45__32_dpRealComplex_compute_70_cpp1_ii_f01f2bc9__ZN13dpRealComplex24postprocessC2C_kernelMemIjdL9fftAxii_t1EEEvP7ComplexIT0_EPKS4_T_15coordDivisors_tIS8_E7coord_tIS8_ESC_S8_S3_10callback_t [862]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

So in my case, the repack kernel comes first, followed by the two FFT kernels, followed by the post-process kernel.

By default, nvprof demangles kernel names; I’m not sure why that is not happening for some of these kernels.
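
For example, stripping the __nv_static_45__32_dpRealComplex_compute_70_cpp1_ii_f01f2bc9 prefix from the postprocess kernel above and feeding the remaining _ZN13dpRealComplex... portion to c++filt should recover a readable signature, something like:

$ echo _ZN13dpRealComplex24postprocessC2C_kernelMemIjdL9fftAxii_t1EEEvP7ComplexIT0_EPKS4_T_15coordDivisors_tIS8_E7coord_tIS8_ESC_S8_S3_10callback_t | c++filt
void dpRealComplex::postprocessC2C_kernelMem<unsigned int, double, (fftAxii_t)1>(Complex<double>*, Complex<double> const*, unsigned int, coordDivisors_t<unsigned int>, coord_t<unsigned int>, coord_t<unsigned int>, unsigned int, double, callback_t)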