When you make a CUFFT call, that is a C library function call. Within that function, any number of CUDA activities may take place: kernel launches, CUDA API calls, and so on. To carry out a single FFT, CUFFT may choose to launch multiple kernels, and possibly perform other activity as well.
That’s a general description applicable to most CUDA library calls.
If you’re looking for a precise description of what the purpose of each of these 4 calls is, I doubt you’re going to get that.
Sometimes it's possible to infer something about the activity from the kernel names. For example, when I do a double-precision real-to-complex transform (D2Z) on a data set of size 65536 (256*256), in place, I get the following output:
$ cat t81.cu
#include <cufft.h>
#include <stdio.h>
int main(){
#define NX (256*256)
#define BATCH 1
cufftHandle plan;
cufftDoubleComplex *data;
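/* For an in-place D2Z (real-to-complex) transform of length NX, CUFFT produces NX/2+1 complex outputs, so the buffer is sized accordingly. */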
cudaMalloc((void**)&data, sizeof(cufftDoubleComplex)*(NX/2+1)*BATCH);
if (cudaGetLastError() != cudaSuccess){ fprintf(stderr, "Cuda error: Failed to allocate\n"); return -1; }
if (cufftPlan1d(&plan, NX, CUFFT_D2Z, BATCH) != CUFFT_SUCCESS){ fprintf(stderr, "CUFFT error: Plan creation failed"); return -1; }
/* Use the CUFFT plan to transform the signal in place. */
if (cufftExecD2Z(plan, (cufftDoubleReal*)data, data) != CUFFT_SUCCESS){ fprintf(stderr, "CUFFT error: ExecD2Z Forward failed"); return -1; }
if (cudaDeviceSynchronize() != cudaSuccess){ fprintf(stderr, "Cuda error: Failed to synchronize\n"); return -1; }
cufftDestroy(plan); cudaFree(data);
return 0;
}
$ nvcc -arch=sm_60 -o t81 t81.cu -lcufft
$ nvprof ./t81
==26624== NVPROF is profiling process 26624, command: ./t81
==26624== Profiling application: ./t81
==26624== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 40.19% 12.320us 1 12.320us 12.320us 12.320us void dpRadix0256C::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=2, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>)
29.96% 9.1840us 1 9.1840us 9.1840us 9.1840us void dpRadix0128C::kernel3Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, double>)
15.87% 4.8640us 1 4.8640us 4.8640us 4.8640us __nv_static_45__32_dpRealComplex_compute_70_cpp1_ii_f01f2bc9__ZN13dpRealComplex24postprocessC2C_kernelMemIjdL9fftAxii_t1EEEvP7ComplexIT0_EPKS4_T_15coordDivisors_tIS8_E7coord_tIS8_ESC_S8_S3_10callback_t
13.99% 4.2880us 1 4.2880us 4.2880us 4.2880us __nv_static_45__32_spRealComplex_compute_70_cpp1_ii_1f28721c__ZN13spRealComplex16repackC2C_kernelIjdEEvNS_19spRealComplexC2C_stIT_T0_EE
API calls: 53.64% 475.65ms 3 158.55ms 16.120us 475.32ms cudaFree
45.01% 399.09ms 2 199.54ms 37.262us 399.05ms cudaMalloc
0.86% 7.6176ms 740 10.294us 184ns 471.01us cuDeviceGetAttribute
0.30% 2.6620ms 8 332.75us 164.02us 663.66us cuDeviceTotalMem
0.09% 776.17us 1 776.17us 776.17us 776.17us cudaGetDeviceProperties
0.08% 673.41us 8 84.176us 67.968us 105.31us cuDeviceGetName
0.02% 134.74us 4 33.685us 18.079us 73.410us cudaLaunchKernel
0.01% 58.643us 66 888ns 488ns 14.882us cudaFuncSetCacheConfig
0.00% 16.714us 1 16.714us 16.714us 16.714us cudaDeviceSynchronize
0.00% 12.232us 7 1.7470us 553ns 4.9300us cudaGetDevice
0.00% 10.524us 12 877ns 211ns 2.8290us cuDeviceGet
0.00% 5.2310us 4 1.3070us 418ns 2.6750us cuDeviceGetCount
0.00% 2.5030us 5 500ns 281ns 1.1900us cudaGetLastError
0.00% 2.1100us 4 527ns 178ns 1.4280us cudaPeekAtLastError
0.00% 1.1770us 1 1.1770us 1.1770us 1.1770us cuInit
0.00% 973ns 1 973ns 973ns 973ns cuDriverGetVersion
$
Note that the kernel sequence and signatures are different from yours. These differences may arise due to:
- GPU you are running on
- CUDA version
- Specific transform configuration parameters (e.g. size, precision, use of callbacks, in-place or not, etc.); a small variation sketch follows below
- phase of the moon (just kidding)
It seems in my case there are sub-transforms of different sizes occurring, plus a “postprocess” operation and a “repack” operation.
In your case, it looks to me like you have 3 separate FFT "pieces", plus a "postprocess" operation. That's really just guesswork based on kernel names, however. Sometimes it's easier to "connect the dots" if you provide the source code used to generate the FFT.
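For instance, to see how the configuration items listed above affect kernel selection, here is a minimal variation of my program (a sketch only, not matched to your setup) that switches to a single-precision complex-to-complex (C2C) transform of the same size. Profiling it with nvprof should show a different kernel mix (likely sp... rather than dp... kernels):
#include <cufft.h>
#include <stdio.h>
int main(){
#define NX (256*256)
#define BATCH 1
cufftHandle plan;
cufftComplex *data; /* single-precision complex buffer */
cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*BATCH); /* C2C needs NX complex elements */
if (cudaGetLastError() != cudaSuccess){ fprintf(stderr, "Cuda error: Failed to allocate\n"); return -1; }
if (cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH) != CUFFT_SUCCESS){ fprintf(stderr, "CUFFT error: Plan creation failed"); return -1; }
/* Same length and batch as before, but single-precision C2C, in place. */
if (cufftExecC2C(plan, data, data, CUFFT_FORWARD) != CUFFT_SUCCESS){ fprintf(stderr, "CUFFT error: ExecC2C Forward failed"); return -1; }
if (cudaDeviceSynchronize() != cudaSuccess){ fprintf(stderr, "Cuda error: Failed to synchronize\n"); return -1; }
cufftDestroy(plan); cudaFree(data);
return 0;
}
Compile and profile it the same way as above and compare the kernel list with the D2Z output.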
If you use the nvprof --print-gpu-trace option, you will see these operations in the order in which they occurred, which may also give you some information about what is happening under the hood. You can also demangle the kernel names (you'll have to do that yourself, e.g. with c++filt). When I use --print-gpu-trace, I get output like this:
$ nvprof --print-gpu-trace ./t81
==27715== NVPROF is profiling process 27715, command: ./t81
==27715== Profiling application: ./t81
==27715== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Device Context Stream Name
1.05479s 4.1920us (128 1 1) (256 1 1) 23 0B 0B Tesla P100-PCIE 1 7 __nv_static_45__32_spRealComplex_compute_70_cpp1_ii_1f28721c__ZN13spRealComplex16repackC2C_kernelIjdEEvNS_19spRealComplexC2C_stIT_T0_EE [852]
1.05483s 8.1280us (16 1 1) (8 8 4) 88 18.000KB 0B Tesla P100-PCIE 1 7 void dpRadix0256C::kernel1Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=2, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, double>) [855]
1.05487s 8.8960us (32 1 1) (8 8 2) 80 9.0000KB 0B Tesla P100-PCIE 1 7 void dpRadix0128C::kernel3Mem<unsigned int, double, fftDirection_t=-1, unsigned int=8, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, double>) [858]
1.05490s 4.8640us (64 1 1) (256 1 1) 28 0B 0B Tesla P100-PCIE 1 7 __nv_static_45__32_dpRealComplex_compute_70_cpp1_ii_f01f2bc9__ZN13dpRealComplex24postprocessC2C_kernelMemIjdL9fftAxii_t1EEEvP7ComplexIT0_EPKS4_T_15coordDivisors_tIS8_E7coord_tIS8_ESC_S8_S3_10callback_t [862]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
So in my case, the repack kernel comes first, followed by the two FFT sub-transform kernels, followed by the post-process kernel.
By default, nvprof demangles kernel names; I'm not sure why that is not happening for some of these kernels.
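If you want to attempt the demangling yourself, one option (assuming a toolchain that provides c++filt) is to strip the __nv_static_..._cpp1_ii_... prefix and feed the trailing _ZN... portion to c++filt, e.g.:
$ echo _ZN13spRealComplex16repackC2C_kernelIjdEEvNS_19spRealComplexC2C_stIT_T0_EE | c++filt
which should print something along the lines of void spRealComplex::repackC2C_kernel<unsigned int, double>(spRealComplex::spRealComplexC2C_st<unsigned int, double>), i.e. a repack kernel templated on unsigned int indexing and double data.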