About profiling (nvprof) output of cuFFT

I’m developing on NVIDIA’s Xavier. I plan to implement an FFT using CUDA, profile it, and check the performance with the NVIDIA Visual Profiler.
So I have two questions.

1. About the FFT kernels in the nvprof result

LEN_X: 256
LEN_Y: 64
I have 256x64 complex data like this, and I use a 2D cuFFT to transform it. When I then look at the profile, two FFT kernels appear: regular_fft(…) and vector_fft(…).
Since this is a two-dimensional transform, does that mean the FFT was performed separately in the X and Y directions?

2. About the kernel template arguments

void regular_fft<unsigned int=64, unsigned int=1, unsigned int=8, unsigned int=8, unsigned int=0, unsigned int=0, unsigned int=2, unsigned int=1, unsigned int=1, unsigned int=0, unsigned int, double>(kernel_arguments_t)

What is the meaning of the arguments shown for regular_fft above?

Hi,

Do you use cuFFT?

If not, we recommend doing so, since it is a CUDA-accelerated library built specifically for FFT problems.

If cuFFT is not an option, could you share which library you use so we can give a further suggestion?
(We cannot find the corresponding regular_fft and vector_fft kernels in our cuFFT documentation.)

Thanks.

Thank you for your response.
Yes! I’m using cuFFT.

I will paste part of the source code and the result of profiling it with nvprof.
Please confirm.
The 256x64 input data is read from Excel (omitted) and then transformed with cuFFT.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#include <cuda_runtime.h>
#include <cufft.h>
#include <cufftXt.h>
//#include <helper_cuda.h>
//#include <helper_functions.h>

#define LEN_X 256
#define LEN_Y 64
#define BATCH 192

static float csv_buf[LEN_Y][LEN_X];

int main()
{
    int n[2] = {LEN_Y, LEN_X};
    int stat;

    int x;
    int y;
    int pos;

    cufftComplex *h_in_data;
    cufftComplex *h_in_data_batch;
    cufftComplex *h_out_data;
    cufftComplex *d_in_data;
    cufftComplex *d_out_data;

    // for timer
    float elapsed_time_ms = 0.0f;

    // define events
    cudaEvent_t start, stop;

    // create events
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    h_in_data       = (cufftComplex *)malloc(sizeof(cufftComplex) * LEN_X * LEN_Y);
    h_in_data_batch = (cufftComplex *)malloc(sizeof(cufftComplex) * LEN_X * LEN_Y * BATCH); // holds BATCH copies of the input
    h_out_data      = (cufftComplex *)malloc(sizeof(cufftComplex) * LEN_X * LEN_Y * BATCH);

    cudaMalloc(&d_in_data,  sizeof(cufftComplex) * LEN_X * LEN_Y * BATCH);
    cudaMalloc(&d_out_data, sizeof(cufftComplex) * LEN_X * LEN_Y * BATCH);

    cudaMemset(d_in_data,  0, sizeof(cufftComplex) * LEN_X * LEN_Y * BATCH);
    cudaMemset(d_out_data, 0, sizeof(cufftComplex) * LEN_X * LEN_Y * BATCH);

    // (reading the 256x64 input from Excel into h_in_data is omitted here)

    // replicate the 256x64 input BATCH times
    for (int i = 0; i < BATCH; i++)
    {
        memcpy(&h_in_data_batch[i * LEN_X * LEN_Y], h_in_data, sizeof(cufftComplex) * LEN_X * LEN_Y);
    }

    /* copy host to device */
    cudaMemcpy(d_in_data, h_in_data_batch, sizeof(cufftComplex) * LEN_X * LEN_Y * BATCH, cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlanMany(&plan, 2, n, NULL, 1, 1, NULL, 1, 1, CUFFT_C2C, BATCH);

    // timer start
    cudaEventRecord(start, 0);

    cufftExecC2C(plan, d_in_data, d_out_data, CUFFT_FORWARD);

    // record stop event
    cudaEventRecord(stop, 0);
    // sync event
    cudaEventSynchronize(stop);
    // measure time
    cudaEventElapsedTime(&elapsed_time_ms, start, stop);

    // destroy events
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    // copy device to host
    cudaMemcpy(h_out_data, d_out_data, sizeof(cufftComplex) * LEN_X * LEN_Y * BATCH, cudaMemcpyDeviceToHost);

    cufftDestroy(plan);

    cudaFree(d_in_data);
    cudaFree(d_out_data);

    // free host buffers
    free(h_in_data);
    free(h_in_data_batch);
    free(h_out_data);

    return 0;
}

/* EOF */
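
One note on the code above: none of the CUDA or cuFFT return values are checked. As a minimal sketch (assuming only the standard cufftResult and cudaError_t return codes; the error-handling style is just an illustration), the plan creation and execution could be checked like this:

cufftResult r;
cudaError_t err;

r = cufftPlanMany(&plan, 2, n, NULL, 1, 1, NULL, 1, 1, CUFFT_C2C, BATCH);
if (r != CUFFT_SUCCESS) {
    fprintf(stderr, "cufftPlanMany failed: %d\n", (int)r);
    return 1;
}

r = cufftExecC2C(plan, d_in_data, d_out_data, CUFFT_FORWARD);
if (r != CUFFT_SUCCESS) {
    fprintf(stderr, "cufftExecC2C failed: %d\n", (int)r);
    return 1;
}

err = cudaGetLastError();  // catches errors from the kernel launches issued inside cuFFT
if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
}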

==10779== Profiling result:
            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   54.85%  5.1925ms      1  5.1925ms  5.1925ms  5.1925ms  [CUDA memcpy DtoH]
                   29.27%  2.7707ms      1  2.7707ms  2.7707ms  2.7707ms  [CUDA memcpy HtoD]
                    6.84%  647.82us      2  323.91us  323.19us  324.63us  [CUDA memset]
                    4.67%  442.34us      1  442.34us  442.34us  442.34us  void regular_fft<unsigned int=64, unsigned int=8, unsigned int=32, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, float>(kernel_arguments_t)
                    4.37%  413.79us      1  413.79us  413.79us  413.79us  void vector_fft<unsigned int=256, unsigned int=16, unsigned int=1, padding_t=6, twiddle_t=0, loadstore_modifier_t=2, layout_t=0, unsigned int, float>(kernel_arguments_t)
      API calls:   84.46%  1.75569s      4  438.92ms  377.88us  1.75433s  cudaFree
                   14.71%  305.82ms      2  152.91ms  3.9040us  305.82ms  cudaEventCreate
                    0.41%  8.5814ms      2  4.2907ms  2.9986ms  5.5827ms  cudaMemcpy
                    0.33%  6.7578ms      3  2.2526ms  704.55us  5.2787ms  cudaMalloc
                    0.05%  995.99us      1  995.99us  995.99us  995.99us  cudaEventSynchronize
                    0.02%  313.04us    191  1.6380us     544ns  61.827us  cuDeviceGetAttribute
                    0.01%  200.27us      2  100.13us  65.795us  134.47us  cudaMemset
                    0.01%  144.01us      2  72.004us  41.795us  102.21us  cudaLaunchKernel
                    0.00%  78.468us      1  78.468us  78.468us  78.468us  cudaGetDeviceProperties
                    0.00%  63.075us      2  31.537us  17.185us  45.890us  cudaEventRecord
                    0.00%  48.706us      7  6.9580us  1.9840us  20.417us  cudaGetDevice
                    0.00%  34.306us      2  17.153us  9.1530us  25.153us  cuDeviceTotalMem
                    0.00%  9.0240us      2  4.5120us  2.6560us  6.3680us  cudaEventDestroy
                    0.00%  7.9040us      4  1.9760us  1.2160us  3.2640us  cuDeviceGetCount
                    0.00%  7.1370us      1  7.1370us  7.1370us  7.1370us  cudaEventElapsedTime
                    0.00%  6.9450us      1  6.9450us  6.9450us  6.9450us  cuInit
                    0.00%  4.3200us      3  1.4400us  1.0560us  2.0800us  cuDeviceGet
                    0.00%  4.0650us      2  2.0320us  1.8560us  2.2090us  cuDeviceGetName
                    0.00%  2.5280us      1  2.5280us  2.5280us  2.5280us  cuDriverGetVersion
                    0.00%  2.2080us      2  1.1040us     800ns  1.4080us  cudaGetErrorString
                    0.00%  2.0800us      2  1.0400us     832ns  1.2480us  cuDeviceGetUuid

I’m sorry it’s hard to read.

void regular_fft<unsigned int=64, unsigned int=8, unsigned int=32, padding_t=1, twiddle_t=0, loadstore_modifier_t=2, layout_t=1, unsigned int, float>(kernel_arguments_t)
and
void vector_fft<unsigned int=256, unsigned int=16, unsigned int=1, padding_t=6, twiddle_t=0, loadstore_modifier_t=2, layout_t=0, unsigned int, float>(kernel_arguments_t)

What do the kernel names above mean?
For example, what do the values padding_t=6 and twiddle_t=0 represent?

Regards, Masato

Hi,

They are both part of the backend implementation of cufftExecC2C.
You can add their execution times together to get the total time of cufftExecC2C.
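For example, in the profile above, the two kernels account for 442.34us + 413.79us ≈ 856.13us of GPU time, which is the total kernel time for the single cufftExecC2C call.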

Thanks.

Thank you for your response.

But I still don’t quite understand.

Can you explain a little more concretely?
What is “backend implementation”?

Hi,

Sorry, but we cannot disclose too much about our internal implementation.
In general, please add the two kernel times together to represent cufftExecC2C.
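If you want to see the individual launches rather than the summary, running the application under nvprof with --print-gpu-trace lists every kernel launch (including the regular_fft and vector_fft launches) in issue order, which makes it easier to correlate them with each cufftExecC2C call.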

Thanks.