I’m trying to figure out the computational load of a program that computes speech processing features. For the moment the program takes an audio file and computes the power spectrum using an FFT (with a configurable number of points).
This is the sequence of commands I use:
1-Init the FFT plan (just once at the beginning)
2-cudaMalloc memsize = sizeof(float)*fftSize bytes on the GPU
3-cudaMalloc sizeof(cufftComplex)*fftSize bytes on the GPU
4-for each frame:
4.a- copy memsize data from CPU to GPU
4.d- cufftExecR2C from float* to cufftComplex* on the GPU
4.e- run a kernel that computes power spec from cufftComplex* to float* on the GPU
4.f- copy memory back from the GPU
5-free all memory allocations and destroy plan
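For reference, the sequence above can be sketched roughly as follows. This is a minimal sketch, not my exact code: the function name powerSpectrumFrames, the block size of 256 threads, writing the power values back into the float input buffer, and copying back only fftSize/2+1 bins are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Step 4.e: power of each complex bin, |X[k]|^2 = re^2 + im^2.
__global__ void cu_powerSpec(const cufftComplex* spec, float* power, int numBins)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < numBins) {
        power[k] = spec[k].x * spec[k].x + spec[k].y * spec[k].y;
    }
}

// Hypothetical helper: h_frames holds numFrames frames of fftSize samples,
// h_power receives numFrames * (fftSize/2 + 1) power values.
// Returns 0 on success, -1 on any CUDA or cuFFT error.
int powerSpectrumFrames(const float* h_frames, float* h_power,
                        int numFrames, int fftSize)
{
    const int numBins = fftSize / 2 + 1;               // length of an R2C output
    const size_t memsize = sizeof(float) * fftSize;
    const int threads = 256;                           // assumed block size
    const int blocks = (numBins + threads - 1) / threads;
    float* d_data = NULL;
    cufftComplex* d_spec = NULL;
    cufftHandle plan;

    // 1 - init the plan once
    if (cufftPlan1d(&plan, fftSize, CUFFT_R2C, 1) != CUFFT_SUCCESS) return -1;

    // 2, 3 - device allocations
    bool ok = cudaMalloc((void**)&d_data, memsize) == cudaSuccess
           && cudaMalloc((void**)&d_spec, sizeof(cufftComplex) * fftSize) == cudaSuccess;

    // 4 - per-frame work
    for (int f = 0; ok && f < numFrames; ++f) {
        // 4.a - copy one frame from CPU to GPU
        ok = cudaMemcpy(d_data, h_frames + (size_t)f * fftSize, memsize,
                        cudaMemcpyHostToDevice) == cudaSuccess;
        // 4.d - real-to-complex FFT on the GPU
        ok = ok && cufftExecR2C(plan, d_data, d_spec) == CUFFT_SUCCESS;
        if (ok) {
            // 4.e - power spectrum, written over the float buffer (assumption)
            cu_powerSpec<<<blocks, threads>>>(d_spec, d_data, numBins);
            ok = cudaGetLastError() == cudaSuccess;
        }
        // 4.f - copy the result back (this blocking copy also waits for the kernel)
        ok = ok && cudaMemcpy(h_power + (size_t)f * numBins, d_data,
                              sizeof(float) * numBins,
                              cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    // 5 - free everything and destroy the plan (cudaFree(NULL) is a no-op)
    cudaFree(d_spec);
    cudaFree(d_data);
    cufftDestroy(plan);
    return ok ? 0 : -1;
}
```

With this structure every frame costs two memory copies, one cuFFT call and one kernel launch, which is why each row of the profiler output has the same number of calls.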
I’m using the Visual Profiler and obtained this table (with an FFT of 512 points):
method           #calls    gpu time (us)    cpu time (us)    %GPU time
r2c_radix2_sp    128433    13,9722          17,3196          73,1400
cu_powerSpec     128433    2,0796           16,4702          10,8800
memcopy          128433    3,0513                            15,9700
and for 1024 points:
method           #calls    gpu time (us)    cpu time (us)    %GPU time
r2c_radix4_sp    128433    18,53            17,14            76,64
cu_powerSpec     128433    1,97             16,33            8,14
memcopy          128433    3,68                              15,2
(Note that the gpu time and cpu time columns are averages, in microseconds.)
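As a sanity check on how I read the last column: it seems to be each row's share of the summed average GPU time, assuming these three rows account for all the GPU time in the profile. For the FFT row this gives:

```latex
\[
\text{512 points: } \frac{13{,}9722}{13{,}9722 + 2{,}0796 + 3{,}0513}
  = \frac{13{,}9722}{19{,}1031} \approx 73{,}14\,\%
\]
\[
\text{1024 points: } \frac{18{,}53}{18{,}53 + 1{,}97 + 3{,}68}
  = \frac{18{,}53}{24{,}18} \approx 76{,}6\,\%
\]
```

Both agree with the table up to rounding, so the single value left in the memcopy rows besides the GPU time is the percentage and the cpu time entry is the one that is missing.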
- What is CPU time? CUDA_Profiler_2.0.txt says:
"The ‘gputime’ and ‘cputime’ labels specify the actual chip
execution time and the driver execution time (including gputime),
respectively. Note that all times are in microseconds. "
In the case of the FFT, all data is already on the GPU. So how come there is
CPU time for this function?
If the CPU time includes the GPU time, how come the CPU time (17,14) is lower
than the GPU time (18,53) for r2c_radix4_sp when using 1024 points?
- Why doesn’t memcopy differentiate between CPU time and GPU time?
- Why doesn’t cudaMalloc appear in the table?