What do you understand by CPU time? CPU time, computational load, cuda prof


I’m trying to figure out the computational load of a program that computes speech processing features. For the moment the program takes an audio file and computes the power spectrum based on an FFT (with a scalable number of points).

This is the sequence of commands I use:

1-Init fftplan (just once at the beginning)
2-malloc memsize = sizeof(float)*fftSize on the GPU
3-malloc sizeof(cufftComplex)*fftSize on the GPU

4-for each frame:

4.a- copy memsize data from CPU to GPU
4.d- cufftExecR2C from float* to cufftComplex* on the GPU
4.e- run a kernel that computes power spec from cufftComplex* to float* on the GPU
4.f- copy memory back from the GPU

5-free all memory allocations and destroy plan

I’m using the Visual Profiler to obtain this table (with a 512-point FFT):

method          #calls   gpu time   cpu time   %GPU time

r2c_radix2_sp   128433   13,9722    17,3196    73,1400
cu_powerSpec    128433    2,0796    16,4702    10,8800
memcopy         128433    3,0513               15,9700

and for 1024 points:

method          #calls   gpu time   cpu time   %GPU time

r2c_radix4_sp   128433   18,53      17,14      76,64
cu_powerSpec    128433    1,97      16,33       8,14
memcopy         128433    3,68                 15,2

(Note that the gpu time and cpu time columns are averages in microseconds.)
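As a sanity check on the 512-point table, the %GPU time column appears to be each row's share of the summed average GPU times. That reading is inferred from the numbers, not taken from the profiler documentation. A minimal Python sketch, with illustrative function and variable names:

```python
# Average GPU times in microseconds, copied from the 512-point table.
gpu_time_us = {
    "r2c_radix2_sp": 13.9722,
    "cu_powerSpec": 2.0796,
    "memcopy": 3.0513,
}
CALLS = 128433  # number of calls reported for every row


def gpu_share_percent(times):
    """Return each row's share of the summed GPU time, in percent."""
    total = sum(times.values())
    return {name: 100.0 * t / total for name, t in times.items()}


def total_gpu_seconds(times, calls):
    """Total GPU time over all calls, assuming every row ran `calls` times."""
    return sum(times.values()) * calls * 1e-6


if __name__ == "__main__":
    for name, share in gpu_share_percent(gpu_time_us).items():
        print(f"{name:15s} {share:6.2f} %")
    print(f"total GPU time: {total_gpu_seconds(gpu_time_us, CALLS):.2f} s")
```

The computed shares agree with the profiler's column to within rounding, which suggests the percentages are relative to profiled GPU work only and say nothing about host-side time.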

My questions.

  1. What is CPU time? CUDA_Profiler_2.0.txt says:

"The ‘gputime’ and ‘cputime’ labels specify the actual chip
execution time and the driver execution time (including gputime),
respectively. Note that all times are in microseconds. "

In the case of FFT, all data is already on the GPU. So, how come I have
CPU usage for this function?

If the CPU time includes the GPU time, how come the CPU time
is lower than the GPU time for r2c_radix4_sp when using 1024 points?

  2. Why doesn’t memcopy differentiate between CPU time and GPU time?
  3. Why doesn’t cudaMalloc appear in the table?



In your host code, are you using cudaMemcpy or cudaThreadSynchronize? Both spin, polling the device until all threads have completed. It’s the equivalent of putting a while loop around a non-blocking read(2) call.

I’m using cudaMemcpy in steps 4.a and 4.f. I’m not using cudaThreadSynchronize.

But this doesn’t answer what CPU time is for each function, does it?

It’s documented in the CUDA profiler .txt file.

As I remember it, the CPU time includes the GPU time as well.

But I insist: how come, for the 1024-point FFT, the CPU time is lower than the GPU time?

How do you define the CPU time? The time of …?

This is not an answer to your question, but as I remember it, I have seen discrepancies in this time on the order of 200 microseconds.

GPU time is measured by counters inside the GPU. The CPU time is measured from the CPU, which could include interrupt time and so on, so there is some slack in that number.

Because the driver must set up the grid and kernel arguments, bind textures, etc., and copy the configured data to the card before launching the kernel. This accounts for cputime > gputime.
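If cputime really is gputime plus driver overhead, as the profiler text quoted in the question says, then the difference between the two columns is the per-call overhead. A rough Python sketch using the averages from the question's tables (the helper names are illustrative) picks out the one row where that reading breaks down:

```python
# (gputime, cputime) averages in microseconds, from the tables in the question.
rows = {
    ("512", "r2c_radix2_sp"): (13.9722, 17.3196),
    ("512", "cu_powerSpec"): (2.0796, 16.4702),
    ("1024", "r2c_radix4_sp"): (18.53, 17.14),
    ("1024", "cu_powerSpec"): (1.97, 16.33),
}


def driver_overhead_us(gputime, cputime):
    """Overhead implied by cputime = gputime + driver overhead."""
    return cputime - gputime


def inconsistent_rows(table):
    """Rows whose cputime is below gputime, which the definition rules out."""
    return [key for key, (gpu, cpu) in table.items() if cpu < gpu]


if __name__ == "__main__":
    for (points, name), (gpu, cpu) in rows.items():
        print(f"{points:>5s} {name:15s} overhead {driver_overhead_us(gpu, cpu):7.4f} us")
    print("inconsistent:", inconsistent_rows(rows))
```

Only the 1024-point r2c_radix4_sp row comes out with a negative overhead. The driver-setup explanation cannot account for that row.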

I don’t know. There are some known bugs with the profiler in CUDA 2.0, but if I recall correctly they related to the timestamp field. If you can post a minimal code sample that demonstrates the problem, NVIDIA is usually very good about checking it out and filing a bug report in their system.

Presumably because the driver overhead of setting up the DMA transfer is minimal so cputime=gputime. I really don’t know.

It never has.

So if I understand correctly, the times I obtained are not accurate, and this error of 200 microseconds would make the CPU time lower than the GPU time. Is that correct?