CUDA profiling: how to calculate speedup?

Hi everybody,
I have a C program (a molecular dynamics simulator) in which I parallelized some functions.
Now I'd like to evaluate how much faster my new simulator with parallelized CUDA kernels is compared to the pure C implementation.

When I started my project, I used gprof to find the heaviest functions in the C simulator, and I chose which ones could be parallelized efficiently.
Now that this is done, is it possible to do the same evaluation on the mixed C/CUDA simulator (using gprof)?
I've tried adding the -pg flag when compiling with nvcc, and then reading the output file with gprof.
It lists all the functions (the CUDA calls too), but I can't tell whether its measurements are correct.
I got a table with very little time attributed to the CUDA functions (sometimes 0.00 ms), and I don't know if that's because those functions aren't being measured at all, or because they really take so little time that it rounds to 0.00 ms.
If gprof is in fact measuring all the computation, I could use the total execution time it reports for the mixed CUDA/C program as the number to compare against the time of the plain C simulator.

I used the CUDA visual profiler to measure the total execution time on the GPU and to check whether it matched what gprof reported, but I found that each kernel call should take at least 1 ms, so I think gprof is not handling the CUDA calls correctly.
How can I get this measurement? Do I have to time the host and GPU parts separately and combine the results? Can't I use gprof?
Is there any other automatic way?

Another question: I've seen from the profile that these functions are called when I launch the kernel:

cudaError cudaLaunch(char)…
__device_stub__z9…

Are these errors, and if so, how can I fix them?

Thanks for your answers.

The major problem is that CUDA kernel launches are asynchronous, so they don't burn any CPU time, and gprof can't measure the execution time of asynchronous events. You have two alternatives: either make the kernels effectively synchronous by adding host-GPU synchronization barriers (the cudaThreadSynchronize() call in the runtime API), or profile the CUDA functions separately using the CUDA profiling mechanisms. There is a visual profiling application in the toolkit, but you can also use command-line profiling with CUDA. There is a simple example showing how to use both gprof and CUDA profiling together in this thread. It isn't as unified or perfect as just using gprof, but I think you will get what you need from it.
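
For the first alternative, here is a minimal sketch of the barrier approach (the kernel, its body, and the wrapper name are all made up for illustration):

    #include <cuda_runtime.h>

    // Hypothetical stand-in for one of your parallelized MD functions.
    __global__ void compute_forces(const float *pos, float *frc, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            frc[i] = 0.5f * pos[i];   // placeholder for the real force term
    }

    // d_pos and d_frc are assumed to be device pointers.
    void compute_forces_host(const float *d_pos, float *d_frc, int n)
    {
        dim3 block(256);
        dim3 grid((n + block.x - 1) / block.x);

        compute_forces<<<grid, block>>>(d_pos, d_frc, n);

        // The launch returns immediately, so without this barrier gprof sees
        // almost no CPU time here. With it, the host blocks until the kernel
        // has finished, and the elapsed time includes the GPU work.
        cudaThreadSynchronize();
    }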

As for your second question, that is normal. The compilation process uses C++ function name mangling and stub functions to expand runtime API kernel calls into driver API calls, which is why the names and call graph look a bit different from your code.
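
To make that concrete, consider a hypothetical kernel (the name and signature are invented, and the exact mangled string depends on both):

    __global__ void integrate(float *pos, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            pos[i] += 1.0f;   // trivial body, just to illustrate the naming
    }

    // The C++ mangled name of integrate(float*, int) is _Z9integratePfi
    // ("9" = length of "integrate", Pf = float*, i = int). nvcc emits a
    // host-side stub named along the lines of __device_stub__Z9integratePfi,
    // which sets up the arguments and calls cudaLaunch() to start the kernel.
    // Those are the entries you are seeing in the gprof output; nothing is wrong.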

Thank you very much for the fast and complete answer.

So it seems that to compute the full elapsed time I must take the GPU execution time from the CUDA visual profiler and add it to the time computed by gprof, which only considers the host time.
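
For example, with made-up numbers: if the plain C simulator takes 120 s, and for the mixed version gprof reports 35 s of host time while the visual profiler reports 10 s of GPU time, the total is 45 s and the speedup is 120 / 45 ≈ 2.7x. (If host and GPU work ever overlap, adding the two times overestimates the total, so the real speedup would be a bit higher.)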

I hoped it could be more automatic since nvcc supports the -pg flag… but I imagine it only applies to the C code written in the .cu file (nvcc actually uses gcc for the C code in a .cu file, so that's consistent).

Now I have one more problem…
Our situation is this:
I have a C function in a .c file which calls a host function in a .cu file. The host function launches the kernel, which lives in the same .cu file.
gprof computes an elapsed time for the C function; does this time also include the interval in which the GPU is active?

Not unless you do something to make the host function which calls the kernel wait until the kernel has completed (the cudaThreadSynchronize() call I already mentioned).
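
For your .c / .cu split, the arrangement might look like this sketch (file, function, and kernel names are invented; note the extern "C", since nvcc compiles .cu files as C++):

    /* sim.c -- plain C, compiled with gcc -pg */
    extern void step_gpu(float *pos, float *frc, int n);  /* defined in the .cu file */

    void c_fun(float *pos, float *frc, int n)
    {
        /* gprof charges this whole call, GPU interval included, to c_fun
           only if step_gpu blocks until the kernel is done */
        step_gpu(pos, frc, n);
    }

    // sim.cu -- compiled with nvcc; pos and frc are assumed to be device pointers
    #include <cuda_runtime.h>

    __global__ void step_kernel(float *pos, const float *frc, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            pos[i] += frc[i];   // placeholder position update
    }

    extern "C" void step_gpu(float *pos, float *frc, int n)
    {
        step_kernel<<<(n + 255) / 256, 256>>>(pos, frc, n);
        cudaThreadSynchronize();   // wait here, so the GPU time shows up
                                   // in gprof under c_fun
    }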

OK, I'll try this and see if it's consistent with the CUDA visual profiler results. Thanks!