CUDA profiling: how to calculate speedup?

Hi everybody,
I have a C program (a molecular dynamics simulator) in which I parallelized some functions.
Now I'd like to evaluate how much faster my new simulator with parallelized CUDA kernels is compared to the pure C implementation.

When I started my project, I used gprof to find the heaviest functions in the C simulator, and I chose which ones could be parallelized efficiently.
Now that this is done, is it possible to do the same evaluation on the mixed C/CUDA simulator (using gprof)?
I've tried adding the -pg flag when compiling with nvcc, and then reading the output file with gprof.
It lists all the functions (the CUDA calls too), but I can't tell whether its measurements are correct.
I got a table with very little time attributed to the CUDA functions (sometimes 0.00 ms), and I don't know if that's because those functions aren't being measured at all, or because they really take so little time that it rounds to 0.00 ms.
If gprof is in fact measuring all the computation, I could use the total execution time it reports for the mixed CUDA/C program as the number to compare against the time of the plain C simulator.

I used the CUDA visual profiler to measure the total execution time on the GPU and to check whether it matched what gprof reported, but I found that each kernel call should take at least 1 ms, so I think gprof is not handling the CUDA calls correctly.
How can I get this measurement? Do I have to time the host and GPU parts separately and combine the results? Can't I use gprof?
Is there any other automatic way?

Another question: I've seen from the profile that these functions are called when I launch the kernel:

cudaError cudaLaunch(char)…
__device_stub__z9…

Are these errors, and if so, how can I fix them?

Thanks for your answers.

The major problem is that CUDA kernel launches are asynchronous, so they don't burn any CPU time, and gprof can't measure the execution time of asynchronous events. You have two alternatives: either make the kernels effectively synchronous by adding host-GPU synchronization barriers (the cudaThreadSynchronize() call in the runtime API), or profile the CUDA functions separately using the CUDA profiling mechanisms. There is a visual profiling application in the toolkit, but you can also use command-line profiling with CUDA. There is a simple example showing how to use both gprof and CUDA profiling together in this thread. It isn't as unified or perfect as just using gprof, but I think you will get what you need from it.
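
For the first alternative, here is a minimal sketch of the barrier approach (the kernel, its body, and the wrapper name are all made up for illustration):

    #include <cuda_runtime.h>

    // Hypothetical stand-in for one of your parallelized MD functions.
    __global__ void compute_forces(const float *pos, float *frc, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            frc[i] = 0.5f * pos[i];   // placeholder for the real force term
    }

    // d_pos and d_frc are assumed to be device pointers.
    void compute_forces_host(const float *d_pos, float *d_frc, int n)
    {
        dim3 block(256);
        dim3 grid((n + block.x - 1) / block.x);

        compute_forces<<<grid, block>>>(d_pos, d_frc, n);

        // The launch returns immediately, so without this barrier gprof sees
        // almost no CPU time here. With it, the host blocks until the kernel
        // has finished, and the elapsed time includes the GPU work.
        cudaThreadSynchronize();
    }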

As for your second question, that is normal. The compilation process uses C++ function name mangling and stub functions to expand runtime API kernel calls into driver API calls, which is why the names and call graph look a bit different from your code.
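
To make that concrete, consider a hypothetical kernel (the name and signature are invented, and the exact mangled string depends on both):

    __global__ void integrate(float *pos, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            pos[i] += 1.0f;   // trivial body, just to illustrate the naming
    }

    // The C++ mangled name of integrate(float*, int) is _Z9integratePfi
    // ("9" = length of "integrate", Pf = float*, i = int). nvcc emits a
    // host-side stub named along the lines of __device_stub__Z9integratePfi,
    // which sets up the arguments and calls cudaLaunch() to start the kernel.
    // Those are the entries you are seeing in the gprof output; nothing is wrong.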

Thank you very much for the fast and complete answer.

So it seems that to compute the full elapsed time I must take the GPU execution time from the CUDA visual profiler and add it to the time computed by gprof, which only considers the host time.
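
For example, with made-up numbers: if the plain C simulator takes 120 s, and for the mixed version gprof reports 35 s of host time while the visual profiler reports 10 s of GPU time, the total is 45 s and the speedup is 120 / 45 ≈ 2.7x. (If host and GPU work ever overlap, adding the two times overestimates the total, so the real speedup would be a bit higher.)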

I hoped it could be more automatic since nvcc supports the -pg flag… but I imagine it only applies to the C code written in the .cu file (nvcc actually uses gcc for the C code in a .cu file, so that's consistent).

Now I have one more problem…
Our situation is this:
I have a C function in a .c file which calls a host function in a .cu file. The host function launches the kernel, which lives in the same .cu file.
gprof computes an elapsed time for the C function; does this time also include the interval in which the GPU is active?

Not unless you do something to make the host function which calls the kernel wait until the kernel has completed (the cudaThreadSynchronize() call I already mentioned).
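
For your .c / .cu split, the arrangement might look like this sketch (file, function, and kernel names are invented; note the extern "C", since nvcc compiles .cu files as C++):

    /* sim.c -- plain C, compiled with gcc -pg */
    extern void step_gpu(float *pos, float *frc, int n);  /* defined in the .cu file */

    void c_fun(float *pos, float *frc, int n)
    {
        /* gprof charges this whole call, GPU interval included, to c_fun
           only if step_gpu blocks until the kernel is done */
        step_gpu(pos, frc, n);
    }

    // sim.cu -- compiled with nvcc; pos and frc are assumed to be device pointers
    #include <cuda_runtime.h>

    __global__ void step_kernel(float *pos, const float *frc, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            pos[i] += frc[i];   // placeholder position update
    }

    extern "C" void step_gpu(float *pos, float *frc, int n)
    {
        step_kernel<<<(n + 255) / 256, 256>>>(pos, frc, n);
        cudaThreadSynchronize();   // wait here, so the GPU time shows up
                                   // in gprof under c_fun
    }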

OK, I'll try this and see if it's consistent with the CUDA visual profiler results. Thanks!