CUDA Performance Profiling with NVIDIA Nsight in VS2010 - .nvreport report file

I ran a trace of my application.

In the resulting report file, when I select “CUDA -> CUDA Summary” in the drop-down, the Runtime API Calls item in the table shows:

% Time - 80.66


% Device Time - 15.46

All the other time percentages are nearly 0%.

So my first question is: where is the remaining 19.34% of Time and 84.54% of Device Time? Or are these two columns percentages of completely different ‘Total Time’ values?

I used thrust vectors to copy my data back and forth between host and device. In the “Memory Copy” section of this report, all the % Time values for the memory copies in my run are negligible.

But when I click the ‘summary’ link of the Runtime API Calls item (the one with a % Time value as high as 80.66), I immediately see the culprit on the ‘Runtime API Calls Summary’ page: ‘cudaMemcpy’, with a ‘Capture Time %’ value as high as 73.75.

So my further questions are:

Does this mean that my bottleneck is still those calls to thrust::copy(), even though the “Memory Copy” section of the report doesn’t show it?
How can I find the exact function call that is the most expensive for me in general?
How does the timeline feature help with any of this?
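
A rough way to cross-check what the report says about the copies, independent of Nsight, would be to time a single thrust::copy directly with CUDA events. The following is only a minimal sketch under stated assumptions: the helper name time_h2d_copy_ms, the float element type, and the vector size are all made up for illustration and are not taken from the traced application, and error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// Hypothetical helper (not part of Thrust or Nsight): returns the elapsed
// time in milliseconds between two CUDA events recorded around one
// host-to-device thrust::copy.
float time_h2d_copy_ms(const thrust::host_vector<float>& src,
                       thrust::device_vector<float>& dst)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                          // mark the start on the default stream
    thrust::copy(src.begin(), src.end(), dst.begin());  // the copy being measured
    cudaEventRecord(stop, 0);                           // mark the end on the default stream
    cudaEventSynchronize(stop);                         // wait until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const size_t n = 1 << 20;  // assumed size, chosen only for illustration
    thrust::host_vector<float> h(n, 1.0f);
    thrust::device_vector<float> d(n);

    float ms = time_h2d_copy_ms(h, d);
    std::printf("thrust::copy host->device: %.3f ms for %lu floats\n",
                ms, (unsigned long)n);
    return 0;
}
```

Comparing a number like this against the application's total run time gives an independent estimate of how much the copies cost. It only approximates what the profiler measures, so it is a sanity check rather than a replacement for the report.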

See for a detailed answer.