Visual profiler missing information


I ran the visual profiler (both 1.0 and 1.1) on my application, but it doesn’t show all the memory copy operations. It only shows the copies from the device to the host. Is this intended behaviour?

Additionally, in contrast to the consensus here, the CPU time is less than the GPU time. The difference is pretty much consistent (8-12 us). Can someone explain why this is the case?


I noticed the same thing in my tests. I even posted the question but got no answers.

I noticed the same with the “MatrixMul” example from the NVIDIA CUDA SDK projects. I’m using version 2.2.05 of the profiler.

Same issue here, though I thought it was only a problem for the asynchronous memory copies I was using.

Edit: Clarification: I see no memory copies at all when using asynchronous copies.

I think that is the expected behaviour for async transfers (i.e. they are not shown in the profiler). For the others, check whether the “missing” copies are async.
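If you want to confirm that a copy really runs even when the profiler doesn’t list it, one option is to time it yourself with CUDA events. The following is only a rough sketch: `time_h2d_copy` is a name I made up, it uses the default stream, and it assumes pinned host memory so that the asynchronous path can actually be asynchronous. It doesn’t tell you anything about how the profiler decides what to display.

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper (not part of any SDK): times one host-to-device copy
// of `bytes` bytes with CUDA events on the default stream.
// Returns the elapsed time in milliseconds, or -1.0f on any CUDA error.
float time_h2d_copy(size_t bytes, bool use_async)
{
    void *h_buf = nullptr;
    void *d_buf = nullptr;
    cudaEvent_t start, stop;
    float ms = -1.0f;

    // Pinned (page-locked) host memory, needed for a truly asynchronous copy.
    if (cudaMallocHost(&h_buf, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return -1.0f;
    }
    std::memset(h_buf, 0, bytes);

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Bracket the copy with events recorded in the same stream.
    cudaEventRecord(start, 0);
    cudaError_t err = use_async
        ? cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, 0)
        : cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);

    // Wait until the stop event (and therefore the copy) has completed.
    cudaEventSynchronize(stop);

    if (err == cudaSuccess &&
        cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) {
        ms = -1.0f;
    }
    if (err != cudaSuccess) ms = -1.0f;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return ms;
}
```

Call it from `main` once with `use_async` set to `false` and once with `true` (for example with a 1 MB buffer), build with `nvcc`, and compare the two timings against what the profiler reports for the same run.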

This isn’t just an issue with the 2.2.x profiler; it’s been happening since the 1.x profilers too.

You’ll find that if you run the profiler against it enough times, you’ll eventually get the missing data. It appears to vary from kernel to kernel.

I have one kernel where I’ve never been able to get an instruction count, and others where I only occasionally get memory transaction counts, just as an example.

It also depends on which kernel(s) have run before it. In my experience, I tend to get the most reliable results if I run the kernel on its own (if at all possible).
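For anyone who wants to try the “run it on its own” approach, here is a minimal sketch of a harness that launches a single kernel and nothing else, and records both the GPU time (via CUDA events) and the host wall-clock time around the launch. `scale_kernel`, `KernelTiming` and `time_kernel_alone` are placeholder names of mine; swap in your own kernel. Note that these numbers are my own measurements and are not necessarily the same thing as the profiler’s CPU and GPU time columns.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the kernel under test.
__global__ void scale_kernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical result type: GPU time from events, host wall-clock time,
// and a flag saying whether every checked CUDA call succeeded.
struct KernelTiming {
    float gpu_ms;
    float cpu_ms;
    bool ok;
};

KernelTiming time_kernel_alone(int n)
{
    KernelTiming t = { 0.0f, 0.0f, false };
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) return t;
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Make sure the setup work has finished before timing starts.
    cudaDeviceSynchronize();

    auto cpu_start = std::chrono::steady_clock::now();
    cudaEventRecord(start, 0);
    scale_kernel<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaError_t launch_err = cudaGetLastError();
    cudaEventRecord(stop, 0);
    // Kernel launches return immediately, so wait for completion
    // before stopping the host clock.
    cudaEventSynchronize(stop);
    auto cpu_stop = std::chrono::steady_clock::now();

    t.cpu_ms = std::chrono::duration<float, std::milli>(cpu_stop - cpu_start).count();
    t.ok = (launch_err == cudaSuccess) &&
           (cudaEventElapsedTime(&t.gpu_ms, start, stop) == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return t;
}
```

Calling `time_kernel_alone(1 << 20)` from `main` and printing `gpu_ms` and `cpu_ms` over several runs gives you a baseline to compare against the profiler output for the same isolated kernel.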