CUDA 2.0 and Visual Profiler 1.0.11: cudaprof reports are missing memcopies

On a Linux node installed with a GeForce 8600, CUDA 1.0 and Visual Profiler 1.0 (Alpha January 2008), the Visual Profiler reports included both “memcopy” and kernel timings for all our benchmarks, as expected.

Then we moved the GeForce 8600 into a new Linux node installed with CUDA 2.0 and Visual Profiler 1.0.11. We also upgraded the old Linux node to CUDA 2.0, left Visual Profiler 1.0 (Alpha January 2008) in place on it, and installed a Tesla C870 in that upgraded node.

Now the Visual Profiler 1.0 (Alpha January 2008) reports for the Tesla runs include both the memcopy and kernel timings, but the Visual Profiler 1.0.11 reports for the GeForce 8600 runs include only the kernel timings and are missing the “memcopy” timings.

Our benchmarks pass on the GeForce 8600 under CUDA 2.0 (and passed under CUDA 1.0), and they pass on the Tesla C870.

Did we install the proper Visual Profiler (version 1.0.11) for CUDA 2.0?

Just out of curiosity: is it missing all memcpys, or is just the count of memcpys lower than the actual number of memcpys done?

It’s missing all memcopy results, whether the host PC logic employs cudaMemcpy() or cudaMemcpyAsync().

I’ve seen this behavior as well; it gets the first three, but misses the subsequent 2000 following the first kernel invocation.

Interesting, it works fine for me. Have you tried enabling all profiler counters in session settings->configuration? (It needs 3 passes then.) I know that, for example, not enabling “time stamp” has had several weird side effects for me (e.g. CPU time is still displayed but with nonsensical values).

This morning, I tried enabling the profiler counters as well as the timestamp, and still got no memcopy results from cudaprof (version 1.0.11).

I neglected to mention that these benchmarks are streaming benchmarks that employ cudaStream_t objects for concurrent PC/GPU global memory I/O and kernel function executions. I varied the number of cudaStream_t objects from 1 to 8, and cudaprof failed to report any memcopy results.
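For anyone trying to reproduce this, here is a minimal sketch of the kind of streaming pattern described above. It is not our actual benchmark code: the kernel `scale`, the helper `run_streamed`, and the buffer sizes are all hypothetical stand-ins. Each stream issues a host-to-device cudaMemcpyAsync(), a kernel launch, and a device-to-host cudaMemcpyAsync() on its own chunk of a pinned host buffer.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the benchmark's real work.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: splits one buffer across num_streams streams.
// Returns the number of elements that did not come back doubled,
// or -1 if a CUDA call failed.
int run_streamed(int num_streams, int chunk)
{
    const int total = num_streams * chunk;
    const size_t chunk_bytes = chunk * sizeof(float);
    float *h_buf = NULL;
    float *d_buf = NULL;
    std::vector<cudaStream_t> streams(num_streams);

    // Asynchronous copies need page-locked (pinned) host memory.
    if (cudaMallocHost((void **)&h_buf, total * sizeof(float)) != cudaSuccess)
        return -1;
    if (cudaMalloc((void **)&d_buf, total * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return -1;
    }
    for (int i = 0; i < total; ++i) h_buf[i] = 1.0f;
    for (int s = 0; s < num_streams; ++s) cudaStreamCreate(&streams[s]);

    const int threads = 256;
    const int blocks = (chunk + threads - 1) / threads;

    // Per stream: copy in, run the kernel, copy out.
    for (int s = 0; s < num_streams; ++s) {
        float *h = h_buf + s * chunk;
        float *d = d_buf + s * chunk;
        cudaMemcpyAsync(d, h, chunk_bytes, cudaMemcpyHostToDevice, streams[s]);
        scale<<<blocks, threads, 0, streams[s]>>>(d, chunk, 2.0f);
        cudaMemcpyAsync(h, d, chunk_bytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for every stream to drain before checking the results.
    bool ok = true;
    for (int s = 0; s < num_streams; ++s) {
        ok = (cudaStreamSynchronize(streams[s]) == cudaSuccess) && ok;
        cudaStreamDestroy(streams[s]);
    }

    int bad = 0;
    for (int i = 0; i < total; ++i)
        if (h_buf[i] != 2.0f) ++bad;

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return ok ? bad : -1;
}
```

Calling something like `run_streamed(4, 1 << 16)` from main() issues two async copies per stream, so a profiler that records them should show eight memcopy entries for that call.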

Yesterday, I executed other non-streaming benchmarks on the GeForce 8600. Here, cudaprof reports all the memcopy results as expected.

The working cudaprof (Alpha Version 1.0 January 2008) profiles benchmarks on the Tesla C870, which does not support overlapping asynchronous PC/GPU I/O with kernel executions, so all the streaming benchmarks degrade to serial, synchronous PC/GPU global memory I/O and kernel executions, one stream at a time.
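To confirm which card overlaps copies with kernels, the runtime can be queried directly. A minimal sketch follows; the helper name `device_can_overlap` is hypothetical. It assumes a toolkit recent enough to provide cudaDeviceGetAttribute() and cudaDevAttrGpuOverlap, which is newer than the CUDA 2.0 install discussed here. On toolkits of that era, the deviceOverlap field filled in by cudaGetDeviceProperties() carried the same information.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if the device can overlap host/device
// copies with kernel execution, 0 if it cannot, -1 if the query failed.
int device_can_overlap(int device)
{
    int overlap = 0;
    if (cudaDeviceGetAttribute(&overlap, cudaDevAttrGpuOverlap, device) != cudaSuccess)
        return -1;
    return overlap ? 1 : 0;
}
```

A result of 0 would match the serialized behavior described above for the Tesla C870, and a result of 1 is what I would expect for the GeForce 8600, though I have not run this on either node.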

After our IT administrator installs a copy of the older cudaprof on the GeForce node, I will attempt profiling the streaming benchmarks on the GeForce 8600, and see what I get…