Visual Profiler makes bandwidth 6x faster ?!?

Hello,

When the Visual Profiler from CUDA Toolkit 7 RC1 is running and I run Skybuck’s Test CUDA Memory Performance Bandwidth application, the reported bandwidth is suddenly 6 times higher! It’s 12 GB/sec instead of 2 GB/sec (on a GT 520).

GPU-Z shows the GPU memory load reaching something like 70% while the Visual Profiler is running, and only 20% when it is not.

What is going on here? To me it seems that perhaps Windows or the graphics driver is intentionally limiting the GPU memory load, as if it’s being bottlenecked. Or perhaps the Visual Profiler uses some trick to enable a higher GPU memory load? Or is it some kind of feature being enabled inside the graphics driver?!

I’d like an explanation, please!

Bye,
Skybuck.

The Visual Profiler does various non-obvious things when profiling your code. For example, it may re-run kernels, which your application would not normally do. This can affect performance measurements you build into your code. Furthermore, re-running a bandwidth-intensive kernel could easily cause the GPU-Z measurement to go up.

If you want to look at performance characteristics while the Visual Profiler is running, you should look at the specific operations you care about in the timeline and gather performance metrics on them directly from within the Visual Profiler, either from the data that has already been captured or by running specific experiments (e.g. kernel bandwidth analysis).
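As a rough illustration of that advice, a kernel duration read off the timeline can be turned into an effective bandwidth figure by hand and compared with what the application itself reports. The sketch below is plain host-side C++; the function name `effective_bandwidth_gb_per_s` is made up for illustration, and it assumes you know how many bytes the kernel reads and writes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of any CUDA API): converts the bytes a kernel
// moved and its measured duration into an effective bandwidth in GB/s.
// Assumes 1 GB = 1e9 bytes.
double effective_bandwidth_gb_per_s(std::uint64_t bytes_read,
                                    std::uint64_t bytes_written,
                                    double elapsed_ms) {
    assert(elapsed_ms > 0.0);  // a zero or negative duration is a measurement error
    const double total_bytes = static_cast<double>(bytes_read + bytes_written);
    const double elapsed_s = elapsed_ms / 1000.0;
    return total_bytes / elapsed_s / 1e9;
}
```

For instance, a kernel that reads 100,000,000 bytes and writes 100,000,000 bytes in 100 ms works out to about 2 GB/s. In application code the elapsed time would typically come from something like `cudaEventElapsedTime`, which reports milliseconds.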

The Visual Profiler does not produce any results. It runs, a bar is displayed… collecting measurements etc… then my bandwidth test runs… It’s my bandwidth test that reports the high bandwidth in its graph… and it’s GPU-Z that reports a memory controller load of 70% instead of the usual 20%.

For now I cannot evaluate whether the profiler and/or the bandwidth test is running correctly, because the test doesn’t really compute anything… so there is no way to check whether the results are off… even if they were the same… Perhaps the Visual Profiler runs some part of the CUDA kernel in CPU memory or so… or perhaps it uses other tricks.

Also, why can’t the Visual Profiler profile my application, which uses the CUDA driver API?

Skybuck,

The Visual Profiler can trace a CUDA application and profile CUDA kernels for applications that use the CUDA runtime or CUDA driver API.

If you are using a third-party tool to monitor memory bandwidth utilization, you will see different behavior in trace mode and in profiling mode. In trace mode you should see a negligible change in bandwidth. In profiling mode the Visual Profiler (nvprof/CUPTI) enters kernel replay mode in order to collect all of the performance counters. Kernel replay saves all mutable memory before launching the kernel and restores it after the kernel completes. The save logic tries to save device memory to available memory in performance order: device memory, pinned system memory, paged system memory, and finally disk. If your application allocates all device memory (which I believe it does), then all device memory has to be saved to system memory or disk. This can result in extremely long capture times. If you are monitoring memory bandwidth during a save or restore, you should see periods of maximum device bandwidth followed by reduced bandwidth as data is copied from device memory to system memory or back.
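To make the save order concrete, here is a toy model in plain C++. It is not the CUPTI implementation: only the tier order comes from the description above, while the "first tier with enough free space" rule and every name in the code are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only, NOT the real CUPTI logic. Tiers are listed fastest first,
// in the order given in the post above.
enum class SaveTier { DeviceMemory, PinnedSystemMemory, PagedSystemMemory, Disk };

// Hypothetical snapshot of how much space is free in each tier.
struct FreeSpace {
    std::uint64_t device_bytes;
    std::uint64_t pinned_bytes;
    std::uint64_t paged_bytes;
};

// Pick the fastest tier that can hold the data to be saved before a replay.
// Assumption: the whole save goes to a single tier.
SaveTier pick_save_tier(std::uint64_t bytes_to_save, const FreeSpace& free_space) {
    if (bytes_to_save <= free_space.device_bytes) return SaveTier::DeviceMemory;
    if (bytes_to_save <= free_space.pinned_bytes) return SaveTier::PinnedSystemMemory;
    if (bytes_to_save <= free_space.paged_bytes)  return SaveTier::PagedSystemMemory;
    return SaveTier::Disk;  // slowest fallback
}
```

In this model, an application that has already allocated nearly all device memory leaves too little free device memory for the save, so the data spills to system memory, which is the slow case described above.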

The Visual Profiler works with the CUDA runtime and CUDA driver API. If you are unable to collect results then I would recommend you post a reproducible so the development team can look into the issue.

I would recommend quickly changing the application to allocate a small portion of device memory and see if the Visual Profiler starts to work. If this is the case then either (a) save/restore is taking significant time, or (b) there is a bug in the tool.
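One way to try this is to size the test buffer as a fraction of the free device memory rather than all of it. The helper below is a minimal host-side C++ sketch; `reduced_alloc_bytes` and its 256-byte default alignment are illustrative assumptions, and the free-memory figure is assumed to come from `cudaMemGetInfo` (runtime API) or `cuMemGetInfo` (driver API) in the real application.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: given the free device memory in bytes, return a test
// allocation size that is only `fraction` of it, rounded down to a multiple
// of `alignment`. The default alignment is an illustrative choice.
std::size_t reduced_alloc_bytes(std::size_t free_bytes, double fraction,
                                std::size_t alignment = 256) {
    assert(fraction > 0.0 && fraction <= 1.0);
    assert(alignment > 0);
    const std::size_t target =
        static_cast<std::size_t>(static_cast<double>(free_bytes) * fraction);
    return target - (target % alignment);  // round down to the alignment
}
```

With 1 GiB free and a fraction of 0.25, this gives a 256 MiB buffer, leaving the profiler plenty of device memory for its own save and restore.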

Here is the “reproducible”.

http://www.skybuck.org/CUDA/BandwidthTest/version%200.09/Packed/TestCudaMemoryBandwidthPerformance.rar

If you guys have a GT 520 and CUDA Toolkit 7 RC1 installed with a matching driver, you can test it for yourselves and see if it works. On my system (Windows 7 Ultimate x64) it does not work.

Maybe I could try to build a 64-bit version instead of the 32-bit version, but I doubt that will solve anything ;).