Visual Profiler makes bandwidth 6x faster ?!?

Skybuck · February 18, 2015, 2:24pm

Hello,

When the Visual Profiler from CUDA Toolkit 7 RC1 is running and I run Skybuck’s Test CUDA Memory Performance Bandwidth application, the reported bandwidth is suddenly 6 times higher! It’s 12 GB/sec instead of 2 GB/sec (on a GT 520).

GPU-Z shows the GPU memory controller load reaching about 70% while the Visual Profiler is running, and only about 20% when it is not.

What is going on here? To me it seems that perhaps Windows or the graphics driver is intentionally limiting the GPU memory load, as if it were being bottlenecked. Or perhaps the Visual Profiler uses some trick to enable a higher GPU memory load? Or is some kind of feature being enabled inside the graphics driver?

I’d like an explanation, please!

Bye,
Skybuck.

Robert_Crovella · February 18, 2015, 2:41pm

The Visual Profiler does various non-obvious things when profiling your code. For example, it may re-run kernels, something your application would not normally do. This can have an impact on performance measurements you build into your code. Furthermore, re-running a bandwidth-intensive kernel could easily cause the GPU-Z measurement to go up.

If you want to look at performance characteristics while the Visual Profiler is running, you should look at the specific operations you care about in the timeline and gather performance metrics on them directly from within the Visual Profiler, either based on the data that is already captured or by running specific experiments (e.g. kernel bandwidth analysis).

Skybuck · February 18, 2015, 3:09pm

The Visual Profiler does not produce any results. It runs, a bar is displayed… collecting measurements etc… then my bandwidth test runs… it’s my bandwidth test that reports high bandwidth in its graph… it’s GPU-Z which reports a memory controller load of 70% instead of the usual 20%.

For now I cannot evaluate whether the profiler and/or the bandwidth test is running correctly, because the test doesn’t really compute anything… so there is no way to check if that is off… even if it were the same… perhaps the Visual Profiler runs some part of the CUDA kernel in CPU memory or so… or perhaps it uses other tricks.

Also, why can the Visual Profiler not profile my application, which uses the CUDA driver API?

Greg · February 18, 2015, 6:37pm

Skybuck,

The Visual Profiler can trace a CUDA application and profile CUDA kernels for applications that use the CUDA runtime or CUDA driver API.

If you are using a third-party tool to monitor memory bandwidth utilization, you will see different behavior in trace mode and in profiling mode. In trace mode you should see negligible change in bandwidth. In profiling mode the Visual Profiler (nvprof/CUPTI) will enter kernel replay mode in order to collect all of the performance counters. Kernel replay saves all mutable memory before launching the kernel and restores it after the kernel completes. The save logic tries to save device memory to available memory in performance order: device memory, pinned system memory, paged system memory, and finally disk. If your application allocates all device memory (which I believe your application does), then all device memory will need to be saved to system memory or disk. This can result in extremely long capture times. If you are monitoring memory bandwidth during save or restore, you should see periods of maximum device bandwidth followed by reduced bandwidth as data is copied from device memory to system memory or back.
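
The save order described above can be pictured as a greedy fill across memory tiers. What follows is a minimal illustrative sketch in Python, not the actual CUPTI implementation: the function name `plan_save`, the tier labels, and the capacities in the usage lines are all hypothetical, and the real save logic may place or split data differently.

```python
from collections import OrderedDict

# Tiers in the performance order described in the post (fastest first).
# The labels are illustrative, not identifiers from CUPTI or nvprof.
TIERS = ("device", "pinned_system", "paged_system", "disk")


def plan_save(bytes_to_save, free_bytes_by_tier):
    """Greedily assign bytes to the fastest tiers that still have room.

    Hypothetical model: returns an ordered mapping tier -> bytes placed there.
    Raises ValueError if the tiers together cannot hold the data.
    """
    if bytes_to_save < 0:
        raise ValueError("bytes_to_save must be non-negative")
    plan = OrderedDict()
    remaining = bytes_to_save
    for tier in TIERS:
        if remaining == 0:
            break
        room = free_bytes_by_tier.get(tier, 0)
        placed = min(remaining, room)
        if placed > 0:
            plan[tier] = placed
            remaining -= placed
    if remaining > 0:
        raise ValueError("not enough space in any tier to save mutable memory")
    return plan


MiB = 1024 * 1024

# An application that leaves most device memory free: the save stays on the GPU.
small_app = plan_save(64 * MiB, {"device": 900 * MiB, "pinned_system": 512 * MiB})

# An application that allocated all device memory: everything spills to system memory.
greedy_app = plan_save(
    1000 * MiB,
    {"device": 0, "pinned_system": 512 * MiB, "paged_system": 4096 * MiB},
)
```

In the second case every save and restore has to move the data between device memory and system memory, which is the situation where long capture times would be expected.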

The Visual Profiler works with the CUDA runtime and CUDA driver API. If you are unable to collect results then I would recommend you post a reproducible so the development team can look into the issue.

I would recommend quickly changing the application to allocate a small portion of device memory and see if the Visual Profiler starts to work. If this is the case then either (a) save/restore is taking significant time, or (b) there is a bug in the tool.
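
One way to try this is to size the test buffer as a fraction of the free device memory instead of all of it. The sketch below shows only the size arithmetic, in Python; the helper name `allocation_size`, the 10% default, and the 256-byte rounding are assumptions made for illustration. In a driver API application the free-byte count would come from `cuMemGetInfo`, and the resulting size would be passed to `cuMemAlloc`.

```python
def allocation_size(free_bytes, percent=10, granularity=256):
    """Return a buffer size that uses only `percent` of the free bytes.

    The result is rounded down to a multiple of `granularity`.
    All defaults are illustrative, not values required by CUDA.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    if free_bytes < 0 or granularity <= 0:
        raise ValueError("free_bytes must be >= 0 and granularity > 0")
    target = free_bytes * percent // 100  # integer math avoids float rounding
    return (target // granularity) * granularity


# Hypothetical free-memory figure; a real application would query the driver.
free_bytes = 900 * 1024 * 1024
size = allocation_size(free_bytes)
```

If the Visual Profiler starts producing results with the smaller buffer, that points toward case (a), save/restore time, rather than a bug in the tool.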

Skybuck · February 18, 2015, 6:57pm

Here is the “reproducible”.

http://www.skybuck.org/CUDA/BandwidthTest/version%200.09/Packed/TestCudaMemoryBandwidthPerformance.rar

If you guys have a GT 520 and CUDA Toolkit 7 RC1 installed with a matching driver, you can test for yourself and see if it works. On my system (Windows 7 Ultimate x64) it does not work.

Maybe I could try to make a 64-bit version instead of a 32-bit version, but I doubt that will solve anything ;).
