Preview of the NVIDIA Visual Profiler

Here’s an initial download of the CUDA Visual Profiler.
See the full README below.
For Linux: tar xvfz CudaVisualProfiler_0.1_beta_linux.tar.gz
For Windows: extract the contents of the zip archive.

NVIDIA CUDA Visual Profiler
Version 0.1 Beta

Published by
NVIDIA Corporation
2701 San Tomas Expressway
Santa Clara, CA 95050




Information furnished is believed to be accurate and reliable. However,
NVIDIA Corporation assumes no responsibility for the consequences of use
of such information or for any infringement of patents or other rights
of third parties that may result from its use. No license is granted by
implication or otherwise under any patent or patent rights of NVIDIA
Corporation. Specifications mentioned in this publication are subject
to change without notice. These materials supersede and replace all
information previously supplied. NVIDIA Corporation products are not
authorized for use as critical components in life support devices or
systems without express written approval of NVIDIA Corporation.


NVIDIA, CUDA, and the NVIDIA logo are trademarks or registered trademarks
of NVIDIA Corporation in the United States and other countries. Other
company and product names may be trademarks of the respective companies
with which they are associated.


© 2007-2008 by NVIDIA Corporation. All rights reserved.


  • Execute a CUDA program with profiling enabled and view the profiler output
    as a table. The table has the following columns for each GPU method:
    . timestamp : start time stamp
    . method : GPU method name; this is either “memcopy” for memory copies
      or the name of a GPU kernel
    . GPU Time
    . CPU Time
    . Profiler counters:
      gld_incoherent : number of non-coalesced global memory loads
      gld_coherent : number of coalesced global memory loads
      gst_incoherent : number of non-coalesced global memory stores
      gst_coherent : number of coalesced global memory stores
      local_load : number of local memory loads
      local_store : number of local memory stores
      branch : number of branch events (instruction and/or sync stack)
      divergent_branch : number of divergent branches within a warp
      instructions : number of dynamic instructions (counted at fetch)
      warp_serialize : number of thread warps that serialize based on
      address conflicts to shared (GRF) or constant memory
      cta_launched : number of CTAs launched on the PM TPC

       Please refer to the "Interpreting Profiler Counters" section below for
       more information on profiler counters.
       Note that profiler counters are also referred to as profiler signals.
  • Display the summary profiler table. It has the following columns for each
    GPU method:
    . method name
    . number of calls
    . total GPU time
    . percentage of GPU time
    . total counts for each profiler counter

  • Display various kinds of plots:

    • Summary profiling data bar plot
    • GPU Time Height plot
    • GPU Time Width plot
    • Profiler counter bar plot
    • Profiler output table column bar plot
  • Analysis of profiler output - lists methods with a high number of:
    . incoherent stores
    . incoherent loads
    . warp serializations

  • Compare profiler output for multiple program runs of the same program
    or for different programs. Each program run is referred to as a session.

  • Save profiling data for multiple sessions. A group of sessions is referred
    to as a project.

  • Import/Export CUDA Profiler CSV format data


  • Summary profiling data bar plot
    . One bar for each method
    . Bars sorted in decreasing GPU time
    . Bar length is proportional to the cumulative GPU time for a method

  • GPU Time Height Plot:
    It is a bar diagram in which the height of each bar is proportional
    to the GPU time for a method and a different bar color is assigned for
    each method. A legend is displayed which shows the color assignment for
    different methods. The width of each bar is fixed and the bars are displayed
    in the order in which the methods are executed.

    When the “fit in window” option is enabled the display is adjusted so as
    to fit all the bars in the displayed window width. In this case bars for
    multiple methods can overlap. The overlapped bars are displayed in decreasing
    order of height so that all the different bars are visible.

    When the “Show CPU Time” option is enabled the CPU time is shown as a
    bar in a different color on top of the GPU time bar. The height of this
    bar is proportional to the difference of CPU time and GPU time for the method.

  • GPU Time Width Plot:
    It is a bar diagram in which the width of each bar is proportional
    to the GPU time for a method and a different bar color is assigned for
    each method. A legend is displayed which shows the color assignment for
    different methods. The bars are displayed in the order in which the
    methods are executed. When time stamps are enabled the bars are positioned
    based on the time stamp.
    The height of each bar is based on the option chosen:
    a) fixed height : height is fixed.
    b) height proportional to instruction issue rate : the instruction issue
    rate for a method is equal to the profiler “instructions” counter value
    divided by the GPU time for the method.
    c) height proportional to incoherent load + store rate : the incoherent
    load + store rate for a method is equal to the sum of the profiler
    “gld_incoherent” and “gst_incoherent” counter values divided by the GPU
    time for the method.
  • Profiler counter bar plot
    It is a bar plot for profiler counter values for a method from the profiler
    output table or the summary table.
    . One bar for each profiler counter
    . Bars sorted in decreasing profiler counter value
    . Bar length is proportional to profiler counter value

  • Profiler output table column bar plot
    It is a bar plot for any column of values from the profiler output
    table or summary table
    . One bar for each row in the table
    . Bars sorted in decreasing column value
    . Bar length is proportional to column value
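The derived rates described above (for example, instruction issue rate = “instructions” counter value divided by GPU time) can also be computed by hand from a CSV export. A minimal sketch with awk, where the file contents and column layout are made up for illustration; check the header of your own export:

```shell
# Sketch: compute instruction issue rate (instructions / gputime) per method
# from a CSV export. The sample file below is illustrative, not real data.
cat > sample.csv <<'EOF'
method,gputime,instructions
kernelA,120.5,48000
kernelB,300.0,90000
EOF

# Skip the header row, then divide column 3 by column 2 for each method.
awk -F, 'NR > 1 { printf "%s %.1f\n", $1, $3 / $2 }' sample.csv
```

The same pattern works for the incoherent load + store rate by summing the relevant counter columns before dividing.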

Is it possible to profile CUDA code that is being called from MATLAB (by the CUDA-MATLAB plugin)? Otherwise I have to build quite a bit of infrastructure to be able to call my CUDA code with the right inputs from a C program…

Unfortunately not right now, unless you can generate an executable from the matlab code (I think matlab can do this, but I’m no expert).

I have already filed a feature request for profiling external programs as is supported by some CPU profilers.


Thanks for the preview version guys. It’s already a pretty useful tool. One feature I’d suggest for the Summary Table is some kind of % calculated for gld_, gst_ and branch/divergent_branch. Right now, looking at my table, it takes a lot of counting digits (or calculator punching) to tell if 48219562 divergent branches is a significant fraction of 531144446 total branches or not.
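Until such a percentage column exists, the fraction from the numbers in the post above can be computed quickly in a shell, e.g.:

```shell
# Divergent branches as a percentage of total branches
# (figures taken from the post above).
awk 'BEGIN { printf "%.1f%%\n", 100 * 48219562 / 531144446 }'
```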

Thanks for this suggestion. I’m filing a feature request.


Ahh thanks! MATLAB can indeed generate an executable if you have bought the right toolbox. But I am not sure whether that works with mex files (which are of course needed for using CUDA from MATLAB). I’ll check the docs; then I can compare the price of the toolbox to the cost of writing a C wrapper myself.

You can profile a mex file with the CUDA profiler, if you have a MATLAB script with no GUI.

Just follow these steps in the session setup.

  1. In Launch, specify “/usr/local/bin/matlab”
  2. In Working directory, select the directory of your MATLAB script
  3. In Arguments, specify: -nojvm -nosplash -r name_of_matlab_script (with no .m)

Start the profiler.

It worked fine for me under Linux.

Ahhh, that is great! I think the program I am speeding up does indeed have options to turn off all plotting. I’ll check with my colleague Friday. That saves me a lot of work indeed (reading .mat files in C was not something I was really looking forward to :D )


You can have plots; the session will run from the command line.
I am not sure it will work if you are passing parameters through a GUI.

You may want to add an explicit exit at the end of the script if the profiler needs multiple runs.

Thanks for this release guys. It’s pretty cool.

A few thoughts though.

  • I can’t run my CUDA kernels from the interface since they’re part of a much bigger piece of software and my kernels are gathered into a shared library. Fair enough. But it would be nice if we could import a .csv file directly when we open the software. At the moment I have to create an empty project (so I click on New project and then cancel the window that pops up) to get the import function available. It’s just a detail.

  • And why doesn’t the ‘Analyse signals’ function show the global CPU time? It would be nice to have that as well, to be able to average it (it’s actually the meaningful time to me).

RE: your first request, good idea, we’ll consider that.

RE: your second request, what exactly do you mean by “global” CPU time? Do you just mean the CPU time for each kernel launch?




Yes. What I mean by global is the total CPU time for all the calls, just as we can read the total GPU time. But maybe you can already configure what appears when you click on ‘Analyse signals’ and I didn’t find it. Actually I can see the following columns: method, calls, GPU time, %GPU time, divergent_branch. But I profile more than that, and CPU time, gld_incoherent, gst_incoherent and warp_serialize are not represented when I analyse the signals, for instance. Is that normal? I found settings only for plots, not for the summary table. It would be nice to have everything you profile show up in the summary table.

First, thanks for this tool, it is very useful!
It helped me optimize the performance of a kernel I was designing, as it showed that the float4 loads were all non-coalesced (which I didn’t know). By changing them to float2 loads and using shared memory to redistribute the data to the right threads, I was able to speed it up by more than 2X!

I have a few suggestions that I think would make it an even better tool:

1- When displaying C++ kernel names, it would be more readable to demangle them (see c++filt and the GCC function abi::__cxa_demangle), with an option to only keep the method’s name and discard the namespace and arguments. Note that this could be done under Linux by post-processing the profiler output file using “c++filt -p | sed ‘s/^.*:://’”.
2- H2D, D2H and D2D memcpy operations should be distinguished, and it would be interesting to compute how much bandwidth is used in each direction (although we can compute it without using the profiler).
3- In cases where the GPU is used by both CUDA and OpenGL, it would be very interesting to know how much time is spent in each to know the percentage taken by rendering versus GPGPU computations. If it is not possible, then maybe recording buffer map/unmap operations would be helpful.
4- To be able to analyse time spent on the CPU-side as well as on the GPU, it would be helpful to record custom events/methods, using the same timestamps as GPU kernel. What I would like is to be able to see a second bar in the time width plot, showing CPU-side computations, which would show CPU-based pre/post processing steps, or which phase/dataset is currently being processed. The data for this can be output by custom code outside of CUDA, except that the timestamps must be in sync with the profiler’s output.

The last two suggestions might be too difficult to implement, but they would be very helpful for analyzing codes involving tens of kernels in a multi-step algorithm, involving GPGPU, rendering, as well as purely CPU computations. This is the case in interactive simulation applications, which is the kind I’m currently interested in!
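The post-processing pipeline from suggestion 1 can be tried directly in a shell; the mangled name below is just an illustrative example, not taken from real profiler output:

```shell
# Demangle a C++ symbol name and strip the namespace.
# c++filt -p demangles and drops the argument list; sed removes "namespace::".
echo '_ZN6kernel7computeEv' | c++filt -p | sed 's/^.*:://'
```

Running this over a whole log file would be a matter of piping the file through the same `c++filt -p | sed` stage.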


Very nice tool.

Is there any webpage or PDF where I can find the meaning of it all? Choosing a block size of 16*16 results in all stores being coalesced, while other values don’t :D. I am pretty new to all this, so any “steps” that can make my code’s loads and stores more coalesced, which would result in small speed-ups, would be great.

Best regards . .

Look for the Supercomputing '07 slides. Coalescing is explained very nicely there.

Very nice tool indeed.

Yet, I got stuck at some point. I made some changes in the kernel, the new version, alone, runs fine, no problems.

But with this version, when I try to use the profiler, I get:

Error -94 in reading profiler output.
Minimum expected columns (method,gputime,cputime,occupancy) not found in profiler output file.

The session settings are:
max execution time, from 30 to 1000s
signal list: all signals enabled.
enable time stamp: not enabled.

The GPU takes about 100 ms to finish the program. Previously, only 40 ms were required. (I enlarged the search area, so this is normal.)

Regards, Bogdan

a) Do you see any other messages in the Visual Profiler output window?

b) Do all 3 program runs complete normally, or do you have to abort them, or do they get stopped after the timeout interval?

c) You could try running the same CUDA program from the command prompt after setting the profiler environment variables - CUDA_PROFILE, CUDA_PROFILE_LOG, CUDA_PROFILE_CSV and CUDA_PROFILE_CONFIG. Check that the profiler log is correct. Refer to the profiler document CUDA_Profiler_1.1.txt in the CUDA toolkit for details on these environment variables.
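A minimal sketch of step (c); the file names are examples, the environment variables are the ones named above, and the counters listed in the config file are from the counter table earlier in the thread:

```shell
# Enable the command-line CUDA profiler (file names here are examples).
export CUDA_PROFILE=1                        # turn profiling on
export CUDA_PROFILE_CSV=1                    # write CSV, importable later
export CUDA_PROFILE_LOG=cuda_profile.log     # output log file
export CUDA_PROFILE_CONFIG=profile_config.txt

# The config file lists the counters to collect, one per line.
cat > profile_config.txt <<'EOF'
gld_incoherent
gst_incoherent
warp_serialize
EOF

# Then run the CUDA program as usual, e.g.:
# ./example.exe
```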

The complete messages are below, so they get stopped after the timeout interval. But the same thing happened with the older version of the kernel. And even when I increased the max timeout interval to 1000 s, it didn’t finish:

=== Start profiling for session ‘Session6’ ===

Start program ‘D:/…_2_f/Release-vc7/example.exe’ run #1

Program run #1 was aborted after maximum program execution time duration of 30 seconds.

Start program ‘D:/…_2_f/Release-vc7/example.exe’ run #2

Program run #2 was aborted after maximum program execution time duration of 30 seconds.

Start program ‘D:/…_2_f/Release-vc7/example.exe’ run #3

Program run #3 was aborted after maximum program execution time duration of 30 seconds.

Error -94 in reading profiler output.

Minimum expected columns (method,gputime,cputime,occupancy) not found in profiler output file.

Do your programs stop in time when running from commandline?
Or do you need to press a key to end the program?


I didn’t think about it. Now it works. I wonder why it worked with the previous kernel.

Thanks, I spent some time with it.

All the best!