Preview of the NVIDIA Visual Profiler

Here’s an initial download of the CUDA Visual Profiler.
See the full README below.
For Linux: tar xvfz CudaVisualProfiler_0.1_beta_linux.tar.gz
For Windows: extract the contents of the zip archive.

NVIDIA CUDA Visual Profiler
Version 0.1 Beta

Published by
NVIDIA Corporation
2701 San Tomas Expressway
Santa Clara, CA 95050




Information furnished is believed to be accurate and reliable. However,
NVIDIA Corporation assumes no responsibility for the consequences of use
of such information or for any infringement of patents or other rights
of third parties that may result from its use. No license is granted by
implication or otherwise under any patent or patent rights of NVIDIA
Corporation. Specifications mentioned in this publication are subject
to change without notice. These materials supersede and replace all
information previously supplied. NVIDIA Corporation products are not
authorized for use as critical components in life support devices or
systems without express written approval of NVIDIA Corporation.


NVIDIA, CUDA, and the NVIDIA logo are trademarks or registered trademarks
of NVIDIA Corporation in the United States and other countries. Other
company and product names may be trademarks of the respective companies
with which they are associated.


© 2007-2008 by NVIDIA Corporation. All rights reserved.


  • Execute a CUDA program with profiling enabled and view the profiler output
    as a table. The table has the following columns for each GPU method:
    . timestamp : start time stamp
    . method : GPU method name; this is either “memcopy” for memory copies
      or the name of a GPU kernel
    . GPU Time
    . CPU Time
    . Profiler counters:
      gld_incoherent : number of non-coalesced global memory loads
      gld_coherent : number of coalesced global memory loads
      gst_incoherent : number of non-coalesced global memory stores
      gst_coherent : number of coalesced global memory stores
      local_load : number of local memory loads
      local_store : number of local memory stores
      branch : number of branch events (instruction and/or sync stack)
      divergent_branch : number of divergent branches within a warp
      instructions : number of dynamic instructions (counted at fetch)
      warp_serialize : number of thread warps that serialize based on
      address conflicts to shared (GRF) or constant memory
      cta_launched : number of CTAs launched on the PM TPC

       Please refer to the "Interpreting Profiler Counters" section below for
       more information on profiler counters.
       Note that profiler counters are also referred to as profiler signals.
  • Display the summary profiler table. It has the following columns for each
    GPU method:
    . method name
    . number of calls
    . total GPU time
    . percentage of GPU time
    . total counts for each profiler counter

  • Display various kinds of plots:

    • Summary profiling data bar plot
    • GPU Time Height plot
    • GPU Time Width plot
    • Profiler counter bar plot
    • Profiler output table column bar plot
  • Analysis of profiler output - lists methods with a high number of:
    . incoherent stores
    . incoherent loads
    . warp serializations

  • Compare profiler output for multiple program runs of the same program
    or for different programs. Each program run is referred to as a session.

  • Save profiling data for multiple sessions. A group of sessions is referred
    to as a project.

  • Import/Export CUDA Profiler CSV format data


  • Summary profiling data bar plot
    . One bar for each method
    . Bars sorted in decreasing GPU time
    . Bar length is proportional to the cumulative GPU time for a method

  • GPU Time Height Plot:
    It is a bar diagram in which the height of each bar is proportional
    to the GPU time for a method and a different bar color is assigned for
    each method. A legend is displayed which shows the color assignment for
    different methods. The width of each bar is fixed and the bars are displayed
    in the order in which the methods are executed.

    When the “fit in window” option is enabled the display is adjusted so as
    to fit all the bars in the displayed window width. In this case bars for
    multiple methods can overlap. The overlapped bars are displayed in decreasing
    order of height so that all the different bars are visible.

    When the “Show CPU Time” option is enabled the CPU time is shown as a
    bar in a different color on top of the GPU time bar. The height of this
    bar is proportional to the difference of CPU time and GPU time for the method.

  • GPU Time Width Plot:
    It is a bar diagram in which the width of each bar is proportional
    to the GPU time for a method and a different bar color is assigned for
    each method. A legend is displayed which shows the color assignment for
    different methods. The bars are displayed in the order in which the
    methods are executed. When time stamps are enabled the bars are positioned
    based on the time stamp.
    The height of each bar is based on the option chosen:
    a) fixed height : height is fixed.
    b) height proportional to instruction issue rate : the instruction issue
    rate for a method is equal to the profiler “instructions” counter value
    divided by the GPU time for the method.
    c) height proportional to incoherent load + store rate : the incoherent
    load + store rate for a method is equal to the sum of the profiler
    “gld_incoherent” and “gst_incoherent” counter values divided by the GPU
    time for the method.
  • Profiler counter bar plot
    It is a bar plot for profiler counter values for a method from the profiler
    output table or the summary table.
    . One bar for each profiler counter
    . Bars sorted in decreasing profiler counter value
    . Bar length is proportional to profiler counter value

  • Profiler output table column bar plot
    It is a bar plot for any column of values from the profiler output
    table or summary table
    . One bar for each row in the table
    . Bars sorted in decreasing column value
    . Bar length is proportional to column value
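The derived rates described above (for example, instruction issue rate = “instructions” counter value divided by GPU time) can also be computed by hand from a CSV export. A minimal sketch with awk, where the file contents and column layout are made up for illustration; check the header of your own export:

```shell
# Sketch: compute instruction issue rate (instructions / gputime) per method
# from a CSV export. The sample file below is illustrative, not real data.
cat > sample.csv <<'EOF'
method,gputime,instructions
kernelA,120.5,48000
kernelB,300.0,90000
EOF

# Skip the header row, then divide column 3 by column 2 for each method.
awk -F, 'NR > 1 { printf "%s %.1f\n", $1, $3 / $2 }' sample.csv
```

The same pattern works for the incoherent load + store rate by summing the relevant counter columns before dividing.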

Is it possible to profile CUDA code that is being called from MATLAB (by the CUDA-MATLAB plugin)? Otherwise I have to build quite a bit of infrastructure to be able to call my CUDA code with the right inputs from a C program…

Unfortunately not right now, unless you can generate an executable from the matlab code (I think matlab can do this, but I’m no expert).

I have already filed a feature request for profiling external programs as is supported by some CPU profilers.


Thanks for the preview version guys. It’s already a pretty useful tool. One feature I’d suggest for the Summary Table is some kind of % calculated for gld_, gst_ and branch/divergent_branch. Right now, looking at my table, it takes a lot of counting digits (or calculator punching) to tell if 48219562 divergent branches is a significant fraction of 531144446 total branches or not.
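Until such a percentage column exists, the fraction from the numbers in the post above can be computed quickly in a shell, e.g.:

```shell
# Divergent branches as a percentage of total branches
# (figures taken from the post above).
awk 'BEGIN { printf "%.1f%%\n", 100 * 48219562 / 531144446 }'
```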

Thanks for this suggestion. I’m filing a feature request.


Ahh thanks! MATLAB can indeed generate an executable if you have bought the right toolbox. But I am not sure whether that works with mex files (which are of course needed for using CUDA from MATLAB). I’ll check the docs; then I can compare the price of the toolbox to the cost of writing a C wrapper myself.

You can profile a mex file with the CUDA profiler, if you have a MATLAB script with no GUI.

Just follow these steps in the session setup.

  1. In Launch, specify “/usr/local/bin/matlab”
  2. In Working directory, select the directory of your MATLAB script
  3. In Arguments, specify: -nojvm -nosplash -r name_of_matlab_script (with no .m)

Start the profiler.

It worked fine for me under Linux.

Ahhh, that is great! I think the program I am speeding up does indeed have options to turn off all plotting. I’ll check with my colleague Friday. That saves me a lot of work indeed (reading .mat files in C was not something I was really looking forward to :D )


You can have plots; the session will run from the command line.
I am not sure it will work if you are passing parameters through a GUI.

You may want to add an explicit exit at the end of the script if the profiler needs multiple runs.

Thanks for this release guys. It’s pretty cool.

A few thoughts though.

  • I can’t run my CUDA kernels from the interface since they’re part of a much bigger piece of software and my kernels are gathered into a shared library. Fair enough. But it would be nice if we could import a .csv file directly when we open the software. At the moment I have to create an empty project (so I click on New project and then cancel the window that pops up) to get the import function available. It’s just a detail.

  • And why doesn’t the ‘Analyse signals’ function show the global CPU time? It would be nice to have that as well, to be able to average it (it’s actually the meaningful time to me).

RE: your first request, good idea, we’ll consider that.

RE: your second request, what exactly do you mean by “global” CPU time? Do you just mean the CPU time for each kernel launch?




Yes. What I mean by global is the total CPU time for all the calls, just as we can read the total GPU time. But maybe you can already configure what appears when you click on ‘Analyse signals’ and I didn’t find it. Actually I can see the following columns: method, calls, GPU time, %GPU time, divergent_branch. But I profile more than that, and CPU time, gld_incoherent, gst_incoherent and warp_serialize are not represented when I analyse the signals, for instance. Is that normal? I found settings only for plots, not for the summary table. It would be nice to have everything you profile show up in the summary table.

First, thanks for this tool, it is very useful!
It helped me optimize the performance of a kernel I was designing, as it showed that the float4 loads were all non-coalesced (which I didn’t know). By changing them to float2 loads and using shared memory to redistribute the data to the right threads, I was able to speed it up by more than 2X!

I have a few suggestions that I think would make it an even better tool:

1- When displaying C++ kernel names, it would be more readable to demangle them (see c++filt and the GCC function abi::__cxa_demangle), with an option to only keep the method’s name and discard the namespace and arguments. Note that this could be done under Linux by post-processing the profiler output file using “c++filt -p | sed ‘s/^.*:://’”.
2- H2D, D2H and D2D memcpy operations should be distinguished, and it would be interesting to compute how much bandwidth is used in each direction (although we can compute it without using the profiler).
3- In cases where the GPU is used by both CUDA and OpenGL, it would be very interesting to know how much time is spent in each to know the percentage taken by rendering versus GPGPU computations. If it is not possible, then maybe recording buffer map/unmap operations would be helpful.
4- To be able to analyse time spent on the CPU-side as well as on the GPU, it would be helpful to record custom events/methods, using the same timestamps as GPU kernel. What I would like is to be able to see a second bar in the time width plot, showing CPU-side computations, which would show CPU-based pre/post processing steps, or which phase/dataset is currently being processed. The data for this can be output by custom code outside of CUDA, except that the timestamps must be in sync with the profiler’s output.

The last two suggestions might be too difficult to implement, but they would be very helpful for analyzing codes involving tens of kernels in a multi-step algorithm, involving GPGPU, rendering, as well as purely CPU computations. This is the case in interactive simulation applications, which is the kind I’m currently interested in!
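The post-processing pipeline from suggestion 1 can be tried directly in a shell; the mangled name below is just an illustrative example, not taken from real profiler output:

```shell
# Demangle a C++ symbol name and strip the namespace.
# c++filt -p demangles and drops the argument list; sed removes "namespace::".
echo '_ZN6kernel7computeEv' | c++filt -p | sed 's/^.*:://'
```

Running this over a whole log file would be a matter of piping the file through the same `c++filt -p | sed` stage.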


Very nice tool.

Is there any webpage or PDF where I can find the meaning of it all? Choosing a block size of 16*16 results in all stores being coalesced, while other values don’t :D. I am pretty new to all this, so any “steps” that can make my code’s loads and stores more coalesced, which would result in small speed-ups, would be great.

Best regards . .

Look for the Supercomputing '07 slides. Coalescing is explained very nicely there.

Very nice tool indeed.

Yet, I got stuck at some point. I made some changes in the kernel, the new version, alone, runs fine, no problems.

But with this version, when I try to use the profiler, I get:

Error -94 in reading profiler output.
Minimum expected columns (method,gputime,cputime,occupancy) not found in profiler output file.

The session settings are:
max execution time, from 30 to 1000s
signal list: all signals enabled.
enable time stamp: not enabled.

The GPU takes about 100 ms to finish the program. Previously, only 40 ms were required. (I enlarged the search area, so this is normal.)

Regards, Bogdan

a) Do you see any other messages in the Visual Profiler output window?

b) Do all 3 program runs complete normally, or do you have to abort them, or do they get stopped after the timeout interval?

c) You could try running the same CUDA program from the command prompt after setting the profiler environment variables - CUDA_PROFILE, CUDA_PROFILE_LOG, CUDA_PROFILE_CSV and CUDA_PROFILE_CONFIG. Check that the profiler log is correct. Refer to the profiler document CUDA_Profiler_1.1.txt in the CUDA toolkit for details on these environment variables.
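A minimal sketch of step (c); the file names are examples, the environment variables are the ones named above, and the counters listed in the config file are from the counter table earlier in the thread:

```shell
# Enable the command-line CUDA profiler (file names here are examples).
export CUDA_PROFILE=1                        # turn profiling on
export CUDA_PROFILE_CSV=1                    # write CSV, importable later
export CUDA_PROFILE_LOG=cuda_profile.log     # output log file
export CUDA_PROFILE_CONFIG=profile_config.txt

# The config file lists the counters to collect, one per line.
cat > profile_config.txt <<'EOF'
gld_incoherent
gst_incoherent
warp_serialize
EOF

# Then run the CUDA program as usual, e.g.:
# ./example.exe
```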

The complete messages are below, so they get stopped after the timeout interval. But the same thing happened with the older version of the kernel. And even when I increased the max timeout interval to 1000 s, it didn’t finish:

=== Start profiling for session ‘Session6’ ===

Start program ‘D:/…_2_f/Release-vc7/example.exe’ run #1

Program run #1 was aborted after maximum program execution time duration of 30 seconds.

Start program ‘D:/…_2_f/Release-vc7/example.exe’ run #2

Program run #2 was aborted after maximum program execution time duration of 30 seconds.

Start program ‘D:/…_2_f/Release-vc7/example.exe’ run #3

Program run #3 was aborted after maximum program execution time duration of 30 seconds.

Error -94 in reading profiler output.

Minimum expected columns (method,gputime,cputime,occupancy) not found in profiler output file.

Do your programs stop in time when running from commandline?
Or do you need to press a key to end the program?


I didn’t think about it. Now it works. I wonder why it worked with the previous kernel.

Thanks, I spent some time with it.

All the best!