Here’s an initial download of the CUDA Visual Profiler
See the full README.
Linux: tar xvfz CudaVisualProfiler_0.1_beta_linux.tar.gz
Windows: extract the contents of CudaVisualProfiler_0.1_beta_windows.zip
NVIDIA CUDA Visual Profiler
Version 0.1 Beta
Published by
NVIDIA Corporation
2701 San Tomas Expressway
Santa Clara, CA 95050
Notice
BY DOWNLOADING THIS FILE, USER AGREES TO THE FOLLOWING:
ALL NVIDIA SOFTWARE, DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES,
DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY,
“MATERIALS”) ARE BEING PROVIDED “AS IS”. NVIDIA MAKES NO WARRANTIES,
EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS,
AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT,
MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to be accurate and reliable. However,
NVIDIA Corporation assumes no responsibility for the consequences of use
of such information or for any infringement of patents or other rights
of third parties that may result from its use. No license is granted by
implication or otherwise under any patent or patent rights of NVIDIA
Corporation. Specifications mentioned in this publication are subject
to change without notice. These materials supersedes and replaces all
information previously supplied. NVIDIA Corporation products are not
authorized for use as critical components in life support devices or
systems without express written approval of NVIDIA Corporation.
Trademarks
NVIDIA, CUDA, and the NVIDIA logo are trademarks or registered trademarks
of NVIDIA Corporation in the United States and other countries. Other
company and product names may be trademarks of the respective companies
with which they are associated.
Execute a CUDA program with profiling enabled and view the profiler output
as a table. The table has the following columns for each GPU method:
timestamp: Start time stamp
method: GPU method name. This is either “memcopy” for memory copies
or the name of a GPU kernel.
GPU Time
CPU Time
Occupancy
Profiler counters:
gld_incoherent : Number of non-coalesced global memory loads
gld_coherent : Number of coalesced global memory loads
gst_incoherent : Number of non-coalesced global memory stores
gst_coherent : Number of coalesced global memory stores
local_load : Number of local memory loads
local_store : Number of local memory stores
branch : Number of branch events (instruction and/or sync stack)
divergent_branch : Number of divergent branches within a warp
instructions : Number of dynamic instructions (in fetch)
warp_serialize : Number of threads in a warp that serialize based
on address (GRF or constant)
cta_launched : Number of CTAs launched on the PM TPC
Please refer to the "Interpreting Profiler Counters" section below for
more information on profiler counters.
Note that profiler counters are also referred to as profiler signals.
Display the summary profiler table. It has the following columns for each
GPU method:
method name
number of calls
total GPU time
percentage of GPU time
Total counts for each profiler counter.
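As a rough illustration of how such a summary can be derived from the per-call profiler table, here is a minimal Python sketch. The row layout (a dict per call with a "method" key, a "gputime" key and optional counter keys) and the function name summarize are assumptions made for this example, not the tool's actual data model.

```python
from collections import defaultdict

def summarize(rows):
    """Aggregate per-call rows into one summary row per GPU method.

    Each row is assumed to be a dict with a "method" name, a "gputime"
    value and any number of numeric profiler counters.
    """
    calls = defaultdict(int)
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        method = row["method"]
        calls[method] += 1
        for key, value in row.items():
            if key != "method":
                totals[method][key] += value  # sums gputime and every counter

    total_gpu = sum(t["gputime"] for t in totals.values())
    summary = []
    for method, t in totals.items():
        entry = {"method": method, "calls": calls[method]}
        entry.update(t)
        entry["pct_gputime"] = 100.0 * t["gputime"] / total_gpu if total_gpu else 0.0
        summary.append(entry)
    # Sorting by decreasing GPU time is a choice made here, not a documented rule.
    summary.sort(key=lambda e: e["gputime"], reverse=True)
    return summary

# Hypothetical sample data, not real profiler output.
rows = [
    {"method": "memcopy", "gputime": 10.0},
    {"method": "kernelA", "gputime": 30.0, "gld_incoherent": 5},
    {"method": "kernelA", "gputime": 60.0, "gld_incoherent": 7},
]
for entry in summarize(rows):
    print(entry)
```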
Display various kinds of plots:
Summary profiling data bar plot
GPU Time Height plot
GPU Time Width plot
Profiler counter bar plot
Profiler output table column bar plot
Analysis of profiler output - lists methods with a high number of:
. incoherent stores
. incoherent loads
. warp serializations
Compare profiler output for multiple program runs of the same program
or for different programs. Each program run is referred to as a session.
Save profiling data for multiple sessions. A group of sessions is referred
to as a project.
Import/Export CUDA Profiler CSV format data
DESCRIPTION OF DIFFERENT PLOTS:
Summary profiling data bar plot
. One bar for each method
. Bars sorted in decreasing GPU time
. Bar length is proportional to cumulative GPU time for a method
GPU Time Height Plot:
It is a bar diagram in which the height of each bar is proportional
to the GPU time for a method and a different bar color is assigned for
each method. A legend is displayed which shows the color assignment for
different methods. The width of each bar is fixed and the bars are displayed
in the order in which the methods are executed.
When the “fit in window” option is enabled the display is adjusted so as
to fit all the bars in the displayed window width. In this case bars for
multiple methods can overlap. The overlapped bars are displayed in decreasing
order of height so that all the different bars are visible.
When the “Show CPU Time” option is enabled the CPU time is shown as a
bar in a different color on top of the GPU time bar. The height of this
bar is proportional to the difference of CPU time and GPU time for the method.
GPU Time Width Plot:
It is a bar diagram in which the width of each bar is proportional
to the GPU time for a method and a different bar color is assigned for
each method. A legend is displayed which shows the color assignment for
different methods. The bars are displayed in the order in which the
methods are executed. When time stamps are enabled the bars are positioned
based on the time stamp.
The height of each bar is based on the option chosen:
a) fixed height : height is fixed.
b) height proportional to instruction issue rate: the instruction issue
rate for a method is equal to profiler “instructions” counter
value divided by the gpu time for the method.
c) height proportional to incoherent load + store rate: the incoherent load
store rate for a method is equal to the sum of profiler
“gld_incoherent” and “gst_incoherent” counter values divided by the
gpu time for the method.
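The two rates used for options b) and c) are simple ratios. A minimal Python sketch follows; the function names are made up for this example, and the result is expressed per unit of whatever time unit the profiler reports for GPU time.

```python
def instruction_issue_rate(instructions, gputime):
    """Profiler "instructions" counter divided by the GPU time of the method."""
    return instructions / gputime if gputime else 0.0

def incoherent_load_store_rate(gld_incoherent, gst_incoherent, gputime):
    """Sum of incoherent loads and stores divided by the GPU time of the method."""
    return (gld_incoherent + gst_incoherent) / gputime if gputime else 0.0

# Hypothetical counter values for one method.
print(instruction_issue_rate(1200, 40.0))           # 30.0
print(incoherent_load_store_rate(100, 20, 40.0))    # 3.0
```

Returning 0.0 for a zero GPU time is a choice made here to avoid a division error; the tool may handle that case differently.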
Profiler counter bar plot
It is a bar plot for profiler counter values for a method from the profiler
output table or the summary table.
. One bar for each profiler counter
. Bars sorted in decreasing profiler counter value
. Bar length is proportional to profiler counter value
Profiler output table column bar plot
It is a bar plot for any column of values from the profiler output
table or summary table
. One bar for each row in the table
. Bars sorted in decreasing column value
. Bar length is proportional to column value
Is it possible to profile CUDA code that is called from MATLAB (via the CUDA-MATLAB plugin)? Otherwise I have to build quite a bit of infrastructure to be able to call my CUDA code with the right inputs from a C program…
Thanks for the preview version guys. It’s already a pretty useful tool. One feature I’d suggest for the Summary Table is some kind of % calculated for gld_, gst_ and branch/divergent_branch. Right now, looking at my table, it takes a lot of counting digits (or calculator punching) to tell if 48219562 divergent branches is a significant fraction of 531144446 total branches or not.
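Until something like that is built in, the ratios can be computed from exported summary values with a few lines of Python. This is only a sketch; the helper name pct is made up, and the numbers are the ones quoted in the post above.

```python
def pct(part, whole):
    """Return part as a percentage of whole (0.0 if whole is zero)."""
    return 100.0 * part / whole if whole else 0.0

divergent_branch = 48219562
branch = 531144446
print(f"divergent branches: {pct(divergent_branch, branch):.2f}% of all branches")
```

With these numbers the divergent branches come out at roughly 9% of the total. The same helper works for the load and store counters, for example pct(gld_incoherent, gld_incoherent + gld_coherent).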
Ahh thanks! MATLAB can indeed generate an executable if you have bought the right toolbox. But I am not sure if that works with MEX files (which are of course needed for using CUDA from MATLAB). I’ll check the docs to see if I have to compare the price of the toolbox to the price of me writing a C wrapper.
Ahhh, that is great! I think the program I am speeding up does indeed have options to turn off all plotting. I’ll check with my colleague on Friday. That saves me a lot of work (reading .mat files in C was not something I was really looking forward to :D )
I can’t run my CUDA kernels from the interface since they are part of a much bigger piece of software and are gathered into a shared library. Fair enough. But it would be nice if we could import a .csv file directly when we open the software. At the moment I have to create an empty project (I click on New project and then cancel the window that pops up) to make the import function available. It’s just a detail.
And why doesn’t the ‘Analyse signals’ function show the global CPU time? It would be nice to have it as well, to be able to average it (it’s actually the meaningful time to me).
Yes. What I mean by global is the total CPU time for all the calls, in the same way that we can read the total GPU time. But maybe you can already configure what appears when you click on ‘Analyse signals’ and I didn’t find it. Right now I see the following columns: method, calls, GPU time, %GPU time, divergent_branch. But I profile more than that, and CPU time, gld_incoherent, gst_incoherent and warp_serialize, for instance, do not show up when I analyse the signals. Is that normal? I found settings only for plots, not for the summary table. It would be nice to have everything you profile show up in the summary table.
First thanks for this tool, it is very useful !
It helped me optimize the performance of a kernel I was designing, as it showed that the float4 loads were all non-coalesced (which I didn’t know). By changing them to float2 loads and using shared memory to redistribute the data to the right thread, I was able to speed it up by more than 2X !
I have a few suggestions that I think would make it an even better tool:
1- When displaying C++ kernel names, it would be more readable to demangle them (see c++filt and the GCC function abi::__cxa_demangle), with an option to keep only the method’s name and discard the namespace and arguments. Note that this could be done under Linux by post-processing the profiler output file using "c++filt -p | sed 's/^.*:://'".
2- H2D, D2H and D2D memcpy operations should be distinguished, and it would be interesting to compute how much bandwidth is used in each direction (although we can compute it without using the profiler).
3- In cases where the GPU is used by both CUDA and OpenGL, it would be very interesting to know how much time is spent in each to know the percentage taken by rendering versus GPGPU computations. If it is not possible, then maybe recording buffer map/unmap operations would be helpful.
4- To be able to analyse time spent on the CPU side as well as on the GPU, it would be helpful to record custom events/methods, using the same timestamps as the GPU kernels. What I would like is to be able to see a second bar in the time width plot, showing CPU-side computations, which would show CPU-based pre/post-processing steps, or which phase/dataset is currently being processed. The data for this can be output by custom code outside of CUDA, except that the timestamps must be in sync with the profiler’s output.
The last two suggestions might be too difficult to implement, but would be very helpful for analyzing codes involving tens of kernels in a multi-step algorithm that combines GPGPU, rendering and purely CPU computations. This is the case in interactive simulation applications, which is the kind I’m currently interested in!
Is there any webpage or PDF where I can find the meaning of it all? Choosing a block size of 16*16 results in all stores being coalesced, and other values don’t :D. I am pretty new to all this, so any “steps” that would make my code’s loads/writes more coalesced, and so give small speed-ups, would be great.
a) Do you see any other messages in the Visual Profiler output window?
b) Do all the 3 program runs complete normally - or do you have to abort them or do they get stopped after the timeout interval?
c) You could try running the same CUDA program from the command prompt after setting the profiler environment variables - CUDA_PROFILE, CUDA_PROFILE_LOG, CUDA_PROFILE_CSV and CUDA_PROFILE_CONFIG. Check that the profiler log is correct. Refer to the profiler document CUDA_Profiler_1.1.txt in the CUDA toolkit for details on these environment variables.
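For step c), a small Python sketch of launching the program with the four variables set is shown below. The values used ("1" to enable profiling and CSV output, file paths for the log and the config) are my understanding of the profiler document and should be checked against CUDA_Profiler_1.1.txt; the helper names, file names and executable path are hypothetical.

```python
import os
import subprocess

def profiler_env(log_path, config_path=None, csv=True, base=None):
    """Return a copy of the environment with the profiler variables set.

    Values are assumptions: "1" enables, paths name the log and config files.
    """
    env = dict(os.environ if base is None else base)
    env["CUDA_PROFILE"] = "1"
    env["CUDA_PROFILE_LOG"] = log_path
    env["CUDA_PROFILE_CSV"] = "1" if csv else "0"
    if config_path is not None:
        env["CUDA_PROFILE_CONFIG"] = config_path
    return env

def run_profiled(command, log_path, config_path=None, timeout=None):
    """Run the program once with profiling enabled and return its exit code."""
    env = profiler_env(log_path, config_path)
    return subprocess.run(command, env=env, timeout=timeout).returncode

# Example call (hypothetical executable and file names):
# run_profiled(["./example"], "profile_log.csv", "profile_config.txt", timeout=30)
```

If the program also fails to finish when run this way, the problem lies in the program itself rather than in the Visual Profiler.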
The complete messages are below; the runs get stopped after the timeout interval. The same thing happened with the older version of the kernel. And even when I increased the max timeout interval to 1000 s, it didn’t finish:
=== Start profiling for session ‘Session6’ ===
Start program ‘D:/…_2_f/Release-vc7/example.exe’ run #1 …
Program run #1 was aborted after maximum program execution time duration of 30 seconds.
Start program ‘D:/…_2_f/Release-vc7/example.exe’ run #2 …
Program run #2 was aborted after maximum program execution time duration of 30 seconds.
Start program ‘D:/…_2_f/Release-vc7/example.exe’ run #3 …
Program run #3 was aborted after maximum program execution time duration of 30 seconds.
Error -94 in reading profiler output.
Minimum expected columns (method,gputime,cputime,occupancy) not found in profiler output file.
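One plausible reading of this error is that the aborted runs left an empty or truncated profiler log. A quick way to check a log by hand is sketched below in Python. It assumes the CSV log may start with comment lines beginning with "#" and that the first remaining line is the column header; both points are assumptions about the file layout, and the function name is made up.

```python
import csv

REQUIRED = ("method", "gputime", "cputime", "occupancy")

def missing_columns(log_text):
    """Return the required columns that are absent from the log's header line."""
    lines = [ln for ln in log_text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    if not lines:
        return list(REQUIRED)  # empty log: every column is missing
    header = [name.strip() for name in next(csv.reader([lines[0]]))]
    return [name for name in REQUIRED if name not in header]

# Hypothetical log contents, not real profiler output.
good = "# hypothetical comment line\nmethod,gputime,cputime,occupancy\nmemcopy,1.0,2.0,0\n"
print(missing_columns(good))   # []
print(missing_columns(""))     # ['method', 'gputime', 'cputime', 'occupancy']
```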