CUDA Pro Tip: nvprof is Your Handy Universal GPU Profiler

Originally published at: https://developer.nvidia.com/blog/cuda-pro-tip-nvprof-your-handy-universal-gpu-profiler/

CUDA 5 added a powerful new tool to the CUDA Toolkit: nvprof. nvprof is a command-line profiler available for Linux, Windows, and OS X. At first glance, nvprof seems to be just a GUI-less version of the graphical profiling features available in the NVIDIA Visual Profiler and Nsight Eclipse Edition. But nvprof is much more than…

Hello and thanks for the CUDA posts.

I wanted to ask you: I ran nvprof on the saxpy example (http://devblogs.nvidia.com/...) and it reports 7 registers. When I use --ptxas-options=-v it reports 3 registers.
Can you explain that to me?

Thank you!

Hi George, thanks for your comment. I think I need more details to help. What exact command line are you using to compile? And what GPU / compute capability are you running on?

Hello,

I am running on a GPU with compute capability 2.1.

I used the commands

nvprof --print-gpu-trace ./run --benchmark -i=1

and

nvcc -o run test.cu --ptxas-options=-v

You are using the default architecture, which is sm_10. On sm_10, the code uses 3 registers. But your binary also includes PTX, which is JITed at load time to sm_21 when you run on your CC 2.1 GPU. See this pro tip: http://devblogs.nvidia.com/...

sm_21 requires more registers for the same code (but also has a larger register file).
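For example, a hedged sketch of a fat-binary build (flags as described in the fat binaries pro tip; the file name saxpy.cu matches the example below):

```shell
# Build native SASS for sm_21 plus embedded PTX (compute_20) that the
# driver can JIT-compile for newer GPUs at load time.
nvcc -gencode arch=compute_20,code=sm_21 \
     -gencode arch=compute_20,code=compute_20 \
     --ptxas-options=-v -o run saxpy.cu
```

With an explicit -gencode for your GPU, ptxas reports the register count for the code that will actually run, so it should line up better with what nvprof shows.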

When I run this:

nvcc -arch=sm_21 -o run saxpy.cu --ptxas-options=-v

I see this output:

c:\src\test>nvcc -arch=sm_21 -o run saxpy.cu --ptxas-options=-v
ptxas : info : 0 bytes gmem
ptxas : info : Compiling entry function '_Z5saxpyifPfS_' for 'sm_21'
ptxas : info : Function properties for _Z5saxpyifPfS_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 6 registers, 56 bytes cmem[0]
Creating library run.lib and object run.exp

So 6 registers. However, running in nvprof still shows 7 registers. I'm not sure about the cause of this discrepancy but I will file a bug! Thanks!

OK! Same output here.
So I must always use sm_21 (for compute capability 2.1)?
And should --ptxas-options=-v and nvprof always give the same results?

Thank you!

You don't have to explicitly specify the arch version (sm_21), but if you want full control over what code is generated you might want to. I recommend you read my post linked above about fat binaries and JIT linking.

As I wrote I think the profiler *should* match the ptxas output, so I have filed an issue internally to figure that out.

OK, thank you!

I got the answer. To support profiling (for example of concurrent kernels), the profiler has to patch kernel code with some additional instructions, sometimes consuming extra registers. So in this case it uses an extra register. You can verify this by running

nvprof --print-gpu-trace --concurrent-kernels-off ./run

This disables profiling of concurrent kernels (not needed for this app), and you will see the register count drop to 6.

OK! Thanks for the tip!

Hello, I have a question about the CSV file.

When I run nvprof --csv my.exe on Windows,

I can't find my CSV file in the tmp folder.

Can I set the path for my CSV file?

How can I do it?
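One hedged suggestion (assuming a reasonably recent nvprof): the --log-file option redirects the profiler's output, including CSV, to a path of your choosing:

```shell
# Send the CSV profile to a chosen file instead of the default location;
# %p in the file name expands to the process ID.
nvprof --csv --log-file profile_%p.csv my.exe
```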

Hello
I see that nvprof can even profile the number of flop in the kernel (using the parameters as below). Also when I browse through the documentation (here http://docs.nvidia.com/cuda... it says flop_count_sp is "Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special). Each multiply-accumulate operation contributes 2 to the count."
However when I run, the result of flop_count_sp (which is supposed to be flop_count_sp_add + flop_count_sp_mul + flop_count_sp_special + 2 * flop_count_sp_fma) but in my case I find that it does not include in the summation the value of "flop_count_sp_special".
Could you suggest me what I am supposed to use? Should I add this value to the sum of flop_count_sp or I should consider the formula does not include the value of "flop_count_sp_special"?
Also could you please tell me what are these special operations?

nvprof --metrics flops_sp --metrics flops_sp_add --metrics flops_sp_mul --metrics flops_sp_fma --metrics flops_sp_special myKernel args

Here myKernel is my CUDA application, which takes some input arguments given by args.

Many Thanks in advance

Special functions are things like sin(), cos(), exp(), etc. If you are not using these in your code, then that explains why they would not be included. If you do use them, then I'm not sure what is wrong, but would need to see an example code that demonstrates the problem (you can share via GitHub Gist if you like).
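To illustrate, here is a hypothetical kernel annotated with the counter each operation should feed (actual counts depend on the code the compiler generates, e.g. whether it contracts a separate add and multiply into an FMA):

```cuda
// Per thread (for i < n): 1 add, 1 mul, 1 FMA (counts as 2), 1 special,
// so flop_count_sp should be 1 + 1 + 2 + 1 = 5 per active thread.
__global__ void flops_demo(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = x[i] + 1.0f;        // flop_count_sp_add
        float m = a * 2.0f;           // flop_count_sp_mul
        float f = fmaf(a, m, 3.0f);   // flop_count_sp_fma (contributes 2)
        y[i] = __sinf(f);             // flop_count_sp_special (special unit)
    }
}
```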

Thanks for clearing that up. Indeed, I do not have these special functions in my kernel.
Could you please tell me what other functions, apart from sin(), cos(), etc., could contribute to "flop_count_sp_special"?

As per your suggestion, I have shared my code in a Mercurial repository on Bitbucket.org
(https://bitbucket.org/rajgu...).

Thanks in advance.

Basically these: http://docs.nvidia.com/cuda...

Thank you, that was eye opening.

Can we output register values to the console from PTX assembly? Furthermore, is there any way to get register addresses in PTX assembly?
Any help will be appreciated.
Thanks

I don't follow your questions. Can you clarify?

Thanks for your reply, Mark. I am currently working on a research project in which I need to access register values and their addresses in order to recognize their access patterns in CUDA. That's why I need to print these values to the console or write them to a file. In short, is there any way to use a print or write command in PTX assembly?
Thank You

This isn't something the profiler supports. You can use inline PTX in the kernel code to write the values of specific registers to memory. Registers don't have "addresses", but you could write the register names with their values, since the names are explicit in your inline PTX. http://docs.nvidia.com/cuda...
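As a sketch (hypothetical names; see the inline PTX documentation linked above), you can route a register's value out to global memory with an explicit PTX mov, and you know which virtual register it was because you name it in the asm statement:

```cuda
// Hypothetical example: capture a value through an explicit PTX register
// move and store it to global memory for inspection from the host.
__device__ float captured;

__global__ void capture_reg(float x)
{
    float v = x * 2.0f;  // v is held in some hardware register
    float out;
    // mov.f32 copies v's register into out's register; the virtual
    // register names are visible if you compile with nvcc -ptx.
    asm("mov.f32 %0, %1;" : "=f"(out) : "f"(v));
    captured = out;      // readable afterwards via cudaMemcpyFromSymbol
}
```

Compiling with nvcc -ptx and reading the generated .ptx file also shows all the virtual register names that ptxas later maps onto hardware registers.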