CUDA Pro Tip: nvprof is Your Handy Universal GPU Profiler

Originally published at: https://developer.nvidia.com/blog/cuda-pro-tip-nvprof-your-handy-universal-gpu-profiler/

CUDA 5 added a powerful new tool to the CUDA Toolkit: nvprof. nvprof is a command-line profiler available for Linux, Windows, and OS X. At first glance, nvprof seems to be just a GUI-less version of the graphical profiling features available in the NVIDIA Visual Profiler and Nsight Eclipse Edition. But nvprof is much more than…

Hello and thanks for the CUDA posts.

I wanted to ask you: I ran nvprof on the saxpy example (http://devblogs.nvidia.com/...) and it reports 7 registers. When I use --ptxas-options=-v it reports 3 registers.
Can you explain that to me?

Thank you!

Hi George, thanks for your comment. I think I need more details to help. What exact command line are you using to compile? And what GPU / compute capability are you running on?

Hello,

I am running on a GPU with compute capability 2.1.

I used the commands

nvprof --print-gpu-trace ./run --benchmark -i=1

and

nvcc -o run test.cu --ptxas-options=-v

You are using the default architecture, which is sm_10. On sm_10, the code uses 3 registers. But your binary also includes PTX, which is JITed at load time to sm_21 when you run on your CC 2.1 GPU. See this pro tip: http://devblogs.nvidia.com/...

sm_21 requires more registers for the same code (but also has a larger register file).
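For example, a hedged sketch of a fat-binary build (flags as described in the fat binaries pro tip; the file name saxpy.cu matches the example below):

```shell
# Build native SASS for sm_21 plus embedded PTX (compute_20) that the
# driver can JIT-compile for newer GPUs at load time.
nvcc -gencode arch=compute_20,code=sm_21 \
     -gencode arch=compute_20,code=compute_20 \
     --ptxas-options=-v -o run saxpy.cu
```

With an explicit -gencode for your GPU, ptxas reports the register count for the code that will actually run, so it should line up better with what nvprof shows.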

When I run this:

nvcc -arch=sm_21 -o run saxpy.cu --ptxas-options=-v

I see this output:

c:\src\test>nvcc -arch=sm_21 -o run saxpy.cu --ptxas-options=-v
ptxas : info : 0 bytes gmem
ptxas : info : Compiling entry function '_Z5saxpyifPfS_' for 'sm_21'
ptxas : info : Function properties for _Z5saxpyifPfS_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 6 registers, 56 bytes cmem[0]
Creating library run.lib and object run.exp

So 6 registers. However, running in nvprof still shows 7 registers. I'm not sure about the cause of this discrepancy but I will file a bug! Thanks!

OK! Same output here.
So I must always use sm_21 (for compute capability 2.1)?
And should --ptxas-options=-v and nvprof always give the same results?

Thank you!

You don't have to explicitly specify the arch version (sm_21), but if you want full control over what code is generated you might want to. I recommend you read my post linked above about fat binaries and JIT linking.

As I wrote I think the profiler *should* match the ptxas output, so I have filed an issue internally to figure that out.

OK, thank you!

I got the answer. To support profiling (for example of concurrent kernels), the profiler has to patch kernel code with some additional instructions, sometimes consuming extra registers. So in this case it uses an extra register. You can verify this by running

nvprof --print-gpu-trace --concurrent-kernels-off ./run

This disables profiling of concurrent kernels (not needed for this app), and you will see the register count drop to 6.

OK! Thanks for the tip!

Hello, I have a question about the CSV file.

When I run nvprof --csv my.exe on Windows,

I can't find my CSV file in the tmp folder.

Can I set the path for my CSV file?

How can I do it?
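One hedged suggestion (assuming a reasonably recent nvprof): the --log-file option redirects the profiler's output, including CSV, to a path of your choosing:

```shell
# Send the CSV profile to a chosen file instead of the default location;
# %p in the file name expands to the process ID.
nvprof --csv --log-file profile_%p.csv my.exe
```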

Hello
I see that nvprof can even profile the number of flop in the kernel (using the parameters as below). Also when I browse through the documentation (here http://docs.nvidia.com/cuda... it says flop_count_sp is "Number of single-precision floating-point operations executed by non-predicated threads (add, multiply, multiply-accumulate and special). Each multiply-accumulate operation contributes 2 to the count."
However when I run, the result of flop_count_sp (which is supposed to be flop_count_sp_add + flop_count_sp_mul + flop_count_sp_special + 2 * flop_count_sp_fma) but in my case I find that it does not include in the summation the value of "flop_count_sp_special".
Could you suggest me what I am supposed to use? Should I add this value to the sum of flop_count_sp or I should consider the formula does not include the value of "flop_count_sp_special"?
Also could you please tell me what are these special operations?

nvprof --metrics flops_sp --metrics flops_sp_add --metrics flops_sp_mul --metrics flops_sp_fma --metrics flops_sp_special myKernel args

Here myKernel is my CUDA application, which takes some input arguments given by args.

Many Thanks in advance

Special functions are things like sin(), cos(), exp(), etc. If you are not using these in your code, then that explains why they would not be included. If you do use them, then I'm not sure what is wrong, but would need to see an example code that demonstrates the problem (you can share via GitHub Gist if you like).
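To illustrate, here is a hypothetical kernel annotated with the counter each operation should feed (actual counts depend on the code the compiler generates, e.g. whether it contracts a separate add and multiply into an FMA):

```cuda
// Per thread (for i < n): 1 add, 1 mul, 1 FMA (counts as 2), 1 special,
// so flop_count_sp should be 1 + 1 + 2 + 1 = 5 per active thread.
__global__ void flops_demo(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = x[i] + 1.0f;        // flop_count_sp_add
        float m = a * 2.0f;           // flop_count_sp_mul
        float f = fmaf(a, m, 3.0f);   // flop_count_sp_fma (contributes 2)
        y[i] = __sinf(f);             // flop_count_sp_special (special unit)
    }
}
```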

Thanks for clearing that up. Indeed, I do not have these special functions in my kernel.
Could you please tell me what other functions, apart from sin(), cos(), etc., could contribute to "flop_count_sp_special"?

As per your suggestion, I have shared my code in a Mercurial repository on Bitbucket.org
(https://bitbucket.org/rajgu...).

Thanks in advance.

Basically these: http://docs.nvidia.com/cuda...

Thank you, that was eye opening.

Can we output register values to the console from PTX assembly? Furthermore, is there any way to get register addresses in PTX assembly?
Any help will be appreciated.
Thanks

I don't follow your questions. Can you clarify?

Thanks for your reply, Mark. I am currently working on a research project in which I need to access register values and their addresses in order to recognize their access patterns in CUDA. That's why I need to print these values to the console or write them to a file. In short, is there any way to use a print or write command in PTX assembly?
Thank You

This isn't something the profiler supports. You can use inline PTX in the kernel code to write the values of specific registers to memory. Registers don't have "addresses", but you could write the register names with their values, since the names are explicit in your inline PTX. http://docs.nvidia.com/cuda...
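As a sketch (hypothetical names; see the inline PTX documentation linked above), you can route a register's value out to global memory with an explicit PTX mov, and you know which virtual register it was because you name it in the asm statement:

```cuda
// Hypothetical example: capture a value through an explicit PTX register
// move and store it to global memory for inspection from the host.
__device__ float captured;

__global__ void capture_reg(float x)
{
    float v = x * 2.0f;  // v is held in some hardware register
    float out;
    // mov.f32 copies v's register into out's register; the virtual
    // register names are visible if you compile with nvcc -ptx.
    asm("mov.f32 %0, %1;" : "=f"(out) : "f"(v));
    captured = out;      // readable afterwards via cudaMemcpyFromSymbol
}
```

Compiling with nvcc -ptx and reading the generated .ptx file also shows all the virtual register names that ptxas later maps onto hardware registers.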