CUDA 7.5: Pinpoint Performance Problems with Instruction-Level Profiling

Originally published at:

[Note: Thejaswi Rao also contributed to the code optimizations shown in this post.] Today NVIDIA released CUDA 7.5, the latest release of the powerful CUDA Toolkit. One of the most exciting new features in CUDA 7.5 is new Instruction-Level Profiling support in the NVIDIA Visual Profiler. This powerful new feature, available on Maxwell (GM200) and…

I can't set up the CUDA Toolkit. Is my GTX 980 Ti not a CUDA-compatible graphics card?

You'll need to provide more information on the problems you are having. All NVIDIA GPUs are CUDA-compatible.

I upgraded to 7.5 on my Ubuntu host, but now I can't debug on Jetson TK1 target due to error "cuda-gdb version (7.5.123) is not compatible with cuda-gdbserver version (6.5.121)". Is there some way to get 7.5 on the TK1?

nvvp does not show the information in columns and rows (for example, the Utilization column and the stacks in "Kernel Performance is Bound By Instruction And Memory Latency").

I don't fully understand the question. Is there a figure from this post where you see something different? Which figure? Can you link to a screenshot showing what you see instead? Thanks!

You can see the difference in the attached screenshots.

We tried with a Fermi GPU on Windows 7 and could not reproduce this issue.
It seems you took both screenshots on the same platform with the same GPU, is that correct?
If you can give detailed steps to reproduce the issue, along with the platform and operating system you are working on, it will help us reproduce the issue more quickly.

Yes, both screenshots were taken on the same computer with the same GPU, one running CUDA 7.0 (OK) and the other running CUDA 7.5 (not OK).
The system is running Scientific Linux 6.7 x86_64 on an i7 processor with 8 GB of RAM.

We are unable to reproduce this behaviour on a CentOS 7/GTX 480 setup with the CUDA 7.5 production release (7.5.18).
"Scientific Linux 6.7 x86_64" is not officially supported in CUDA 7.5.

Well... Scientific Linux is not officially supported in CUDA, but it is very similar to CentOS, so I suppose CentOS will show the same problem. If I have some free time, I will install a CentOS 6.x machine with CUDA 7.0 and 7.5.

Found the same problem on a CentOS 6.6 machine with K80s. Have you fixed the problem?

Now, on CentOS 7.0, both CUDA 7.0 and CUDA 7.5 run OK, and nvvp correctly shows the information in columns and rows (for example, the Utilization column and the stacks in "Kernel Performance is Bound By Instruction And Memory Latency").
So on CentOS 7.x we could say "OK", but on CentOS 6.x (and SL 6.x) the problem persists...

Thanks for the tip!

Would you be able to post the modified source code (

Great explanation. But how can I do this instruction-level profiling on the command line via nvprof?