How to get the ncu equivalent of an nvprof metric (nvprof --query-metrics)

The book I am studying from is fairly old and uses the now-defunct nvprof for various profiling tasks.
It uses the following for branch efficiency:
nvprof --metrics branch_efficiency
But nvprof complains that it cannot profile devices with compute capability 7.5 and higher; to get it to work I would have to use a GPU with a compute capability below 7.5 or a very old CUDA toolkit. The warning suggests using ncu instead, but I am not sure which ncu command-line argument provides the equivalent; there is certainly no metric called branch_efficiency (see below).

Without the --metrics command-line parameter, nvprof still seems to provide some info (below).

nvprof --metrics branch_efficiency ./p84.out
======== Warning: Skipping profiling on device 0 since profiling is not supported on devices with compute capability 7.5 and higher.
Use NVIDIA Nsight Compute for GPU profiling and NVIDIA Nsight Systems for GPU tracing and CPU sampling.
Refer Nsight Developer Tools | NVIDIA Developer for more details.

==6699== NVPROF is profiling process 6699, command: ./p84.out
./p84.out using Device 0: NVIDIA GeForce RTX 2070 SUPER
Data size: 16777216.
Execution configured (block 1024 grid 16384).
Warmup <<<<16384 1024 >>> elapsed 0 sec
MathKernel1 <<<16384 1024 >>> elapsed 0 sec
MathKernel2 <<<16384 1024 >>> elapsed 0 sec
MathKernel3 <<<16384 1024 >>> elapsed 0 sec
MathKernel4 <<<16384 1024 >>> elapsed 0 sec
==6699== Profiling application: ./p84.out
==6699== Profiling result:
No events/metrics were profiled.

nvprof ./p84.out
==6672== NVPROF is profiling process 6672, command: ./p84.out
./p84.out using Device 0: NVIDIA GeForce RTX 2070 SUPER
Data size: 16777216.
Execution configured (block 1024 grid 16384).
Warmup <<<<16384 1024 >>> elapsed 0 sec
MathKernel1 <<<16384 1024 >>> elapsed 0 sec
MathKernel2 <<<16384 1024 >>> elapsed 0 sec
MathKernel3 <<<16384 1024 >>> elapsed 0 sec
MathKernel4 <<<16384 1024 >>> elapsed 0 sec
==6672== Profiling application: ./p84.out
==6672== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 55.75% 505.25us 3 168.42us 168.22us 168.70us mathKernel1(float*)
25.36% 229.82us 1 229.82us 229.82us 229.82us mathKernel2(float*)
18.89% 171.20us 1 171.20us 171.20us 171.20us warmingup(float*)
API calls: 97.57% 79.194ms 1 79.194ms 79.194ms 79.194ms cudaMalloc
1.13% 916.47us 6 152.75us 5.0420us 231.40us cudaDeviceSynchronize
0.97% 786.77us 5 157.35us 3.7130us 765.94us cudaLaunchKernel
0.18% 148.59us 114 1.3030us 103ns 61.066us cuDeviceGetAttribute
0.13% 102.12us 1 102.12us 102.12us 102.12us cudaGetDeviceProperties
0.01% 10.977us 1 10.977us 10.977us 10.977us cuDeviceGetName
0.01% 7.8040us 1 7.8040us 7.8040us 7.8040us cuDeviceGetPCIBusId
0.00% 937ns 3 312ns 101ns 589ns cuDeviceGetCount
0.00% 838ns 1 838ns 838ns 838ns cuModuleGetLoadingMode
0.00% 675ns 2 337ns 118ns 557ns cuDeviceGet
0.00% 514ns 1 514ns 514ns 514ns cuDeviceTotalMem
0.00% 180ns 1 180ns 180ns 180ns cuDeviceGetUuid

NCU:
[guyen@localhost ch3]$ ncu --list-metrics | grep branch
[guyen@localhost ch3]$ ncu --list-metrics | grep branch -i
[guyen@localhost ch3]$ ncu --list-metrics | grep occupancy
launch__occupancy_per_shared_mem_size
-launch__occupancy_per_shared_mem_size
launch__occupancy_per_register_count
-launch__occupancy_per_register_count
launch__occupancy_per_block_size
-launch__occupancy_per_block_size
launch__occupancy_limit_warps
-launch__occupancy_limit_warps
launch__occupancy_limit_shared_mem
-launch__occupancy_limit_shared_mem
launch__occupancy_limit_registers
-launch__occupancy_limit_registers
launch__occupancy_limit_blocks
-launch__occupancy_limit_blocks
launch__occupancy_per_cluster_size
-arch:90:90:launch__occupancy_per_cluster_size
launch__occupancy_cluster_pct
-arch:90:90:launch__occupancy_cluster_pct
launch__occupancy_cluster_gpu_pct
-arch:90:90:launch__occupancy_cluster_gpu_pct

The ncu equivalent is smsp__sass_average_branch_targets_threads_uniform.pct
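
For example, you should be able to collect it directly with something like this (using the same binary as in your post):

ncu --metrics smsp__sass_average_branch_targets_threads_uniform.pct ./p84.out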

Please refer to the Metric Comparison and Event Comparison sub-sections in the Nvprof Transition Guide section of the Nsight Compute CLI User Guide.
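
Also note that, as far as I understand, ncu --list-metrics only lists the metrics collected by the currently enabled sections, which is why your grep came back empty; ncu --query-metrics should list everything available for your chip, so something like this should surface the branch metric:

ncu --query-metrics | grep -i branch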

Thanks, I will try.

One thing I noted was that "ncu -o profile" generates a proprietary profile file, which I can load into the ncu-ui GUI and inspect per kernel. In there, there is a field called "Theoretical vs. Achieved Occupancy". Wondering if that is the same as smsp__sass_average_branch_targets_threads_uniform.pct?
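
For reference, the rough flow I used (the report name comes from whatever is passed to -o, so profile.ncu-rep here):

ncu -o profile ./p84.out
ncu-ui profile.ncu-rep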

I have yet to compare these to see if they match, but I am posting this here as a question anyway.