How to get the ncu equivalent of an nvprof metric (nvprof --query-metrics)

The book I am studying from is fairly old and uses the now-defunct nvprof for various profiling tasks.
It uses the following for branch efficiency:
nvprof --metrics branch_efficiency
But nvprof complains that it cannot profile devices with compute capability 7.5 and higher; to get it to work I would have to use a GPU with a compute capability below 7.5 or a very old CUDA toolkit. The warning suggests using ncu instead, but I am not sure which ncu command-line argument provides the equivalent; there is certainly no metric called branch_efficiency (see below).

Without the --metrics command-line parameter, nvprof still seems to provide some info (below).

nvprof --metrics branch_efficiency ./p84.out
======== Warning: Skipping profiling on device 0 since profiling is not supported on devices with compute capability 7.5 and higher.
Use NVIDIA Nsight Compute for GPU profiling and NVIDIA Nsight Systems for GPU tracing and CPU sampling.
Refer Nsight Developer Tools | NVIDIA Developer for more details.

==6699== NVPROF is profiling process 6699, command: ./p84.out
./p84.out using Device 0: NVIDIA GeForce RTX 2070 SUPER
Data size: 16777216.
Execution configured (block 1024 grid 16384).
Warmup <<<<16384 1024 >>> elapsed 0 sec
MathKernel1 <<<16384 1024 >>> elapsed 0 sec
MathKernel2 <<<16384 1024 >>> elapsed 0 sec
MathKernel3 <<<16384 1024 >>> elapsed 0 sec
MathKernel4 <<<16384 1024 >>> elapsed 0 sec
==6699== Profiling application: ./p84.out
==6699== Profiling result:
No events/metrics were profiled.

nvprof ./p84.out
==6672== NVPROF is profiling process 6672, command: ./p84.out
./p84.out using Device 0: NVIDIA GeForce RTX 2070 SUPER
Data size: 16777216.
Execution configured (block 1024 grid 16384).
Warmup <<<<16384 1024 >>> elapsed 0 sec
MathKernel1 <<<16384 1024 >>> elapsed 0 sec
MathKernel2 <<<16384 1024 >>> elapsed 0 sec
MathKernel3 <<<16384 1024 >>> elapsed 0 sec
MathKernel4 <<<16384 1024 >>> elapsed 0 sec
==6672== Profiling application: ./p84.out
==6672== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 55.75% 505.25us 3 168.42us 168.22us 168.70us mathKernel1(float*)
25.36% 229.82us 1 229.82us 229.82us 229.82us mathKernel2(float*)
18.89% 171.20us 1 171.20us 171.20us 171.20us warmingup(float*)
API calls: 97.57% 79.194ms 1 79.194ms 79.194ms 79.194ms cudaMalloc
1.13% 916.47us 6 152.75us 5.0420us 231.40us cudaDeviceSynchronize
0.97% 786.77us 5 157.35us 3.7130us 765.94us cudaLaunchKernel
0.18% 148.59us 114 1.3030us 103ns 61.066us cuDeviceGetAttribute
0.13% 102.12us 1 102.12us 102.12us 102.12us cudaGetDeviceProperties
0.01% 10.977us 1 10.977us 10.977us 10.977us cuDeviceGetName
0.01% 7.8040us 1 7.8040us 7.8040us 7.8040us cuDeviceGetPCIBusId
0.00% 937ns 3 312ns 101ns 589ns cuDeviceGetCount
0.00% 838ns 1 838ns 838ns 838ns cuModuleGetLoadingMode
0.00% 675ns 2 337ns 118ns 557ns cuDeviceGet
0.00% 514ns 1 514ns 514ns 514ns cuDeviceTotalMem
0.00% 180ns 1 180ns 180ns 180ns cuDeviceGetUuid

NCU:
[guyen@localhost ch3]$ ncu --list-metrics | grep branch
[guyen@localhost ch3]$ ncu --list-metrics | grep branch -i
[guyen@localhost ch3]$ ncu --list-metrics | grep occupancy
launch__occupancy_per_shared_mem_size
-launch__occupancy_per_shared_mem_size
launch__occupancy_per_register_count
-launch__occupancy_per_register_count
launch__occupancy_per_block_size
-launch__occupancy_per_block_size
launch__occupancy_limit_warps
-launch__occupancy_limit_warps
launch__occupancy_limit_shared_mem
-launch__occupancy_limit_shared_mem
launch__occupancy_limit_registers
-launch__occupancy_limit_registers
launch__occupancy_limit_blocks
-launch__occupancy_limit_blocks
launch__occupancy_per_cluster_size
-arch:90:90:launch__occupancy_per_cluster_size
launch__occupancy_cluster_pct
-arch:90:90:launch__occupancy_cluster_pct
launch__occupancy_cluster_gpu_pct
-arch:90:90:launch__occupancy_cluster_gpu_pct

The ncu equivalent is smsp__sass_average_branch_targets_threads_uniform.pct
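
For example, you should be able to collect it directly with something like this (using the same binary as in your post):

ncu --metrics smsp__sass_average_branch_targets_threads_uniform.pct ./p84.out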

Please refer to the Metric Comparison and Event Comparison sub-sections in the Nvprof Transition Guide section of the Nsight Compute CLI User Guide.
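
Also note that, as far as I understand, ncu --list-metrics only lists the metrics collected by the currently enabled sections, which is why your grep came back empty; ncu --query-metrics should list everything available for your chip, so something like this should surface the branch metric:

ncu --query-metrics | grep -i branch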

Thanks, I will try.

One thing I noted was that "ncu -o profile" generates a proprietary profile file, which I can load into the ncu-ui GUI and inspect per kernel. In there, there is a field called "Theoretical vs. Achieved Occupancy". Wondering if that is the same as smsp__sass_average_branch_targets_threads_uniform.pct?
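
For reference, the rough flow I used (the report name comes from whatever is passed to -o, so profile.ncu-rep here):

ncu -o profile ./p84.out
ncu-ui profile.ncu-rep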

I have yet to compare these to see if they match, but I am posting this here as a question anyway.