I have a fairly old CUDA book from around 2014, "Professional CUDA C Programming", which focuses on Fermi and Kepler. A lot of its examples use nvprof, but on my system (RTX 2070, compute capability 7.5) nvprof no longer appears to be supported.
For example, I tried the branch_efficiency metric:
nvprof --metrics branch_efficiency ./a.out 256 33554432
======== Warning: Skipping profiling on device 0 since profiling is not supported on devices with compute capability 7.5 and higher.
Use NVIDIA Nsight Compute for GPU profiling and NVIDIA Nsight Systems for GPU tracing and CPU sampling.
Refer NVIDIA Developer Tools Overview | NVIDIA Developer for more details.
I have now installed Nsight Compute and tried the command-line version to look for a similar metric, but it does not appear to find anything. Any ideas?
root@nonroot-MS-7B22:/git.co/dev-learn/gpu/cuda/linux/cuda-c-programming# nv-nsight-cu-cli --list-metrics | grep -i branch
root@nonroot-MS-7B22:/git.co/dev-learn/gpu/cuda/linux/cuda-c-programming# nv-nsight-cu-cli --list-metrics
sm__warps_active.avg.per_cycle_active
sm__warps_active.avg.pct_of_peak_sustained_active
sm__throughput.avg.pct_of_peak_sustained_elapsed
sm__maximum_warps_per_active_cycle_pct
sm__maximum_warps_avg_per_active_cycle
sm__cycles_active.avg
lts__throughput.avg.pct_of_peak_sustained_elapsed
launch__waves_per_multiprocessor
launch__thread_count
launch__shared_mem_per_block_static
launch__shared_mem_per_block_dynamic
launch__shared_mem_per_block_driver
launch__shared_mem_per_block
launch__shared_mem_config_size
launch__registers_per_thread
launch__occupancy_per_shared_mem_size
launch__occupancy_per_register_count
launch__occupancy_per_block_size
launch__occupancy_limit_warps
launch__occupancy_limit_shared_mem
launch__occupancy_limit_registers
launch__occupancy_limit_blocks
launch__grid_size
launch__func_cache_config
launch__block_size
l1tex__throughput.avg.pct_of_peak_sustained_active
gpu__time_duration.sum
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
-arch:75:86:gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
-arch:40:70:gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed
gpc__cycles_elapsed.max
gpc__cycles_elapsed.avg.per_second
dram__cycles_elapsed.avg.per_second
-arch:75:86:dram__cycles_elapsed.avg.per_second
-arch:40:70:dram__cycles_elapsed.avg.per_second
breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed
breakdown:gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed
root@nonroot-MS-7B22:/git.co/dev-learn/gpu/cuda/linux/cuda-c-programming#
I can get the print-summary output, but it contains far more than I need and still does not show the specific metric I am looking for (branch efficiency, as mentioned above).
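In case it helps, here is what I was planning to try next, based on my (possibly wrong) reading of the Nsight Compute docs: --list-metrics seems to only show the metrics collected by the currently selected sections, while --query-metrics should list everything the device actually supports, and the nvprof transition table appears to map branch_efficiency to smsp__sass_average_branch_targets_threads_uniform.pct. So something like:

nv-nsight-cu-cli --query-metrics | grep -i branch
nv-nsight-cu-cli --metrics smsp__sass_average_branch_targets_threads_uniform.pct ./a.out 256 33554432

I have not been able to confirm that this is the right metric name or the right way to query it, so corrections are welcome.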