I have a fairly old CUDA book, “Professional CUDA C Programming”, released around 2014, which focuses on Fermi and Kepler. Many of its examples use nvprof, but on my system (RTX 2070, compute capability 7.5) nvprof no longer appears to be supported.
For example, I tried the branch efficiency metric:
nvprof --metrics branch_efficiency ./a.out 256 33554432
======== Warning: Skipping profiling on device 0 since profiling is not supported on devices with compute capability 7.5 and higher.
Use NVIDIA Nsight Compute for GPU profiling and NVIDIA Nsight Systems for GPU tracing and CPU sampling.
Refer NVIDIA Developer Tools Overview | NVIDIA Developer for more details.
I then installed Nsight Compute and tried the command-line version to look for a similar metric, but it does not appear to find anything. Any ideas?
root@nonroot-MS-7B22:/git.co/dev-learn/gpu/cuda/linux/cuda-c-programming# nv-nsight-cu-cli --list-metrics | grep -i branch
root@nonroot-MS-7B22:/git.co/dev-learn/gpu/cuda/linux/cuda-c-programming# nv-nsight-cu-cli --list-metrics
sm__warps_active.avg.per_cycle_active
sm__warps_active.avg.pct_of_peak_sustained_active
sm__throughput.avg.pct_of_peak_sustained_elapsed
sm__maximum_warps_per_active_cycle_pct
sm__maximum_warps_avg_per_active_cycle
sm__cycles_active.avg
lts__throughput.avg.pct_of_peak_sustained_elapsed
launch__waves_per_multiprocessor
launch__thread_count
launch__shared_mem_per_block_static
launch__shared_mem_per_block_dynamic
launch__shared_mem_per_block_driver
launch__shared_mem_per_block
launch__shared_mem_config_size
launch__registers_per_thread
launch__occupancy_per_shared_mem_size
launch__occupancy_per_register_count
launch__occupancy_per_block_size
launch__occupancy_limit_warps
launch__occupancy_limit_shared_mem
launch__occupancy_limit_registers
launch__occupancy_limit_blocks
launch__grid_size
launch__func_cache_config
launch__block_size
l1tex__throughput.avg.pct_of_peak_sustained_active
gpu__time_duration.sum
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
-arch:75:86:gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
-arch:40:70:gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed
gpc__cycles_elapsed.max
gpc__cycles_elapsed.avg.per_second
dram__cycles_elapsed.avg.per_second
-arch:75:86:dram__cycles_elapsed.avg.per_second
-arch:40:70:dram__cycles_elapsed.avg.per_second
breakdown:sm__throughput.avg.pct_of_peak_sustained_elapsed
breakdown:gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed
root@nonroot-MS-7B22:/git.co/dev-learn/gpu/cuda/linux/cuda-c-programming#
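One thing that may explain the empty grep: as far as I understand the CLI, --list-metrics only lists the metrics that the currently enabled sections will collect, while --query-metrics asks the device for everything it can collect. The sketch below is a guess at how to look for a branch metric and collect just that one. The metric name is an assumption taken from my reading of the nvprof-to-Nsight-Compute metric mapping, not something the output above confirms, and newer toolkits may ship the tool as ncu instead of nv-nsight-cu-cli.

```shell
#!/bin/sh
# Sketch only. METRIC is an assumed equivalent of nvprof's branch_efficiency;
# verify it against the --query-metrics output on your own system.
NCU=nv-nsight-cu-cli   # assumption: newer CUDA toolkits name this binary `ncu`
METRIC=smsp__sass_average_branch_targets_threads_uniform.pct

if command -v "$NCU" >/dev/null 2>&1; then
    # Query every metric the device supports, not just those in the
    # enabled sections, and filter for branch-related names.
    "$NCU" --query-metrics | grep -i branch

    # Collect only the one metric instead of the default section set,
    # using the same program arguments as the nvprof run above.
    "$NCU" --metrics "$METRIC" ./a.out 256 33554432
else
    echo "$NCU not found in PATH"
fi
```

If the query step shows a different branch-related name on a given driver and toolkit combination, that name can be substituted for METRIC.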
I can get print-summary output, but it prints far more than necessary and does not include the specific metric I was looking for (mentioned above):
==PROF== Connected to process 24102 (/git.co/dev-learn/gpu/cuda/linux/cuda-c-programming/a.out)
/git.co/dev-learn/gpu/cuda/linux/cuda-c-programming/./a.out using Device 0: NVIDIA GeForce RTX 2070 SUPER.
Data size 64.
Execution configure (block 64 grid 1).
==PROF== Profiling "warmingUp(float*)" - 1: 0%....50%....100% - 8 passes
warmup <<< 1 64 >>> elapsed 000000 sec.
==PROF== Profiling "mathKernel1(float*)" - 2: 0%....50%....100% - 8 passes
mathKernel1 <<< 1 64 >>> elapsed 000001 sec.
==PROF== Profiling "mathKernel2(float*)" - 3: 0%....50%....100% - 8 passes
mathKernel2 <<< 1 64 >>> elapsed 000000 sec.
==PROF== Disconnected from process 24102
[24102] a.out@127.0.0.1
Device 0
mathKernel1(float*), Block Size 64, Grid Size 1, 1 invocations
Section: GPU Speed Of Light
Metric Name Metric Unit Minimum Maximum Average
---------------------------------------------------------------- ------------- ----------- ----------- -----------
dram__cycles_elapsed.avg.per_second cycle/nsecond 6.468085 6.468085 6.468085
gpc__cycles_elapsed.avg.per_second cycle/nsecond 1.502992 1.502992 1.502992
gpc__cycles_elapsed.max cycle 2265.000000 2265.000000 2265.000000
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed % 0.836062 0.836062 0.836062
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed % 0.025699 0.025699 0.025699
gpu__time_duration.sum usecond 1.504000 1.504000 1.504000
l1tex__throughput.avg.pct_of_peak_sustained_active % 22.429907 22.429907 22.429907
lts__throughput.avg.pct_of_peak_sustained_elapsed % 0.836062 0.836062 0.836062
sm__cycles_active.avg cycle 18.725000 18.725000 18.725000
sm__throughput.avg.pct_of_peak_sustained_elapsed % 0.009399 0.009399 0.009399
mathKernel1(float*), Block Size 64, Grid Size 1, 1 invocations
Section: Launch Statistics
Metric Name Metric Unit Minimum Maximum Average
------------------------------------ --------------- --------- --------- ---------
launch__block_size 64.000000 64.000000 64.000000
launch__grid_size 1.000000 1.000000 1.000000
launch__registers_per_thread register/thread 16.000000 16.000000 16.000000
launch__shared_mem_config_size Kbyte 32.768000 32.768000 32.768000
launch__shared_mem_per_block_driver byte/block 0.000000 0.000000 0.000000
launch__shared_mem_per_block_dynamic byte/block 0.000000 0.000000 0.000000
launch__shared_mem_per_block_static byte/block 0.000000 0.000000 0.000000
launch__thread_count thread 64.000000 64.000000 64.000000
launch__waves_per_multiprocessor 0.001563 0.001563 0.001563
mathKernel1(float*), Block Size 64, Grid Size 1, 1 invocations
Section: Occupancy
Metric Name Metric Unit Minimum Maximum Average
------------------------------------------------- ----------- ---------- ---------- ----------
launch__occupancy_limit_blocks block 16.000000 16.000000 16.000000
launch__occupancy_limit_registers block 64.000000 64.000000 64.000000
launch__occupancy_limit_shared_mem block 16.000000 16.000000 16.000000
launch__occupancy_limit_warps block 16.000000 16.000000 16.000000
sm__maximum_warps_avg_per_active_cycle warp 32.000000 32.000000 32.000000
sm__maximum_warps_per_active_cycle_pct % 100.000000 100.000000 100.000000
sm__warps_active.avg.pct_of_peak_sustained_active % 6.229139 6.229139 6.229139
sm__warps_active.avg.per_cycle_active warp 1.993324 1.993324 1.993324
mathKernel2(float*), Block Size 64, Grid Size 1, 1 invocations
Section: GPU Speed Of Light
Metric Name Metric Unit Minimum Maximum Average
---------------------------------------------------------------- ------------- ----------- ----------- -----------
dram__cycles_elapsed.avg.per_second cycle/nsecond 6.204082 6.204082 6.204082
gpc__cycles_elapsed.avg.per_second cycle/nsecond 1.495536 1.495536 1.495536
gpc__cycles_elapsed.max cycle 2350.000000 2350.000000 2350.000000
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed % 0.861875 0.861875 0.861875
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed % 0.077097 0.077097 0.077097
gpu__time_duration.sum usecond 1.568000 1.568000 1.568000
l1tex__throughput.avg.pct_of_peak_sustained_active % 20.095694 20.095694 20.095694
lts__throughput.avg.pct_of_peak_sustained_elapsed % 0.861875 0.861875 0.861875
sm__cycles_active.avg cycle 20.900000 20.900000 20.900000
sm__throughput.avg.pct_of_peak_sustained_elapsed % 0.018654 0.018654 0.018654
mathKernel2(float*), Block Size 64, Grid Size 1, 1 invocations
Section: Launch Statistics
Metric Name Metric Unit Minimum Maximum Average
------------------------------------ --------------- --------- --------- ---------
launch__block_size 64.000000 64.000000 64.000000
launch__grid_size 1.000000 1.000000 1.000000
launch__registers_per_thread register/thread 16.000000 16.000000 16.000000
launch__shared_mem_config_size Kbyte 32.768000 32.768000 32.768000
launch__shared_mem_per_block_driver byte/block 0.000000 0.000000 0.000000
launch__shared_mem_per_block_dynamic byte/block 0.000000 0.000000 0.000000
launch__shared_mem_per_block_static byte/block 0.000000 0.000000 0.000000
launch__thread_count thread 64.000000 64.000000 64.000000
launch__waves_per_multiprocessor 0.001563 0.001563 0.001563
mathKernel2(float*), Block Size 64, Grid Size 1, 1 invocations
Section: Occupancy
Metric Name Metric Unit Minimum Maximum Average
------------------------------------------------- ----------- ---------- ---------- ----------
launch__occupancy_limit_blocks block 16.000000 16.000000 16.000000
launch__occupancy_limit_registers block 64.000000 64.000000 64.000000
launch__occupancy_limit_shared_mem block 16.000000 16.000000 16.000000
launch__occupancy_limit_warps block 16.000000 16.000000 16.000000
sm__maximum_warps_avg_per_active_cycle warp 32.000000 32.000000 32.000000
sm__maximum_warps_per_active_cycle_pct % 100.000000 100.000000 100.000000
sm__warps_active.avg.pct_of_peak_sustained_active % 6.231310 6.231310 6.231310
sm__warps_active.avg.per_cycle_active warp 1.994019 1.994019 1.994019
warmingUp(float*), Block Size 64, Grid Size 1, 1 invocations
Section: GPU Speed Of Light
Metric Name Metric Unit Minimum Maximum Average
---------------------------------------------------------------- ------------- ----------- ----------- -----------
dram__cycles_elapsed.avg.per_second cycle/nsecond 6.080000 6.080000 6.080000
gpc__cycles_elapsed.avg.per_second cycle/nsecond 1.465521 1.465521 1.465521
gpc__cycles_elapsed.max cycle 2349.000000 2349.000000 2349.000000
gpu__compute_memory_throughput.avg.pct_of_peak_sustained_elapsed % 0.862263 0.862263 0.862263
gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed % 0.313528 0.313528 0.313528
gpu__time_duration.sum usecond 1.600000 1.600000 1.600000
l1tex__throughput.avg.pct_of_peak_sustained_active % 20.216606 20.216606 20.216606
lts__throughput.avg.pct_of_peak_sustained_elapsed % 0.862263 0.862263 0.862263
sm__cycles_active.avg cycle 20.775000 20.775000 20.775000
sm__throughput.avg.pct_of_peak_sustained_elapsed % 0.018657 0.018657 0.018657
warmingUp(float*), Block Size 64, Grid Size 1, 1 invocations
Section: Launch Statistics
Metric Name Metric Unit Minimum Maximum Average
------------------------------------ --------------- --------- --------- ---------
launch__block_size 64.000000 64.000000 64.000000
launch__grid_size 1.000000 1.000000 1.000000
launch__registers_per_thread register/thread 16.000000 16.000000 16.000000
launch__shared_mem_config_size Kbyte 32.768000 32.768000 32.768000
launch__shared_mem_per_block_driver byte/block 0.000000 0.000000 0.000000
launch__shared_mem_per_block_dynamic byte/block 0.000000 0.000000 0.000000
launch__shared_mem_per_block_static byte/block 0.000000 0.000000 0.000000
launch__thread_count thread 64.000000 64.000000 64.000000
launch__waves_per_multiprocessor 0.001563 0.001563 0.001563
warmingUp(float*), Block Size 64, Grid Size 1, 1 invocations
Section: Occupancy
Metric Name Metric Unit Minimum Maximum Average
------------------------------------------------- ----------- ---------- ---------- ----------
launch__occupancy_limit_blocks block 16.000000 16.000000 16.000000
launch__occupancy_limit_registers block 64.000000 64.000000 64.000000
launch__occupancy_limit_shared_mem block 16.000000 16.000000 16.000000
launch__occupancy_limit_warps block 16.000000 16.000000 16.000000
sm__maximum_warps_avg_per_active_cycle warp 32.000000 32.000000 32.000000
sm__maximum_warps_per_active_cycle_pct % 100.000000 100.000000 100.000000
sm__warps_active.avg.pct_of_peak_sustained_active % 6.231197 6.231197 6.231197
sm__warps_active.avg.per_cycle_active warp 1.993983 1.993983 1.993983
Note: The shown averages are calculated as the arithmetic mean of the metric values after the evaluation of the metrics for each individual kernel launch.
If aggregating across varying launch configurations (like shared memory, cache config settings), the arithmetic mean can be misleading and looking at the individual results is recommended instead.
This output mode is backwards compatible to the per-kernel summary output of nvprof
root@nonroot-MS-7B22:/git.co/dev-learn/gpu/cuda/linux/cuda-c-programming#