Consistency of data collected by nvprof and nsight compute

  1. On a Tesla V100, I used nvprof to collect dram_read_bytes and dram_write_bytes for the shufflenetv2_05 model during inference. The command used was: sudo /usr/local/cuda/bin/nvprof -m dram_read_bytes -m dram_write_bytes --log-file dram-bytes-shufflenet-nvprof.log python3 shufflenet-py-224-if-1.py. At the same time, in order to visualize the data, I also collected the same program with Nsight Compute. The command used was: sudo /usr/local/cuda-11.2/bin/ncu --section MemoryWorkloadAnalysis_Chart --section MemoryWorkloadAnalysis_Tables -o gpu-reset-transaction-chart-table-shufflenet python3 shufflenet-py-224-if-1.py

  2. But the results of the two tools differ: (1) For each kernel, nvprof reports both dram_read_bytes and dram_write_bytes, i.e. both read and write traffic. As shown in [1], the volta_scudnn_128x32_relu_small_nn_v1 kernel reads 20192 bytes and writes 1216192 bytes. (2) The Nsight Compute result only shows data read (as shown in [2]), 680128 bytes. Why is the written data not displayed? And why is there such a big gap between the read values reported by the two tools?

Thank you for your answer!


The difference in DRAM metric values between nvprof and Nsight Compute can be due to a difference in cache control.

By default, Nsight Compute flushes all caches before each kernel replay pass, whereas nvprof does not flush them.

You can try using the Nsight Compute command line option --cache-control none
(refer to Nsight Compute CLI :: Nsight Compute Documentation).
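
For example, reusing the command from your question (the output file and script names are just the ones you used and are placeholders for your setup), the option could be added like this:

sudo /usr/local/cuda-11.2/bin/ncu --cache-control none --section MemoryWorkloadAnalysis_Chart --section MemoryWorkloadAnalysis_Tables -o gpu-reset-transaction-chart-table-shufflenet python3 shufflenet-py-224-if-1.py

With --cache-control none, Nsight Compute no longer flushes the caches before each replay pass, so its DRAM byte counts should be more directly comparable to nvprof's dram_read_bytes / dram_write_bytes values.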

In addition, if the GPU is also driving the display, or an X server or any other application is running in the background, the DRAM metrics can be affected as well. That should impact both tools in the same way, but it can also cause non-determinism between different runs.
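
As a quick sanity check (just a suggestion, not a requirement), you can run nvidia-smi before profiling to see whether an X server or other processes are currently using the GPU; the exact process list shown depends on your driver and setup:

nvidia-smi

Ideally, the GPU you profile on is running only the application being measured.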

Thank you for your answer.