On a Tesla V100, I used nvprof to collect dram_read_bytes and dram_write_bytes for the shufflenetv2_05 model during inference. The command was:

sudo /usr/local/cuda/bin/nvprof -m dram_read_bytes -m dram_write_bytes --log-file dram-bytes-shufflenet-nvprof.log python3 shufflenet-py-224-if-1.py

To visualize the memory traffic, I also profiled the same program with Nsight Compute. The command was:

sudo /usr/local/cuda-11.2/bin/ncu --section MemoryWorkloadAnalysis_Chart --section MemoryWorkloadAnalysis_Tables -o gpu-reset-trancasion-chrat-table-shufflenet python3 shufflenet-py-224-if-1.py
However, the two tools report different results: (1) In nvprof, each kernel reports both dram_read_bytes and dram_write_bytes, i.e. both read and write traffic. As shown in [1], the volta_scudnn_128x32_relu_small_nn_v1 kernel read 20192 bytes and wrote 1216192 bytes. (2) Nsight Compute shows only the data read (as shown in [2]): 680128 bytes. Why is the written data not displayed? And why is there such a large gap between the read sizes reported by the two tools?
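For a like-for-like comparison, one thing I could try is asking ncu for raw DRAM byte counters instead of the MemoryWorkloadAnalysis sections. The sketch below only builds and prints such a command. It assumes the mapping given in NVIDIA's nvprof-to-Nsight-Compute metric comparison (dram_read_bytes to dram__bytes_read.sum, dram_write_bytes to dram__bytes_write.sum). The helper name build_ncu_command is my own, and I have not verified that the resulting numbers match nvprof.

```python
import shlex

# Assumed nvprof -> Nsight Compute metric mapping (from NVIDIA's
# nvprof transition guide; not verified on my V100 setup).
NVPROF_TO_NCU = {
    "dram_read_bytes": "dram__bytes_read.sum",
    "dram_write_bytes": "dram__bytes_write.sum",
}


def build_ncu_command(script, ncu="/usr/local/cuda-11.2/bin/ncu"):
    """Hypothetical helper: return an ncu argv that collects the DRAM
    read/write byte counters for every kernel launched by `script`."""
    metrics = ",".join(NVPROF_TO_NCU.values())
    return [ncu, "--metrics", metrics, "python3", script]


cmd = build_ncu_command("shufflenet-py-224-if-1.py")
# Print the command for copy/paste; nothing is executed here.
print(shlex.join(cmd))
```

Running the printed command (with sudo, as above) should list per-kernel read and write bytes, which could then be compared with the nvprof log for volta_scudnn_128x32_relu_small_nn_v1.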
Thank you for your answer!