On a Tesla V100, I used nvprof to collect dram_read_bytes and dram_write_bytes for the shufflenetv2_05 model during inference. The command used was:

sudo /usr/local/cuda/bin/nvprof -m dram_read_bytes -m dram_write_bytes --log-file dram-bytes-shufflenet-nvprof.log python3 shufflenet-py-224-if-1.py

To visualize the results, I also profiled the same program with Nsight Compute. The command used was:

sudo /usr/local/cuda-11.2/bin/ncu --section MemoryWorkloadAnalysis_Chart --section MemoryWorkloadAnalysis_Tables -o gpu-reset-trancasion-chrat-table-shufflenet python3 shufflenet-py-224-if-1.py
However, the two tools give different results:

(1) nvprof reports both dram_read_bytes and dram_write_bytes for each kernel, i.e. both the data read and the data written. For example, in the nvprof output for the kernel volta_scudnn_128x32_relu_small_nn_v1, 20192 bytes are read and 1216192 bytes are written.

(2) Nsight Compute shows only the data read, 680128 bytes. Why is the written data not displayed? And why is there such a large gap between the read sizes reported by the two tools?
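To put a number on the gap, here is a minimal sketch that compares the two read figures quoted above. The variable names are my own (hypothetical), and it assumes that both values refer to the same kernel launch, which I have not been able to confirm.

```python
# DRAM read sizes quoted above, in bytes.
# Assumption: both numbers describe the same kernel launch.
nvprof_read_bytes = 20192   # nvprof, dram_read_bytes
ncu_read_bytes = 680128     # Nsight Compute, memory workload chart

# Absolute and relative difference between the two tools.
diff_bytes = ncu_read_bytes - nvprof_read_bytes
ratio = ncu_read_bytes / nvprof_read_bytes

print(f"difference: {diff_bytes} bytes")
print(f"Nsight Compute / nvprof: {ratio:.2f}x")
```

So the difference is not a rounding or unit issue; the Nsight Compute figure is more than an order of magnitude larger.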
Thank you for your answer!