nv-nsight-cu-cli -f -o sampe --target-processes all --set full ./a.out
nv-nsight-cu-cli -f -o sampe --target-processes all --clock-control none --set full ./a.out
I use these two commands to profile a kernel (How to reach HBM Peak bandwidth performance), and they report different SM frequencies. First command: time 676.49 us, 741794 cycles, SM frequency 1.10 cycle/nsecond. Second command: time 678.18 us, 956869 cycles, SM frequency 1.41 cycle/nsecond. The kernel is a simple memory copy kernel and does no floating-point work.
The profiling results show that the two runs take about the same time, but the cycle counts differ by a wide margin. In Nsight Compute, is the cycle count computed from the time and the SM frequency, or is the time computed from the cycle count and the SM frequency? And for a memory-bound kernel, should the DRAM frequency be used in place of the SM frequency?
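As a quick sanity check (my own arithmetic, not something taken from the Nsight Compute documentation), dividing the reported cycles by the reported time reproduces the reported SM frequency for both runs, so the three values are at least consistent with each other. The helper name `implied_frequency_ghz` is only for illustration.

```python
# Values copied from the two profiling runs described above.
runs = {
    "default clock control": {"duration_us": 676.49, "cycles": 741794},
    "--clock-control none": {"duration_us": 678.18, "cycles": 956869},
}


def implied_frequency_ghz(cycles, duration_us):
    """Return cycles per nanosecond, which is numerically the same as GHz."""
    duration_ns = duration_us * 1000.0
    return cycles / duration_ns


results = {
    name: implied_frequency_ghz(run["cycles"], run["duration_us"])
    for name, run in runs.items()
}

for name, freq in results.items():
    print(f"{name}: {freq:.2f} cycle/nsecond")

# Prints:
# default clock control: 1.10 cycle/nsecond
# --clock-control none: 1.41 cycle/nsecond
```

This only shows that cycles is approximately time multiplied by SM frequency in both runs; it does not tell me which of the three values is measured and which is derived.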