Hello, I’m using the Nsight Compute CLI (ncu) to measure the memory bandwidth utilization of a decoder-only generation model. I’m running it with the command below.
/usr/local/cuda-12.1/nsight-compute-2023.1.1/ncu --log-file $SAVE_DIR/profile_stat.txt --metrics dram__bytes_read|write.sum.pct_of_peak_sustained_elapsed --print-summary per-kernel -o $SAVE_DIR/ncu_memory_bw_profile --nvtx --nvtx-include "hello/" --replay-mode app-range -f python …
The weird thing is that it takes far too long (4-5 hours) to profile just a single decoding step (one pass through 32 decoder layers). I tried filtering out unnecessary regions by setting an NVTX range and by changing the replay mode, but neither helped significantly.
I have two questions:
- Are those metrics (dram__bytes_read|write.sum.pct_of_peak_sustained_elapsed) the right indicators for memory bandwidth utilization?
- How can I reduce the profiling time? Or is there a more convenient way to measure GPU memory bandwidth utilization?
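
To be clear about what I mean by memory bandwidth utilization, here is a minimal sketch of the quantity I’m after: DRAM bytes moved (read plus written) per second, divided by the peak DRAM bandwidth. The function name, its parameters, and the numbers in the example are hypothetical and only for illustration. This is my own definition, and I’m not claiming it is how ncu computes its pct_of_peak metrics.

```python
def dram_bw_utilization(bytes_read, bytes_written, elapsed_s, peak_bytes_per_s):
    """Return the fraction of peak DRAM bandwidth used over an interval.

    All arguments are hypothetical inputs for illustration:
      bytes_read / bytes_written: total DRAM traffic in bytes
      elapsed_s: wall-clock duration of the interval in seconds
      peak_bytes_per_s: peak DRAM bandwidth of the GPU in bytes/second
    """
    if elapsed_s <= 0 or peak_bytes_per_s <= 0:
        raise ValueError("elapsed_s and peak_bytes_per_s must be positive")
    # Achieved bandwidth counts both directions of DRAM traffic.
    achieved_bytes_per_s = (bytes_read + bytes_written) / elapsed_s
    return achieved_bytes_per_s / peak_bytes_per_s


# Made-up numbers: 600 GB read and 200 GB written in 1 s on a GPU
# with an assumed 1.6 TB/s peak gives half of peak bandwidth.
utilization = dram_bw_utilization(600e9, 200e9, 1.0, 1.6e12)
print(f"{utilization:.1%}")
```

If the two ncu metrics above measure something different from this ratio, that is part of what I’d like to understand.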