Ncu takes too long

Hello, I’m using ncu_cli (Nsight Compute) to measure the memory bandwidth utilization of a decoder-only generation model, with the command below.

/usr/local/cuda-12.1/nsight-compute-2023.1.1/ncu --log-file $SAVE_DIR/profile_stat.txt --metrics dram__bytes_read.sum.pct_of_peak_sustained_elapsed,dram__bytes_write.sum.pct_of_peak_sustained_elapsed --print-summary per-kernel -o $SAVE_DIR/ncu_memory_bw_profile --nvtx --nvtx-include "hello/" --replay-mode app-range -f python …

The weird thing is that it takes far too long (4-5 hours) to profile just a single decoding step (one pass through 32 decoder layers). I tried filtering out unnecessary work by restricting profiling to an NVTX range and by changing the replay mode, but neither helped significantly.

I have two questions:

  1. Are those metrics (dram__bytes_read|write.sum.pct_of_peak_sustained_elapsed) the right indicator for memory bandwidth utilization?
  2. How can I speed up the profiling? Or is there a more convenient way to measure GPU memory bandwidth utilization?
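
To make clear what I mean by "memory bandwidth utilization": bytes moved to and from DRAM, divided by elapsed time, as a percentage of the peak bandwidth. Here is a minimal sketch of that arithmetic. It assumes you already have raw byte counts and a kernel duration (for example from dram__bytes_read.sum, dram__bytes_write.sum and gpu__time_duration.sum). The function name and the peak bandwidth value are hypothetical, and this is only an approximation of what ncu reports as pct_of_peak_sustained_elapsed, not its exact definition.

```python
def dram_bw_utilization(bytes_read, bytes_written, duration_s, peak_bytes_per_s):
    """Return achieved DRAM bandwidth as a percentage of an assumed peak.

    bytes_read / bytes_written: raw DRAM traffic in bytes for the measured range.
    duration_s: elapsed time of the measured range in seconds.
    peak_bytes_per_s: peak DRAM bandwidth of the GPU in bytes/second
                      (an assumption; take it from your GPU's datasheet).
    """
    if duration_s <= 0 or peak_bytes_per_s <= 0:
        raise ValueError("duration_s and peak_bytes_per_s must be positive")
    achieved_bytes_per_s = (bytes_read + bytes_written) / duration_s
    return 100.0 * achieved_bytes_per_s / peak_bytes_per_s


if __name__ == "__main__":
    # Hypothetical numbers: 6 GB read and 2 GB written in 10 ms
    # on a GPU with an assumed 1 TB/s peak gives roughly 80%.
    pct = dram_bw_utilization(6e9, 2e9, 0.01, 1e12)
    print(f"DRAM bandwidth utilization: {pct:.1f}%")
```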

I’m doing almost the same thing, and it took 1 day.