Hello experts,
While profiling cuBLAS with Nsight Compute (ncu), I found that the overhead, measured as sm__cycles_elapsed.max minus sm__cycles_active.max, suddenly increases.
Test case:
M and N vary (written MN below), K = 512
When MN <= 19456, the overhead stays at around 2000 cycles.
When MN >= 20480, the overhead rises to more than 50000 cycles, which is up to about 0.5% of sm__cycles_elapsed.max.
Test command:
ncu --metrics sm__cycles_elapsed.max,sm__cycles_active.max --cache-control all --clock-control base
MN = 19456 K = 512
elapsed.max - active.max = 2545
MN = 20480 K = 512
elapsed.max - active.max = 52432
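To make the definition of "overhead" used here explicit, below is a minimal Python sketch of the arithmetic. The function names `overhead_cycles` and `overhead_fraction` are my own and are not part of ncu; the only numbers used are the two differences reported above.

```python
def overhead_cycles(elapsed_max: int, active_max: int) -> int:
    """Overhead as defined in this post: elapsed.max minus active.max."""
    if active_max > elapsed_max:
        raise ValueError("active cycles cannot exceed elapsed cycles")
    return elapsed_max - active_max


def overhead_fraction(elapsed_max: int, active_max: int) -> float:
    """Overhead as a fraction of sm__cycles_elapsed.max."""
    return overhead_cycles(elapsed_max, active_max) / elapsed_max


# The two overheads reported in this post, keyed by MN (K = 512 in both runs).
reported = {19456: 2545, 20480: 52432}

# Ratio between the overhead at MN = 20480 and at MN = 19456.
jump = reported[20480] / reported[19456]
print(f"overhead ratio between MN = 20480 and MN = 19456: {jump:.1f}x")
```

The raw sm__cycles_elapsed.max and sm__cycles_active.max values are in the attached reports, so the fraction helper is only meaningful once those are plugged in.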
More detail:
ncu --set full --metrics sm__cycles_elapsed.max,sm__cycles_active.max --cache-control all --clock-control base -o mn19456 ./CublassTest
ncu --set full --metrics sm__cycles_elapsed.max,sm__cycles_active.max --cache-control all --clock-control base -o mn20480 ./CublassTest
ncu.zip (1.0 MB)
Issue:
- Why does the overhead suddenly increase? No such jump was seen when MN <= 19456.
- Related thread: How can I measure kernel launch overhead using ncu - #6 by Greg
timeline -->
FE    [1][2][3]                          [8]
SCHED          [4]
CWD               [5]                 [7]
SM                   [6--------------]
From that post, it appears that the GPU time reported by ncu covers steps 3-7 of this timeline. Does a measurement taken with CUDA events then cover steps 1-7?