Why does the “overhead” suddenly increase when testing cublas?

Hello experts,
While profiling cuBLAS with ncu, I found that the overhead (sm__cycles_elapsed.max minus sm__cycles_active.max) suddenly increases.
Test case:
MN varies, K = 512
When MN <= 19456, the overhead is around 2000 cycles.
When MN >= 20480, the overhead is roughly 0.5% of sm__cycles_elapsed.max, which is more than 50000 cycles.

Test command:

ncu --metrics sm__cycles_elapsed.max,sm__cycles_active.max --cache-control all --clock-control base

MN = 19456 K = 512


elapsed.max - active.max = 2545

MN = 20480 K = 512


elapsed.max - active.max = 52432
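
To make the size of the jump concrete, here is a minimal sketch of the arithmetic in Python. The two differences are the values reported above; the helper names are only illustrative, and the implied elapsed cycle count is a back-of-envelope estimate that assumes the 0.5% figure holds exactly.

```python
def overhead_cycles(elapsed_max, active_max):
    """Difference used as "overhead" in this post:
    sm__cycles_elapsed.max - sm__cycles_active.max."""
    return elapsed_max - active_max


def overhead_share(elapsed_max, active_max):
    """Overhead as a fraction of sm__cycles_elapsed.max."""
    return overhead_cycles(elapsed_max, active_max) / elapsed_max


# Differences reported in this post (cycles).
OVERHEAD_MN_19456 = 2545
OVERHEAD_MN_20480 = 52432

# How much larger the overhead is at MN = 20480.
jump = OVERHEAD_MN_20480 / OVERHEAD_MN_19456

# Assumption: overhead is exactly 0.5% of elapsed.max at MN = 20480.
implied_elapsed = OVERHEAD_MN_20480 / 0.005

print(f"overhead grows by about {jump:.1f}x")
print(f"implied elapsed.max is about {implied_elapsed:,.0f} cycles")
```

So the overhead grows by roughly a factor of 20 between two neighbouring problem sizes, while staying a small share of the total elapsed cycles.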

More detail (full reports):

ncu --set full --metrics sm__cycles_elapsed.max,sm__cycles_active.max --cache-control all --clock-control base -o mn19456 ./CublassTest
ncu --set full --metrics sm__cycles_elapsed.max,sm__cycles_active.max --cache-control all --clock-control base -o mn20480 ./CublassTest

ncu.zip (1.0 MB)

Questions:

  1. Why does the overhead suddenly increase? No such jump appears when MN is 19456 or smaller.

  2. The thread "How can I measure kernel launch overhead using ncu" (post #6 by Greg) gives the following timeline:

timeline -->
FE          [1][2][3]                           [8]
SCHED                [4]
CWD                     [5]                  [7]
SM                          [6--------------]

According to that post, the GPU time reported by ncu covers steps 3-7. Does timing with CUDA events then cover steps 1-7?
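
To show what the question is asking, here is a small Python sketch of the two measurement windows. The per-step durations are hypothetical placeholders, not measured values, and the assumption that CUDA events span steps 1-7 is exactly the point being asked about, not a confirmed fact.

```python
# Hypothetical duration of each numbered timeline step, in microseconds.
# These values are placeholders chosen only to illustrate the arithmetic.
STEP_US = {1: 2.0, 2: 1.0, 3: 1.5, 4: 0.5, 5: 0.5, 6: 100.0, 7: 0.5, 8: 1.0}


def span_us(first, last, steps=STEP_US):
    """Total time of the consecutive steps first..last (inclusive)."""
    return sum(steps[i] for i in range(first, last + 1))


# Window attributed to ncu in the linked post: steps 3-7.
ncu_window = span_us(3, 7)

# Window assumed here for CUDA events: steps 1-7 (unconfirmed).
event_window = span_us(1, 7)

# If both windows are right, the gap is the time spent in steps 1-2.
extra = event_window - ncu_window

print(f"ncu window:   {ncu_window} us")
print(f"event window: {event_window} us")
print(f"difference:   {extra} us")
```

If the assumption holds, the difference between a CUDA event measurement and the ncu GPU time would be the front-end work in steps 1 and 2.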

Moved to the Nsight Compute forum.