OK. I ran a test with a large matrix multiplication in two scenarios:
1) cycles,time,sass
void MatrixMulCUDA<32>(float*, float*, float*, int, int), 2022-Jan-26 15:36:41, Context 1, Stream 13
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
gpc__cycles_elapsed.avg cycle 3,459,795.33
gpu__time_duration.sum msecond 2.41
smsp__sass_inst_executed.sum inst 183,468,032
---------------------------------------------------------------------- --------------- ------------------------------
void MatrixMulCUDA<32>(float*, float*, float*, int, int), 2022-Jan-26 15:36:41, Context 1, Stream 13
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
gpc__cycles_elapsed.avg cycle 3,458,849.17
gpu__time_duration.sum msecond 2.40
smsp__sass_inst_executed.sum inst 183,468,032
---------------------------------------------------------------------- --------------- ------------------------------
and 2) cycles,time
void MatrixMulCUDA<32>(float*, float*, float*, int, int), 2022-Jan-26 15:36:52, Context 1, Stream 13
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
gpc__cycles_elapsed.avg cycle 3,450,932
gpu__time_duration.sum msecond 2.40
---------------------------------------------------------------------- --------------- ------------------------------
void MatrixMulCUDA<32>(float*, float*, float*, int, int), 2022-Jan-26 15:36:52, Context 1, Stream 13
Section: Command line profiler metrics
---------------------------------------------------------------------- --------------- ------------------------------
gpc__cycles_elapsed.avg cycle 3,450,957.83
gpu__time_duration.sum msecond 2.40
---------------------------------------------------------------------- --------------- ------------------------------
So it seems that gpu__time_duration.sum is robust in this example. I don't know whether collecting more SASS metrics or increasing the problem size would change that.
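To put a rough number on "robust", here is a small sketch that compares the two scenarios using only the values pasted above. The variable and function names are mine, and with just two launches per scenario this is an indication, not a proper measurement.

```python
# Values copied from the profiler output above (two kernel launches per scenario).
with_sass = {"cycles": [3_459_795.33, 3_458_849.17], "time_ms": [2.41, 2.40]}
without_sass = {"cycles": [3_450_932.0, 3_450_957.83], "time_ms": [2.40, 2.40]}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def relative_overhead(metric):
    """Fractional increase of the mean when the SASS metric is also collected."""
    base = mean(without_sass[metric])
    return (mean(with_sass[metric]) - base) / base


for metric in ("cycles", "time_ms"):
    print(f"{metric}: {relative_overhead(metric):.3%}")
```

Both differences come out well below 1%, which is in line with the duration looking unchanged at the two-decimal precision the profiler prints.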
P.S.: I also ran with smsp__sass_inst_executed and smsp__inst_executed, and they were equal.