From reading the trtexec source code, here are some findings:
For each iteration,
trtexec records a set of timestamps in an array called
mEvents, where the timestamps are named
kINPUT_S, kINPUT_E, kCOMPUTE_S, kCOMPUTE_E, kOUTPUT_S, and kOUTPUT_E. These events appear to be cudaEvents from the CUDA runtime.
- S/E means start/end.
- kCOMPUTE seems to refer to GPU computation.
- kINPUT/kOUTPUT seem to refer to the time cudaMemcpy takes to copy the input data to the GPU and the output data back to the host.
GPU Latency is simply
kCOMPUTE_E - kCOMPUTE_S.
Host Latency is simply
(kINPUT_E - kINPUT_S) + (kCOMPUTE_E - kCOMPUTE_S) + (kOUTPUT_E - kOUTPUT_S)
When the three stages run back to back with no gaps between them, this is the same as
kOUTPUT_E - kINPUT_S
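To make the two formulas concrete, here is a minimal sketch that computes both latencies from one iteration's timestamps. The dictionary, the helper names, and the millisecond values are made up for illustration; trtexec itself records these as CUDA events in C++, not as Python floats.

```python
# Hypothetical timestamps (in ms) for a single iteration; the values are invented.
events = {
    "kINPUT_S": 0.0, "kINPUT_E": 0.5,      # host-to-device copy
    "kCOMPUTE_S": 0.5, "kCOMPUTE_E": 2.5,  # GPU computation
    "kOUTPUT_S": 2.5, "kOUTPUT_E": 2.75,   # device-to-host copy
}

def gpu_latency(ev):
    """GPU Latency: compute end minus compute start."""
    return ev["kCOMPUTE_E"] - ev["kCOMPUTE_S"]

def host_latency(ev):
    """Host Latency: input copy + compute + output copy."""
    input_time = ev["kINPUT_E"] - ev["kINPUT_S"]
    output_time = ev["kOUTPUT_E"] - ev["kOUTPUT_S"]
    return input_time + gpu_latency(ev) + output_time

print("GPU Latency (ms):", gpu_latency(events))
print("Host Latency (ms):", host_latency(events))
```

In this example the stages are back to back, so the summed Host Latency matches kOUTPUT_E - kINPUT_S. If there were idle gaps between the stages, the sum would come out smaller than that span.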
Enqueue Time comes from an entirely separate set of timestamps, known as
mEnqueueTimes. It is the timestamp at the end of the enqueue call minus the timestamp at its start.
"Average on 10 runs" means that each of the above statistics is printed as its sum over 10 iterations divided by 10. It is just an arithmetic mean.
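The averaging can be sketched the same way. The per-iteration numbers below are invented, and the list names are hypothetical stand-ins rather than trtexec's actual data structures; the point is only that each reported figure is a mean over a window of 10 iterations.

```python
def average(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

# Hypothetical GPU latencies (ms) for 10 consecutive iterations.
gpu_latencies = [2.0, 2.5, 2.0, 1.5, 2.0, 2.5, 2.0, 1.5, 2.0, 2.0]

# Hypothetical (start, end) enqueue timestamps (ms), one pair per iteration.
enqueue_pairs = [(i * 3.0, i * 3.0 + 0.25) for i in range(10)]
# Enqueue time per iteration is the end timestamp minus the start timestamp.
enqueue_times = [end - start for start, end in enqueue_pairs]

print("Average GPU Latency on 10 runs (ms):", average(gpu_latencies))
print("Average Enqueue Time on 10 runs (ms):", average(enqueue_times))
```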