TensorRT trtexec.exe profiling tool - GPU vs Host latency

Hello,

I used the trtexec.exe profiling tool and got lines like the following:

[02/16/2021-18:15:54] [I] Average on 10 runs - GPU latency: 6.32176 ms - Host latency: 6.44522 ms (end to end 12.4829 ms, enqueue 1.09462 ms)

My question is: what exactly do these latencies refer to? What is the difference between the GPU latency, the host latency, the end-to-end latency, and the enqueue latency?

Thanks!

Hi @jeremie.gringet,

Please refer to the following.

[V] === Explanations of the performance metrics ===
[V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[V] End-to-End Host Latency: the duration from when the H2D of a query is called to when the D2H of the same query is completed, which includes the latency to wait for the completion of the previous query. This is the latency of a query if multiple queries are enqueued consecutively.
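
Applied to the log line in the question, these definitions give a quick sanity check. A minimal worked example, assuming the reported "GPU latency" corresponds to GPU Compute Time and "Host latency" to the Latency metric:

    Host Latency  =  H2D Latency + GPU Compute Time + D2H Latency
    6.44522 ms    =  H2D + 6.32176 ms + D2H
    =>  H2D + D2H ≈  0.123 ms of transfer overhead per query

    End-to-End Host Latency = 12.4829 ms ≈ 2 x Host Latency,
    consistent with each query also waiting for the previous
    query to complete when queries are enqueued back to back.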

Thank you.


By reading the code of trtexec, here are some findings:

For each iteration, trtexec records a set of timestamps in an array called mEvents, where the timestamps are named kCOMPUTE_S, kCOMPUTE_E, kINPUT_S, kINPUT_E, kOUTPUT_S, and kOUTPUT_E. These appear to be cudaEvent_t objects from the CUDA runtime.

  1. S/E means start/end.
  2. kCOMPUTE seems to refer to GPU computation.
  3. kINPUT/kOUTPUT seem to refer to the timing of the cudaMemcpy calls that copy input and output data back and forth (a sketch of this pattern follows the list).
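
Here is a minimal sketch of that timing pattern using the CUDA runtime API directly. The event names mirror trtexec's, but the kernel, buffer names, and sizes below are made-up stand-ins for the engine's actual work, not trtexec's code:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Hypothetical kernel standing in for the TensorRT engine's compute work.
    __global__ void dummyInfer(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *hBuf, *dBuf;
        cudaMallocHost(&hBuf, bytes);  // pinned host memory so async copies are truly async
        cudaMalloc(&dBuf, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // One cudaEvent_t per timestamp, analogous to trtexec's mEvents.
        cudaEvent_t inS, inE, compS, compE, outS, outE;
        cudaEventCreate(&inS);   cudaEventCreate(&inE);
        cudaEventCreate(&compS); cudaEventCreate(&compE);
        cudaEventCreate(&outS);  cudaEventCreate(&outE);

        cudaEventRecord(inS, stream);                                // kINPUT_S
        cudaMemcpyAsync(dBuf, hBuf, bytes, cudaMemcpyHostToDevice, stream);
        cudaEventRecord(inE, stream);                                // kINPUT_E

        cudaEventRecord(compS, stream);                              // kCOMPUTE_S
        dummyInfer<<<(n + 255) / 256, 256, 0, stream>>>(dBuf, n);
        cudaEventRecord(compE, stream);                              // kCOMPUTE_E

        cudaEventRecord(outS, stream);                               // kOUTPUT_S
        cudaMemcpyAsync(hBuf, dBuf, bytes, cudaMemcpyDeviceToHost, stream);
        cudaEventRecord(outE, stream);                               // kOUTPUT_E

        cudaStreamSynchronize(stream);

        float h2d, gpu, d2h, e2e;
        cudaEventElapsedTime(&h2d, inS, inE);     // H2D latency
        cudaEventElapsedTime(&gpu, compS, compE); // GPU latency
        cudaEventElapsedTime(&d2h, outS, outE);   // D2H latency
        cudaEventElapsedTime(&e2e, inS, outE);    // end to end for this single query
        printf("GPU %.3f ms, Host %.3f ms, end-to-end %.3f ms\n",
               gpu, h2d + gpu + d2h, e2e);
        // (cleanup of events, stream, and buffers omitted for brevity)
        return 0;
    }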

GPU Latency is simply kCOMPUTE_E - kCOMPUTE_S.
Host Latency is (kINPUT_E - kINPUT_S) + (kCOMPUTE_E - kCOMPUTE_S) + (kOUTPUT_E - kOUTPUT_S).
End-to-end is kOUTPUT_E - kINPUT_S.
Enqueue comes from an entirely separate set of host-side timestamps, called mEnqueueTimes: it is the end-of-enqueue timestamp minus the start-of-enqueue timestamp.
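
Because enqueueing is asynchronous, the enqueue time can be much shorter than the GPU latency (1.09 ms vs. 6.32 ms in the log line above): the call returns as soon as the work is queued on the stream, not when it finishes. Here is a sketch of that host-side measurement, assuming an existing execution context, bindings, and stream (std::chrono is used for illustration; trtexec has its own host timer):

    #include <chrono>
    #include <NvInfer.h>

    // Time how long the host spends enqueueing one query (not executing it).
    double timeEnqueueMs(nvinfer1::IExecutionContext* context,
                         void** bindings, cudaStream_t stream) {
        auto t0 = std::chrono::high_resolution_clock::now();
        context->enqueueV2(bindings, stream, nullptr);  // returns once work is queued
        auto t1 = std::chrono::high_resolution_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }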

The "Average on 10 runs" prefix means that each of the above statistics is printed as the sum over 10 iterations divided by 10, i.e., a plain arithmetic mean.
