Trtexec profiling summary explanation

Dear Sir,

I ran model using trtexec wrapper and found below profile summary:

=== Performance summary ===
[06/01/2022-06:42:46] [I] Throughput: 92.0084 qps
[06/01/2022-06:42:46] [I] Latency: min = 10.542 ms, max = 14.997 ms, mean = 11.0186 ms, median = 10.647 ms, percentile(99%) = 13.6331 ms
[06/01/2022-06:42:46] [I] End-to-End Host Latency: min = 20.1821 ms, max = 26.5759 ms, mean = 21.2473 ms, median = 20.4951 ms, percentile(99%) = 26.2098 ms
[06/01/2022-06:42:46] [I] Enqueue Time: min = 0.324219 ms, max = 2.18774 ms, mean = 1.28381 ms, median = 1.31104 ms, percentile(99%) = 2.14038 ms
[06/01/2022-06:42:46] [I] H2D Latency: min = 0.101318 ms, max = 0.274658 ms, mean = 0.155005 ms, median = 0.13147 ms, percentile(99%) = 0.249268 ms
[06/01/2022-06:42:46] [I] GPU Compute Time: min = 10.369 ms, max = 14.8224 ms, mean = 10.8245 ms, median = 10.415 ms, percentile(99%) = 13.4738 ms
[06/01/2022-06:42:46] [I] D2H Latency: min = 0.0361328 ms, max = 0.0592041 ms, mean = 0.0390707 ms, median = 0.0380859 ms, percentile(99%) = 0.0585938 ms
[06/01/2022-06:42:46] [I] Total Host Walltime: 3.0432 s
[06/01/2022-06:42:46] [I] Total GPU Compute Time: 3.03087 s
[06/01/2022-06:42:46] [I] Explanations of the performance metrics are printed in the verbose logs.

I want to understand the terms Total GPU Compute Time: 3.03087 s and GPU Compute Time: = 13.4738 ms

Which should be considered the correct term to calculate fps for running a single frame?

Thanks and Regards,
Vyom Mishra

Please refer to the below link for Sample guide.

Refer to the installation steps from the link if in case you are missing on anything

However suggested approach is to use TRT NGC containers to avoid any system dependency related issues.

In order to run python sample, make sure TRT python packages are installed while using NGC container.

In case, if you are trying to run custom model, please share your model and script with us, so that we can assist you better.

Dear Sir,

Thanks for the reply!

My question is related to the term
“Total GPU compute time” and “Compute time”.

Both have different performance timings in the logs shared above
Total GPU Compute Time: 3.03087 s and GPU Compute Time: = 13.4738 ms

Which needs to be considered as GPU compute time for the single frame?

Vyom Mishra

Dear Sir,

I am sharing you the layer-wise profiling of the model.

Profile (289 iterations ) ===

Model layers and structure is kept confidential.

You can find the final sum of profiling time.

I want to understand, when i have used the flag " --avgRuns=1" then why iteration in the logs is Profile (289 iterations)

this model ran with single input with single iteration and gives performance as

Please let me confirm , which is the correct profiling result either 3sec or 11ms?
And I want to understand why iteration is 289 when the model is provide with the flag to restrict the inference repetition.

Vyom Mishra

Dear @NVES ,

Gentle Reminder!
Kindly help.

Vyom Mishra


Please find the following info regarding performance metrics, you can get this using --verbose option with trtexec command.
=== Explanations of the performance metrics ===
Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
GPU Compute Time: the GPU latency to execute the kernels for a query.
Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
End-to-End Host Latency: the duration from when the H2D of a query is called to when the D2H of the same query is completed, which includes the latency to wait for the completion of the previous query. This is the latency of a query if multiple queries are enqueued consecutively.

Regarding your other query on layer-wise profiling, let us get back to you.

Thank you.

Dear @spolisetty ,

Thanks for the reply!

I have gone through the definition of the performance terms explained above.

I have a doubt in understanding it.

Can you please explain what does query means here in the definition?
I want to know if I am running a model for single input once, then which one is the correct performance parameter to find the profiling of the model is it Total GPU Compute time or GPU Compute time?

Vyom Mishra

This would be GPU Compute time. “Query” refers to a single inference (forward) execution. E.g. for resnet, if you run an inference with input (1,3,224,224), that’s a query. If you use (1024,3,224,224) for input, that’s also a query. So sometimes it is useful to normalize QPS (queries per second): normalized_qps = qps / BS.
Please refer to the following doc for more details.

Thank you.