I read the trtexec --help output, but I would like some clarification about the data collected by trtexec.
To work with the trtexec profiling data I used the following option:
--exportTimes=<file> Write the timing results in a json file (default = disabled)
Then I used the related script to extract data.
From the trace.json I get an array of records with the following fields:
startInMs, endInMs, startComputeMs, endComputeMs, startOutMs, endOutMs, inMs, computeMs, outMs, latencyMs, endToEndMs
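For reference, here is a minimal Python sketch of how such a file could be summarized. It assumes the exported file is a JSON array of objects keyed by the field names listed above; the helper names (load_times, summarize_times) and the sample values are made up for illustration and are not part of trtexec.

```python
import json
import statistics

def load_times(path):
    """Load the exported timing file (assumed: a JSON array of objects)."""
    with open(path) as f:
        return json.load(f)

def summarize_times(records,
                    fields=("inMs", "computeMs", "outMs", "latencyMs", "endToEndMs")):
    """Return {field: (mean, median)} computed over all records."""
    summary = {}
    for field in fields:
        values = [float(r[field]) for r in records]
        summary[field] = (statistics.fmean(values), statistics.median(values))
    return summary

# Made-up sample records, only to show the expected shape of the data.
sample = [
    {"inMs": 0.10, "computeMs": 2.0, "outMs": 0.05, "latencyMs": 2.15, "endToEndMs": 4.0},
    {"inMs": 0.12, "computeMs": 2.2, "outMs": 0.05, "latencyMs": 2.37, "endToEndMs": 4.2},
]
summary = summarize_times(sample)
for field, (mean, median) in summary.items():
    print(f"{field}: mean={mean:.3f} ms, median={median:.3f} ms")
```

With a real export, summarize_times(load_times("trace.json")) would produce the same kind of table, and len(load_times("trace.json")) gives the number of recorded lines.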
“inMs”: time to transfer the input data to GPU memory.
“computeMs”: time the GPU takes to compute one batch.
“endToEndMs” = “inMs” + “computeMs” + “outMs”
What is the difference between “computeMs” and “latencyMs” at this point?
Throughput would be equal to computeMs/batch_size.
Latency would be equal to computeMs.
I did not specify the “--iterations” option, and the default is to run “at least” 10 iterations.
I did not specify the “--avgRuns” option, and by default the measurements are averaged over 10 consecutive iterations.
I get 4065 lines of measurements.
To be sure: does each line correspond to an averaged measurement, and did trtexec run 4065 iterations?
I also tried to extract a profile.json.
--exportProfile=<file> Write the profile information per layer in a json file (default = disabled)
I work with dynamic shapes, so I specified minShapes, optShapes and maxShapes for building the profile.
trtexec runs without error, but the profile.json file stays empty (I checked the filename and path).
Also I want to be sure the profiling data was obtained with CUDA GPU profiling and not a CPU absolute clock.
Hope the following helps:
=== Explanations of the performance metrics ===
- GPU Compute Time: the GPU latency to execute the kernels for a query.
- Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
- Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
- Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
- H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
- D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
- Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
- End-to-End Host Latency: the duration from when the H2D of a query is called to when the D2H of the same query is completed, which includes the latency to wait for the completion of the previous query. This is the latency of a query if multiple queries are enqueued consecutively.
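The definitions above can be sketched in Python against one exported timing record. This is only an illustration under stated assumptions: it assumes that “inMs”, “computeMs” and “outMs” correspond to H2D Latency, GPU Compute Time and D2H Latency, and that End-to-End Host Latency corresponds to endOutMs minus startInMs. The function names and the sample numbers are hypothetical.

```python
def derive_metrics(rec):
    """Recompute Latency and End-to-End Host Latency for one record (in ms)."""
    # Latency: H2D Latency + GPU Compute Time + D2H Latency.
    latency_ms = rec["inMs"] + rec["computeMs"] + rec["outMs"]
    # End-to-End: from the start of H2D to the end of D2H of the same query,
    # which also includes any time spent waiting on the previous query.
    end_to_end_ms = rec["endOutMs"] - rec["startInMs"]
    return latency_ms, end_to_end_ms

def throughput_qps(num_queries, total_host_walltime_s):
    """Throughput: number of queries divided by Total Host Walltime (seconds)."""
    return num_queries / total_host_walltime_s

# Made-up record: the compute stage starts 1.5 ms after the input copy ends,
# so the end-to-end value is larger than the plain latency sum.
rec = {
    "startInMs": 10.0, "endInMs": 10.5,
    "startComputeMs": 12.0, "endComputeMs": 14.0,
    "startOutMs": 14.0, "endOutMs": 14.25,
    "inMs": 0.5, "computeMs": 2.0, "outMs": 0.25,
}
latency_ms, end_to_end_ms = derive_metrics(rec)
qps = throughput_qps(num_queries=100, total_host_walltime_s=0.5)
print(latency_ms, end_to_end_ms, qps)
```

In this made-up record the end-to-end value exceeds the latency because of the gap between the end of the input transfer and the start of compute, which is why the two metrics are generally not equal.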
Regarding exportProfile, please share the model and the command you are running so that we can assist you better.
Please refer to the installation steps from the link below in case anything is missing from your setup.
However, the suggested approach is to use the TRT NGC containers to avoid any issues related to system dependencies.
To run the Python samples, make sure the TRT Python packages are installed when using the NGC container.