Performance metrics of Native TensorRT 5.0 using Python API

Hi all, I am running the sample uff_resnet50.py to test the performance of TensorRT 5 inference using the Python API. The script only predicts the label, but it doesn't report performance metrics such as throughput (img/sec) or latency.

To calculate the latency, I modified the do_inference method as below:

def do_inference(context, h_input, d_input, h_output, d_output, stream):
    # Transfer input data to the GPU.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    # Run inference (timing added; requires `import time` at the top of the script).
    tstart = time.time()
    context.execute_async(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    timing = time.time() - tstart

To calculate the throughput (img/sec) as below, I need the batch_size, but I don't see such a parameter in the script:

throughput (img/sec) = batch_size / timing

Could you please recommend how to extract throughput (img/sec)?

Hello, because execute_async() is used, I'm not sure you can assume the execution time is simply the wall-clock time between the start and return of the execute_async() call; the call returns before the GPU work completes. You'd need to instrument timing around the callback/handle.

Other than that, yes, I agree you can calculate img/sec = batch_size/timing
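
For example, one generic option (a rough sketch, not part of the sample, assuming the pycuda stream, context, and device buffers from uff_resnet50.py) is to bracket the GPU work with CUDA events recorded on the same stream:

import pycuda.driver as cuda

# Rough sketch: record CUDA events on the same stream so the measured interval
# covers the queued GPU work, not just the (asynchronous) launch.
start, end = cuda.Event(), cuda.Event()
start.record(stream)
context.execute_async(bindings=[int(d_input), int(d_output)],
                      stream_handle=stream.handle)
end.record(stream)
end.synchronize()                 # wait until the GPU reaches the `end` event
timing_ms = start.time_till(end)  # elapsed GPU time in milliseconds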

Hi NVES, the script doesn't expose a batch size parameter, so I can't calculate the throughput as stated above. What do you recommend?

Please reference https://docs.nvidia.com/deeplearning/sdk/tensorrt-api/python_api/infer/Core/ExecutionContext.html

The batch size is passed to execute_async().
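
For reference, a minimal sketch of the call with the batch size passed explicitly (it defaults to 1 if omitted); the buffer and stream names are the ones from the sample:

context.execute_async(batch_size=batch_size,
                      bindings=[int(d_input), int(d_output)],
                      stream_handle=stream.handle)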

Hi NVES, I have handled the batch size parameter with

builder.max_batch_size = batch_size

On the other hand, I didn't find anything on how to instrument timing around the callback/handle with execute_async(). Could you please provide more specific instructions on how to implement it?

There is no simple or standard solution to measure an asynchronous call’s performance. I don’t think this is specific to execute_async().

You may want to research a generic solution, such as: Measuring execution time of asynchronous calls - Y Soft Engineering Blog
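
As a coarse but common workaround in this sample's context (a sketch, assuming the pycuda objects from uff_resnet50.py), you can include the device-to-host copy and a stream synchronization inside the timed region so the wall clock covers the queued GPU work:

import time

tstart = time.time()
context.execute_async(batch_size=batch_size,
                      bindings=[int(d_input), int(d_output)],
                      stream_handle=stream.handle)
cuda.memcpy_dtoh_async(h_output, d_output, stream)
stream.synchronize()              # block until all queued GPU work is done
timing = time.time() - tstart
print("throughput: %.1f img/sec" % (batch_size / timing))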

TensorRT 5.0 is NVIDIA’s high-performance deep learning inference optimizer and runtime, widely used to accelerate AI workloads on GPUs. When evaluating its performance metrics using the Python API, several factors come into play, including inference speed, latency, memory consumption, and hardware efficiency.

Key Performance Metrics to Consider

1. Inference Speed (Throughput - FPS / Images Per Second)

  • TensorRT significantly boosts inference performance by optimizing the model for the GPU.
  • Benchmarks often show 2x-6x speedup over traditional deep learning frameworks like TensorFlow and PyTorch.
  • The exact throughput depends on model complexity and GPU type (e.g., Tesla V100, RTX 3090, or A100); a simple measurement sketch follows this list.
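
A minimal way to estimate throughput for a sample like uff_resnet50.py (a sketch; measure_throughput is a hypothetical helper, not part of the sample) is to time many batched executions after a warm-up:

import time

def measure_throughput(context, bindings, stream, batch_size, iters=100, warmup=10):
    # Hypothetical helper: averages many batched executions after a warm-up,
    # synchronizing the stream so the GPU work is actually counted.
    for _ in range(warmup):
        context.execute_async(batch_size=batch_size, bindings=bindings,
                              stream_handle=stream.handle)
    stream.synchronize()
    tstart = time.time()
    for _ in range(iters):
        context.execute_async(batch_size=batch_size, bindings=bindings,
                              stream_handle=stream.handle)
    stream.synchronize()
    return (iters * batch_size) / (time.time() - tstart)  # images per second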

2. Latency (Milliseconds per Inference)

  • TensorRT reduces inference time by optimizing kernel execution and memory management.
  • Using FP16 precision can cut latency by 40-50% compared to FP32 without significant accuracy loss.
  • INT8 quantization can further reduce latency but requires calibration to maintain accuracy; see the builder-flag sketch after this list.
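
In TensorRT 5's Python API, the precision modes are enabled on the builder before the engine is created. A minimal sketch (assuming a trt.Builder set up as in the UFF samples; the INT8 calibrator is a hypothetical object you would have to supply):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
builder.max_batch_size = 16
builder.max_workspace_size = 1 << 30     # scratch space for tactic selection
if builder.platform_has_fast_fp16:
    builder.fp16_mode = True             # allow FP16 kernels
# INT8 additionally requires a calibrator:
# builder.int8_mode = True
# builder.int8_calibrator = my_calibrator  # hypothetical calibrator instance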

3. Memory Usage (GPU RAM Consumption)

  • TensorRT minimizes memory footprint through kernel fusion and optimized tensor allocation.
  • Lower precision (FP16/INT8) models consume less VRAM compared to FP32 models.
  • Dynamic batch sizes can help optimize memory usage depending on workload demands; a rough way to observe VRAM consumption is sketched below.
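
To get a rough idea of VRAM consumption, you can query free device memory before and after creating the engine and context (a sketch using pycuda; other allocations in the process will also affect the numbers):

import pycuda.driver as cuda
import pycuda.autoinit  # creates/activates a CUDA context

free_before, total = cuda.mem_get_info()
# ... deserialize the engine and create the execution context here ...
free_after, _ = cuda.mem_get_info()
print("engine + context use roughly %.1f MiB of VRAM"
      % ((free_before - free_after) / (1024.0 ** 2)))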

4. Scalability & Multi-Threading Efficiency

  • TensorRT efficiently handles multiple parallel inference requests using multi-stream execution.
  • Batching requests helps maximize GPU utilization, improving overall throughput; a multi-stream sketch follows this list.
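
A rough sketch of multi-stream execution with the implicit-batch API (assuming an already-deserialized engine; d_in0/d_out0 and d_in1/d_out1 are hypothetical, separately allocated device buffers):

import pycuda.driver as cuda

# One engine, two execution contexts, each with its own stream and buffers.
stream0, stream1 = cuda.Stream(), cuda.Stream()
ctx0 = engine.create_execution_context()
ctx1 = engine.create_execution_context()

ctx0.execute_async(batch_size=8, bindings=[int(d_in0), int(d_out0)],
                   stream_handle=stream0.handle)
ctx1.execute_async(batch_size=8, bindings=[int(d_in1), int(d_out1)],
                   stream_handle=stream1.handle)
stream0.synchronize()
stream1.synchronize()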

Python API Implementation Tips for Maximum Performance

Use FP16 or INT8 Precision:

  • Convert models to FP16 or INT8 to improve speed while maintaining accuracy.

Optimize Batch Size:

  • Experiment with batch sizes to find the best balance between latency and throughput, as in the sweep sketched below.
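
For example (a sketch reusing the hypothetical measure_throughput() helper from earlier in this thread):

# Sweep batch sizes; builder.max_batch_size must cover the largest value tried.
for bs in (1, 2, 4, 8, 16, 32):
    fps = measure_throughput(context, bindings, stream, batch_size=bs)
    print("batch %2d: %7.1f img/sec, %6.2f ms/batch" % (bs, fps, 1000.0 * bs / fps))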

Use TensorRT’s Asynchronous Execution:

  • The Python API supports asynchronous execution via execute_async() (later TensorRT releases add execute_async_v2()), which lets data transfers and host-side work overlap with GPU execution.

Leverage TensorRT’s Profiler:

  • Use the trtexec command-line tool (e.g., with verbose logging) to analyze performance bottlenecks, or attach a profiler to the execution context for per-layer timings; a sketch follows.
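
If you want per-layer timings from Python, the execution context accepts a profiler. A sketch (assuming your TensorRT build exposes the built-in trt.Profiler; note that per-layer profiling is reported for the synchronous execute() call, not execute_async()):

import tensorrt as trt

# Attach the built-in profiler, then run synchronously to collect layer times.
context.profiler = trt.Profiler()
context.execute(batch_size=16, bindings=[int(d_input), int(d_output)])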

Real-World Benchmark Example

Model      Precision  GPU         Batch Size  Inference Speed (FPS)  Latency (ms)
ResNet-50  FP32       Tesla V100  16          750                    2.3
ResNet-50  FP16       Tesla V100  16          1300                   1.2
ResNet-50  INT8       Tesla V100  16          1800                   0.9

Conclusion

If you're looking to boost inference performance with TensorRT 5.0's Python API, focus on model precision, batch-size tuning, and asynchronous execution. With the right optimizations you can achieve significant improvements in latency, throughput, and memory efficiency, making TensorRT a strong choice for deploying deep learning models at scale.