GTC 2020: High-Performance Inferencing at Scale Using the TensorRT Inference Server

GTC 2020 S22418
Presenter: David Goodwin, NVIDIA
Abstract
A critical task when deploying an inferencing solution at scale is to optimize latency and throughput to meet the solution’s service level objectives. We’ll discuss some of the capabilities provided by the NVIDIA TensorRT Inference Server that you can leverage to reach these performance objectives. These capabilities include:
• Dynamic TensorFlow and ONNX model optimization using TensorRT
• Inference compute optimization using advanced scheduling and batching techniques (illustrated in the configuration sketch below)
• Model pipeline optimization that communicates intermediate results via GPU memory
• End-to-end solution optimization using system or CUDA shared memory to reduce network I/O
For all these techniques, we’ll quantify the improvements by providing performance results using the latest NVIDIA GPUs.
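As a minimal sketch of how the scheduling and batching capability is typically enabled, the model-configuration fragment below shows dynamic batching together with TensorRT acceleration of a TensorFlow model in the server's config.pbtxt format. The model name, batch sizes, queue delay, and precision mode are illustrative assumptions, not values taken from the session.

    # Hypothetical config.pbtxt for a TensorFlow SavedModel served by the inference server
    name: "resnet50_savedmodel"            # illustrative model name
    platform: "tensorflow_savedmodel"
    max_batch_size: 32

    # Dynamic batching: the server combines individual requests into larger
    # batches, waiting at most 100 microseconds to reach a preferred size.
    dynamic_batching {
      preferred_batch_size: [ 8, 16, 32 ]
      max_queue_delay_microseconds: 100
    }

    # Ask the server to optimize the TensorFlow graph with TensorRT (TF-TRT), here in FP16.
    optimization {
      execution_accelerators {
        gpu_execution_accelerator: [
          {
            name: "tensorrt"
            parameters { key: "precision_mode" value: "FP16" }
          }
        ]
      }
    }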
