GTC 2020: Accelerating GNMT Inference on GPU

GTC 2020 S21180
Presenters: Maxim Milakov, NVIDIA; Jeremy Appleyard, NVIDIA
Abstract
Google Neural Machine Translation (GNMT) is one of the benchmarks in the MLPerf inference benchmark suite, representing Seq2Seq models. The benchmark measures throughput under latency constraints. We’ll go through the challenges we faced at NVIDIA when implementing the GNMT benchmark, and how we solved them on NVIDIA GPUs using the optimized and customizable TensorRT library. You’ll learn the tricks we used to optimize the GNMT model, many of which are applicable to other auto-regressive models and to DL inference on GPUs in general.
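
To make the latency pressure on auto-regressive models concrete, below is a minimal, hypothetical NumPy sketch of batched greedy decoding. Every name in it (decoder_step, greedy_decode, the toy vocabulary and hidden sizes) is invented for illustration and is not the session's actual TensorRT implementation; the real GNMT decoder is an LSTM stack with attention. The point the sketch shows is the sequential dependency: each decode step consumes the previous step's output token, so work can be parallelized across the batch but not across time, which is why throughput must be measured under a latency constraint.

```python
import numpy as np

# Hypothetical toy stand-in for a trained decoder step: given the previous
# token IDs and a per-sequence hidden state, return logits over the
# vocabulary and an updated state. (Illustrative only; not GNMT.)
VOCAB, HIDDEN, EOS = 32000, 1024, 2

def decoder_step(prev_tokens, state, rng):
    # Random placeholder computation standing in for LSTM + attention.
    state = 0.9 * state + 0.1 * rng.standard_normal(state.shape)
    logits = rng.standard_normal((prev_tokens.shape[0], VOCAB))
    return logits, state

def greedy_decode(batch_size, max_len, rng):
    """Token-by-token greedy decoding. Each iteration depends on the
    previous one, so the time loop is inherently sequential; only the
    batch dimension is parallel."""
    tokens = np.zeros((batch_size,), dtype=np.int64)   # start token = 0
    state = np.zeros((batch_size, HIDDEN))
    finished = np.zeros((batch_size,), dtype=bool)
    outputs = []
    for _ in range(max_len):
        logits, state = decoder_step(tokens, state, rng)
        tokens = logits.argmax(axis=-1)                # greedy pick
        finished |= tokens == EOS
        outputs.append(tokens.copy())
        if finished.all():                             # early exit
            break
    return np.stack(outputs, axis=1)                   # (batch, steps)

if __name__ == "__main__":
    out = greedy_decode(batch_size=4, max_len=16, rng=np.random.default_rng(0))
    print(out.shape)
```

Because each step's GPU kernels are small and strictly ordered, per-step launch overhead and kernel efficiency dominate end-to-end latency, which is the class of problem the optimizations discussed in this session target.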
