Today I came across a talk introducing a product architecture built on 4x Nvidia DGX-1 systems, which reportedly achieves ~600 µs latency and a sustained throughput of 4 GB/s doing inference on ImageNet classification models such as ResNet-50 and Inception v3, with batch_size = 256 or 128. They claim that they feed the GPUs whatever data the GPUs want, so all 32 GPUs are >95% busy. That, they say, is how they achieve high throughput and low latency at the same time.
I find this hard to believe. I read that the Nvidia Tesla P4 and Tesla V100 can deliver 1.8 ms and 1.1 ms latencies,
respectively, at a batch size of one (https://images.nvidia.com/content/pdf/inference-technical-overview.pdf). For higher batch sizes, there is usually a significant latency penalty.
What do you think of the product in the talk, which claims ~600 µs latency at a high batch size together with high throughput?
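To make my confusion concrete, here is a rough back-of-envelope check. I am assuming raw uint8 ImageNet inputs of 224×224×3 and treating the 4 GB/s as the aggregate input bandwidth feeding all 32 GPUs; neither assumption comes from the talk, so the numbers are only a sketch:

```python
# Back-of-envelope check (my assumptions, NOT figures from the talk):
# - each ImageNet input is a 224x224x3 uint8 tensor (~150 KB)
# - 4 GB/s is the aggregate input bandwidth across all 32 GPUs

BYTES_PER_IMAGE = 224 * 224 * 3          # 150,528 bytes per image
AGGREGATE_BANDWIDTH = 4 * 1024**3        # 4 GB/s feeding the whole cluster
NUM_GPUS = 32
BATCH_SIZE = 256

images_per_sec_total = AGGREGATE_BANDWIDTH / BYTES_PER_IMAGE
images_per_sec_per_gpu = images_per_sec_total / NUM_GPUS

# Time just to *accumulate* one batch on a single GPU at that feed rate:
batch_fill_time_ms = BATCH_SIZE / images_per_sec_per_gpu * 1000

print(f"{images_per_sec_total:,.0f} images/s across the cluster")
print(f"{images_per_sec_per_gpu:,.0f} images/s per GPU")
print(f"~{batch_fill_time_ms:.0f} ms to fill one batch of {BATCH_SIZE}")
```

Under these assumptions, each GPU is fed under ~900 images/s, so just accumulating a 256-image batch would take on the order of hundreds of milliseconds, far above 600 µs. If the claim is real, "latency" presumably means something narrower than end-to-end per-batch latency, e.g. compute time once a batch is already resident on the GPU, or per-image latency in a pipelined system.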