Today I came across a talk introducing a product architecture built on 4x Nvidia DGX-1 systems, which reportedly achieves ~600 µs latency and a sustained throughput of 4 GB/s doing inference on ImageNet classification models such as ResNet-50 and Inception v3, with batch_size = 256 or 128. They claim that they feed the GPUs whatever data the GPUs want, so all 32 GPUs are >95% busy. That, they say, is how they achieve high throughput and low latency at the same time.
I find this hard to believe. I read that the Nvidia Tesla P4 and Tesla V100 can deliver 1.8 ms and 1.1 ms latencies,
respectively, at a batch size of one (https://images.nvidia.com/content/pdf/inference-technical-overview.pdf). For higher batch sizes, there is usually a significant latency penalty.
What do you think of the product in the talk, which claims ~600 µs latency at a high batch size together with high throughput?
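To make my confusion concrete, here is a rough back-of-envelope check. I am assuming raw uint8 ImageNet inputs of 224×224×3 and treating the 4 GB/s as the aggregate input bandwidth feeding all 32 GPUs; neither assumption comes from the talk, so the numbers are only a sketch:

```python
# Back-of-envelope check (my assumptions, NOT figures from the talk):
# - each ImageNet input is a 224x224x3 uint8 tensor (~150 KB)
# - 4 GB/s is the aggregate input bandwidth across all 32 GPUs

BYTES_PER_IMAGE = 224 * 224 * 3          # 150,528 bytes per image
AGGREGATE_BANDWIDTH = 4 * 1024**3        # 4 GB/s feeding the whole cluster
NUM_GPUS = 32
BATCH_SIZE = 256

images_per_sec_total = AGGREGATE_BANDWIDTH / BYTES_PER_IMAGE
images_per_sec_per_gpu = images_per_sec_total / NUM_GPUS

# Time just to *accumulate* one batch on a single GPU at that feed rate:
batch_fill_time_ms = BATCH_SIZE / images_per_sec_per_gpu * 1000

print(f"{images_per_sec_total:,.0f} images/s across the cluster")
print(f"{images_per_sec_per_gpu:,.0f} images/s per GPU")
print(f"~{batch_fill_time_ms:.0f} ms to fill one batch of {BATCH_SIZE}")
```

Under these assumptions, each GPU is fed under ~900 images/s, so just accumulating a 256-image batch would take on the order of hundreds of milliseconds, far above 600 µs. If the claim is real, "latency" presumably means something narrower than end-to-end per-batch latency, e.g. compute time once a batch is already resident on the GPU, or per-image latency in a pipelined system.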