Why does inference speedup increase with batch size in TensorRT INT8?

In http://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf, slide 31, we can see that performance improves as the batch size grows. Why does this happen?


Larger batches generally enable more efficient use of GPU resources: the fixed cost of launching kernels and loading weights is amortized over more samples, and the extra parallel work keeps more of the GPU's streaming multiprocessors busy. For example, batch sizes that are multiples of 32 may be particularly fast and efficient on V100 and Tesla T4 GPUs because TensorRT can use special kernels for matrix-multiply and fully-connected layers that leverage Tensor Cores.
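A toy cost model can make the amortization effect concrete. The sketch below is purely illustrative: the overhead and per-sample constants are hypothetical, not measured TensorRT numbers, and the model assumes per-sample compute cost stays roughly flat once the GPU is saturated.

```python
# Toy cost model (illustrative; constants are hypothetical, not measured):
# each batched inference launch pays a fixed overhead (kernel launch,
# scheduling, weight reads), plus a roughly constant per-sample compute cost.
FIXED_OVERHEAD_MS = 0.50   # hypothetical fixed cost per launch
PER_SAMPLE_MS = 0.05       # hypothetical compute time per image

def throughput(batch_size):
    """Images per second for one batched launch under the toy model."""
    latency_ms = FIXED_OVERHEAD_MS + batch_size * PER_SAMPLE_MS
    return batch_size / latency_ms * 1000.0

for b in (1, 8, 32, 128):
    print(f"batch={b:4d}  throughput={throughput(b):8.1f} img/s")
```

Under this model, throughput rises steeply at small batch sizes (the fixed overhead dominates) and then flattens as the per-sample cost takes over, which matches the shape of the curves on the slide.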