Performance issue with dynamic batching on Triton Inference Server

We have been testing Triton Inference Server for deploying models from the TensorFlow 2 model zoo. We compared the performance of EfficientDet-D1 (a small model) and EfficientDet-D7 (a large model) with and without Triton Inference Server. The models in the TensorFlow 2 model zoo do not have a dynamic batch dimension enabled by default, so we had to re-export them ourselves using the model zoo's export code.
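For reference, this is roughly how we check whether a re-exported SavedModel actually has a dynamic batch dimension (the path below is just a placeholder for our export directory):

import tensorflow as tf

# Load the re-exported SavedModel and inspect its serving signature.
# "exported/efficientdet_d1/saved_model" is a placeholder path.
model = tf.saved_model.load("exported/efficientdet_d1/saved_model")
serving_fn = model.signatures["serving_default"]

# A dynamic-batch export should report a shape like (None, 720, 1280, 3);
# a fixed leading 1 means the model only accepts batch size 1.
for name, spec in serving_fn.structured_input_signature[1].items():
    print(name, spec.shape)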

Here are our observations.

Performance on an NVIDIA T4 without Triton:

  1. EfficientDet-D1: 3.86 FPS
  2. EfficientDet-D7: 0.79 FPS

Performance on an NVIDIA T4 with Triton:

  1. EfficientDet-D1 (no dynamic batching): 5.2 FPS (GPU utilization: up to 45%)
  2. EfficientDet-D1 (dynamic batching): 4.9 FPS (GPU utilization: up to 37%)
  3. EfficientDet-D7 (no dynamic batching): 0.95 FPS (GPU utilization: up to 100%)
  4. EfficientDet-D7 (dynamic batching): 0.95 FPS (GPU utilization: up to 100%)

So we do see some performance improvement with Triton, but not to the extent we expected.
As I understand it, dynamic batching is supposed to improve GPU utilization, yet we saw no advantage at all from the models exported with dynamic batching enabled.
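For context, my understanding is that the dynamic batcher can only merge requests that are queued at the same time, so the client presumably needs to keep several requests in flight before a batch can form. A concurrent client would look roughly like the sketch below (tensor and model names taken from the config further down; note that all the numbers above come from the synchronous client shown at the end of this post):

import numpy as np
import tritonclient.http as httpclient

# Sketch only: keep 16 requests in flight so the dynamic batcher has
# something to group into a batch.
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=16)

# Dummy frame; the leading 1 is the per-request batch dimension.
image = np.zeros((1, 720, 1280, 3), dtype=np.uint8)
inp = httpclient.InferInput("input_tensor", list(image.shape), "UINT8")
inp.set_data_from_numpy(image)
outputs = [httpclient.InferRequestedOutput("detection_scores"),
           httpclient.InferRequestedOutput("detection_boxes")]

async_requests = [client.async_infer("tf_savedmodel_effdet1", [inp],
                                     request_id=str(i), outputs=outputs)
                  for i in range(16)]
results = [r.get_result() for r in async_requests]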

System:
AWS g4dn, NVIDIA T4

Config file:
This is the config.pbtxt for EfficientDet-D1 (the variant exported with dynamic batching support):

name: "tf_savedmodel_effdet1"
platform: "tensorflow_savedmodel"
max_batch_size: 64
input [
  {
    name: "input_tensor"
    data_type: TYPE_UINT8
    format: FORMAT_NONE
    dims: [ 720, 1280, 3 ]
  }
]
output [
  {
    name: "detection_scores"
    data_type: TYPE_FP32
    dims: [ 100 ]
    label_filename: "coco_labels.txt"
  },
  {
    name: "detection_boxes"
    data_type: TYPE_FP32
    dims: [ 100, 4 ]
    label_filename: "coco_labels.txt"
  }
]
dynamic_batching {
  preferred_batch_size: [1,2,4,8,16,32,64]
  max_queue_delay_microseconds: 30000
}
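As a sanity check, we can ask the running server what it actually loaded (small sketch, assuming the server started below is reachable on localhost:8000):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
cfg = client.get_model_config("tf_savedmodel_effdet1")
print(cfg["max_batch_size"])        # expect 64
print(cfg.get("dynamic_batching"))  # expect the preferred sizes and queue delay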

Server run command:

docker run --gpus=1 --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /home/ubuntu/model_repository:/models nvcr.io/nvidia/tritonserver:21.03-py3 tritonserver --model-repository=/models
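Before benchmarking we just confirm that the server and the model are up (sketch):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
assert client.is_server_ready()
assert client.is_model_ready("tf_savedmodel_effdet1")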

Client-side info:
Using HTTP, async=False (requests are sent synchronously, one at a time)

Request to server:

responses.append(
    triton_client.infer(FLAGS.model_name,
                        inputs,
                        request_id=str(sent_count),
                        model_version=FLAGS.model_version,
                        outputs=outputs))

Do let me know if any more information is needed. Thanks!