We have been testing Triton Inference Server for deploying models from the TensorFlow 2 model zoo. We compared the performance of EfficientDet-D1 (a small model) and EfficientDet-D7 (a large model) with and without Triton Inference Server. Models in the TensorFlow 2 model zoo do not have dynamic batching enabled by default, so we had to re-export them ourselves using their export code.
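For reference, the re-export looked roughly like the following, assuming the standard TensorFlow Object Detection API exporter script (the paths below are placeholders, not our actual ones):

```shell
# Re-export an EfficientDet checkpoint as a SavedModel using the
# TF Object Detection API exporter. All paths are placeholders.
python exporter_main_v2.py \
    --input_type image_tensor \
    --pipeline_config_path /path/to/pipeline.config \
    --trained_checkpoint_dir /path/to/checkpoint \
    --output_directory /path/to/exported_model
```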
Here are our observations.
Performance on NVIDIA T4 without Triton:
- EfficientDet-D1: 3.86 FPS
- EfficientDet-D7: 0.79 FPS
Performance on NVIDIA T4 with Triton:
- EfficientDet-D1 (no dynamic batching): 5.2 FPS (GPU utilization: up to 45%)
- EfficientDet-D1 (dynamic batching): 4.9 FPS (GPU utilization: up to 37%)
- EfficientDet-D7 (no dynamic batching): 0.95 FPS (GPU utilization: up to 100%)
- EfficientDet-D7 (dynamic batching): 0.95 FPS (GPU utilization: up to 100%)
So we do see some performance boost with Triton, but not to the extent we expected.
As I understand it, dynamic batching should improve GPU utilization, but we saw no advantage from the dynamic-batching-enabled models.
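One thing I wanted to rule out on our side: dynamic batching can only group requests that are queued at the same time, so a client that sends one synchronous request at a time may never give the server anything to batch. A minimal sketch of keeping several requests in flight with a thread pool (infer_once is a hypothetical wrapper, not part of tritonclient; in our setup it would call triton_client.infer for one image):

```python
from concurrent.futures import ThreadPoolExecutor

def run_concurrent(infer_once, num_requests, concurrency):
    """Issue num_requests calls to infer_once, keeping up to
    `concurrency` of them in flight at once; return results in
    submission order.

    infer_once is a placeholder: in a real client it would wrap
    triton_client.infer(...) for a single input.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = [pool.submit(infer_once, i) for i in range(num_requests)]
        return [f.result() for f in futures]

# Example with a stub in place of a real Triton call:
results = run_concurrent(lambda i: i * 2, num_requests=8, concurrency=4)
```

With a pattern like this, several requests can sit in the server queue within the max_queue_delay window, which is the situation dynamic batching is designed for.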
System:
AWS g4dn instance, NVIDIA T4 GPU
Config file (for EfficientDet-D1, with dynamic batching enabled):
name: "tf_savedmodel_effdet1"
platform: "tensorflow_savedmodel"
max_batch_size: 64
input [
  {
    name: "input_tensor"
    data_type: TYPE_UINT8
    format: FORMAT_NONE
    dims: [ 720, 1280, 3 ]
  }
]
output [
  {
    name: "detection_scores"
    data_type: TYPE_FP32
    dims: [ 100 ]
    label_filename: "coco_labels.txt"
  },
  {
    name: "detection_boxes"
    data_type: TYPE_FP32
    dims: [ 100, 4 ]
    label_filename: "coco_labels.txt"
  }
]
dynamic_batching {
  preferred_batch_size: [ 1, 2, 4, 8, 16, 32, 64 ]
  max_queue_delay_microseconds: 30000
}
Server run command:
docker run --gpus=1 --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /home/ubuntu/model_repository:/models nvcr.io/nvidia/tritonserver:21.03-py3 tritonserver --model-repository=/models
Client-side info:
Using HTTP, with async=False
Request to server:
responses.append(
    triton_client.infer(FLAGS.model_name,
                        inputs,
                        request_id=str(sent_count),
                        model_version=FLAGS.model_version,
                        outputs=outputs))
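For completeness, the inputs are prepared to match the config above: a uint8 tensor with dims [720, 1280, 3], plus the leading batch dimension Triton adds when max_batch_size > 0. A minimal sketch of that batching step with numpy (the frames here are dummy data, not our real images):

```python
import numpy as np

# Build a batch matching the model config: TYPE_UINT8, dims [720, 1280, 3].
# With max_batch_size > 0, Triton expects an extra leading batch dimension.
frames = [np.zeros((720, 1280, 3), dtype=np.uint8) for _ in range(4)]  # dummy frames
batch = np.stack(frames, axis=0)  # shape: (4, 720, 1280, 3)

# This array would then be attached to a tritonclient InferInput via
# set_data_from_numpy before calling triton_client.infer.
```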
Do let me know if any more information is needed. Thanks!