My team and I recently deployed a 4-GPU (2080 Ti) on-prem inference server. We’ve connected the metrics to Prometheus/Grafana, and we are seeing very low GPU utilization (7% at most) and slower-than-expected inference times.
Here are some settings we have configured:
- Using HTTP requests instead of gRPC
- We are mostly running un-batched inference calls; we tried dynamic batching, but it doesn’t seem to help
- We created instance groups with counts of up to 30, but this doesn’t seem to affect our run time
- We have 3 custom models and are also running a RetinaNet object detector; RetinaNet runs the slowest
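For context, here is roughly the shape of the model configuration we have been experimenting with (a sketch only; the model name, platform, batch sizes, and counts below are illustrative, not our exact config):

```
name: "retinanet"               # placeholder model name
platform: "onnxruntime_onnx"    # placeholder backend
max_batch_size: 8

# Dynamic batching: lets Triton group individual requests into a batch,
# waiting up to max_queue_delay_microseconds for more requests to arrive.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}

# Instance group: number of model copies per GPU.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1, 2, 3 ]
  }
]
```

We tried varying `count` and the dynamic-batching parameters without seeing a meaningful change in utilization.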
We feel like we’re stuck in first gear of a Porsche and don’t know how to go faster. We would welcome any suggestions. Also, if there’s a TensorRT expert reading this who would be willing to consult with us, please let me know. There’s some urgency in fixing this.