My team and I recently deployed a 4-GPU (2080 Ti) on-prem inference server. We’ve connected the metrics to Prometheus/Grafana, and we are seeing very low GPU utilization (7% at most) and slower-than-expected inference times.
Here are some settings we have configured:
- Using HTTP requests instead of gRPC
- We are mostly running un-batched inference calls; we tried dynamic batching, but it doesn’t seem to help
- We created instance groups with counts of up to 30, but this doesn’t seem to affect our run time
- We have 3 custom models and are also running a RetinaNet object detector; RetinaNet runs the slowest
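For context, here is roughly the shape of the model configuration we have been experimenting with (a sketch only; the model name, platform, batch sizes, and counts below are illustrative, not our exact config):

```
name: "retinanet"               # placeholder model name
platform: "onnxruntime_onnx"    # placeholder backend
max_batch_size: 8

# Dynamic batching: lets Triton group individual requests into a batch,
# waiting up to max_queue_delay_microseconds for more requests to arrive.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}

# Instance group: number of model copies per GPU.
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1, 2, 3 ]
  }
]
```

We tried varying `count` and the dynamic-batching parameters without seeing a meaningful change in utilization.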
We feel like we’re stuck in first gear of a Porsche and don’t know how to go faster. We would welcome any suggestions. Also, if there’s a TensorRT expert reading this who would be willing to consult with us, please let me know. There’s some urgency in fixing this.