Performance Optimization and Troubleshooting in GPU-Based Machine Learning Deployments

I recently deployed a GPU-accelerated machine learning model on two parallel servers, with a load balancer distributing requests between them. To reduce resource usage, I switched to a single-server setup by routing all requests directly to one of the servers. That server has an NVIDIA Tesla T4 GPU, and the model uses only around 1.2 GB of its 15 GB of GPU memory.

After the change, I observed the following:

CPU Utilization: Remained almost unchanged.

GPU Utilization: Roughly doubled, as expected, but did not exceed the GPU's capacity:
  Average GPU utilization went from 7.5% to 16.5%
  Max GPU utilization went from 38% to 48%
  90th-percentile GPU utilization went from 19% to 34%

Model Response Time: Increased by about 10% to 15% on average.

Despite these observations, I'm unable to pinpoint the exact cause of the increased model response time. Any insights or suggestions?
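One way to correlate the dashboard numbers above with individual slow requests is to log GPU utilization and memory over time on the single server. Below is a minimal sketch using NVIDIA's NVML Python bindings (pynvml); the one-second sampling interval and device index 0 are assumptions for illustration, not something specified in the thread.

```python
# Minimal GPU monitoring sketch using the NVML Python bindings (pynvml).
# Assumes: pip install nvidia-ml-py, and a single GPU at device index 0.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the Tesla T4 in this setup (assumed index 0)

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu / .memory in percent
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # .used / .total in bytes
        print(
            f"gpu_util={util.gpu}%  mem_util={util.memory}%  "
            f"mem_used={mem.used / 1024**3:.2f} GiB of {mem.total / 1024**3:.2f} GiB"
        )
        time.sleep(1.0)  # sampling interval (arbitrary choice; adjust as needed)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()
```

Sampling at this granularity makes it easier to see whether the 90th-percentile utilization spikes line up with the slower responses, which averaged metrics can hide.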

Hello @sourabh.ojha13196, welcome to the NVIDIA developer forums.

I am not sure there are many people in this particular corner of the forums who might have ideas here.
Off the top of my head, I would start by looking at system memory utilization. Even if CPU utilization does not change, the single server will still be using considerably more system memory and network bandwidth.
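To follow up on that suggestion, one option is to sample host memory and network throughput alongside the GPU metrics. Here is a minimal sketch using the psutil library; psutil itself and the 5-second sampling window are illustrative assumptions, not part of the original setup.

```python
# Minimal system-memory and network-bandwidth sampler using psutil (pip install psutil).
# The 5-second window below is an arbitrary choice for illustration.
import time
import psutil

INTERVAL_S = 5.0

prev = psutil.net_io_counters()
while True:
    time.sleep(INTERVAL_S)
    mem = psutil.virtual_memory()    # system RAM usage
    cur = psutil.net_io_counters()   # cumulative NIC byte counters
    rx_mbps = (cur.bytes_recv - prev.bytes_recv) * 8 / 1e6 / INTERVAL_S
    tx_mbps = (cur.bytes_sent - prev.bytes_sent) * 8 / 1e6 / INTERVAL_S
    print(
        f"mem_used={mem.percent:.1f}%  "
        f"net_rx={rx_mbps:.1f} Mbit/s  net_tx={tx_mbps:.1f} Mbit/s"
    )
    prev = cur
```

If these numbers stay flat while response time climbs, that points away from host memory or network bandwidth as the cause.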

But I am not the expert here; I would suggest you also check out other parts of the forums, such as Deep Learning (Training & Inference) - NVIDIA Developer Forums or Accelerated Computing - NVIDIA Developer Forums. In those categories it might also be helpful to share a bit more detail about your model and what you are trying to achieve.

Thanks!
