I recently deployed a machine learning model on two parallel GPU servers, with requests balanced between them by a load balancer. To reduce resource usage, I switched to a single-server setup by routing all requests directly to one of the servers. That server has an NVIDIA Tesla T4 GPU, and the model uses only about 1.2 GB of its 15 GB of GPU memory.
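For context, the memory figure comes from polling the card with the NVML bindings, roughly like this (a minimal sketch using the pynvml package, not my exact monitoring code):

```python
# Illustrative check of GPU memory use and utilization on the single server.
# Assumes the nvidia-ml-py package (imported as pynvml) is installed; this
# only shows the kind of reading taken, not the actual deployment script.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the Tesla T4 is device 0 here

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)

print(f"GPU memory used: {mem.used / 1024**3:.1f} GB of {mem.total / 1024**3:.1f} GB")
print(f"GPU utilization: {util.gpu}%")

pynvml.nvmlShutdown()
```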
After the change, I observed the following:
CPU Utilization: Remained almost unchanged.
GPU Utilization: Roughly doubled, as expected, but stayed well within the GPU's capacity.
Average GPU utilization went from 7.5% to 16.5%
Max GPU utilization went from 38% to 48%
90th percentile of GPU utilization went from 19% to 34%
Model Response Time: Increased by about 10% to 15% on average.
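The average / max / p90 utilization figures above come from aggregating periodic utilization samples, along these lines (a minimal sketch; the helper name and example values are made up, not my real data):

```python
# Hypothetical aggregation of periodic GPU-utilization readings (in percent)
# into the average, max, and 90th-percentile figures quoted above.
import numpy as np

def summarize_utilization(samples_percent):
    arr = np.asarray(samples_percent, dtype=float)
    return {
        "avg": float(arr.mean()),
        "max": float(arr.max()),
        "p90": float(np.percentile(arr, 90)),
    }

# Example with made-up samples:
print(summarize_utilization([5.0, 12.0, 7.5, 38.0, 19.0, 9.0]))
```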
Despite these observations, I can't pinpoint the exact cause of the increased model response time. Any insights or suggestions?