I recently deployed a machine learning model on two parallel GPU servers, with requests balanced between them by a load balancer. To reduce resource usage, I switched to a single-server setup by routing all requests directly to one of the servers. That server has an NVIDIA Tesla T4 GPU, and the model uses only about 1.2 GB of its 15 GB of GPU memory.
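For context, the memory figure comes from polling the card with the NVML bindings, roughly like this (a minimal sketch using the pynvml package, not my exact monitoring code):

```python
# Illustrative check of GPU memory use and utilization on the single server.
# Assumes the nvidia-ml-py package (imported as pynvml) is installed; this
# only shows the kind of reading taken, not the actual deployment script.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # the Tesla T4 is device 0 here

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)

print(f"GPU memory used: {mem.used / 1024**3:.1f} GB of {mem.total / 1024**3:.1f} GB")
print(f"GPU utilization: {util.gpu}%")

pynvml.nvmlShutdown()
```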
After the change, I observed the following:
CPU Utilization: Remained almost unchanged.
GPU Utilization: Roughly doubled, as expected, but stayed well within the GPU's capacity.
Average GPU utilization went from 7.5% to 16.5%
Max GPU utilization went from 38% to 48%
90th percentile of GPU utilization went from 19% to 34%
Model Response Time: Increased by about 10% to 15% on average.
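The average / max / p90 utilization figures above come from aggregating periodic utilization samples, along these lines (a minimal sketch; the helper name and example values are made up, not my real data):

```python
# Hypothetical aggregation of periodic GPU-utilization readings (in percent)
# into the average, max, and 90th-percentile figures quoted above.
import numpy as np

def summarize_utilization(samples_percent):
    arr = np.asarray(samples_percent, dtype=float)
    return {
        "avg": float(arr.mean()),
        "max": float(arr.max()),
        "p90": float(np.percentile(arr, 90)),
    }

# Example with made-up samples:
print(summarize_utilization([5.0, 12.0, 7.5, 38.0, 19.0, 9.0]))
```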
Despite these observations, I can't pinpoint the exact cause of the increased model response time. Any insights or suggestions?