Low GPU utilization

I have set up 50 pipelines for streaming decoding, inference, and post-processing using the DeepStream framework. I have measured the inference speed and found it to be slow, with low GPU utilization. The video streams I am using are h264, 1080p, and 25 fps. I have 4 T4 GPUs, and the inference model is yolov7. My cloud server is configured with Ubuntu 18.04, CUDA 11.4 Update 1, TensorRT 8.0 GA (8.0.1), NVIDIA Driver 470.63.01, NVIDIA DeepStream SDK 6.0, and GStreamer 1.14.5. I have already set the GPU ID for decoding and inference, so there should be no GPU-to-GPU copy operations. The GPU utilization for all four T4 GPUs is approximately 40%-50%. I only require the inference time to be less than 80ms; other times are not a concern. Therefore, I would like to know if the 4 T4 GPUs can support 50 pipelines (using the yolov7 model) simultaneously for streaming decoding, inference, and post-processing.
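To see whether 4 T4s can plausibly serve 50 pipelines, a back-of-envelope throughput check helps. The sketch below is arithmetic only; the per-frame latency `t_ms` is an illustrative assumption, not a measured YOLOv7 number, so substitute your own measurement:

```python
# Back-of-envelope throughput check for 50 x 1080p@25fps streams on 4 T4s.
# The per-frame latency t_ms below is an assumption, not a measurement.

streams = 50
fps = 25
total_fps = streams * fps        # frames/second the whole system must sustain
gpus = 4
fps_per_gpu = total_fps / gpus   # frames/second each T4 must handle

# If YOLOv7 inference takes t_ms per frame at batch size 1, one GPU can
# serve at most 1000 / t_ms frames per second for inference alone.
t_ms = 8.0                       # assumed per-frame latency; replace with yours
max_fps_per_gpu = 1000.0 / t_ms

print(f"required per GPU: {fps_per_gpu:.1f} fps, "
      f"achievable at {t_ms} ms/frame (batch 1): {max_fps_per_gpu:.1f} fps")
```

With these assumed numbers each T4 must sustain 312.5 fps but batch-1 inference tops out well below that, which is why batching (discussed later in the thread) matters.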

I have measured the inference time, and it is over 400 milliseconds.

I am using local videos, not camera videos.

I have tested using two RTX 3090 GPUs, and I also observed low GPU utilization. Similarly, when using two 4080 GPUs for testing, I encountered the same issue of low GPU utilization.

Have you run these pipelines on different GPUs?

If not, the bottleneck may be the hardware video decoder. According to the data at Video Codec SDK | NVIDIA Developer, a T4 only supports decoding 34 x 1080p@30fps streams. You can use “nvidia-smi dmon” to check the decoder and encoder load while the pipelines are running.
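A small sketch of how the `nvidia-smi dmon` output can be scanned for decoder saturation. The column order (gpu, sm, mem, enc, dec) matches the typical `nvidia-smi dmon -s u` layout, but verify it against the header line your driver actually prints; the sample text here is fabricated for illustration:

```python
# Sketch: parse `nvidia-smi dmon -s u` style output to spot decoder saturation.
# Sample output is illustrative; check the header line from your own driver.

sample = """\
# gpu    sm   mem   enc   dec
# Idx     %     %     %     %
    0    45    30     0    98
    1    48    33     0    97
"""

def decoder_loads(dmon_text):
    """Return {gpu_index: decoder_utilization_percent} from dmon text."""
    loads = {}
    for line in dmon_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip header/blank lines
        gpu, sm, mem, enc, dec = line.split()
        loads[int(gpu)] = int(dec)
    return loads

for gpu, dec in decoder_loads(sample).items():
    if dec > 90:
        print(f"GPU {gpu}: decoder at {dec}% -- likely the bottleneck")
```

If the `dec` column sits near 100% while `sm` stays low, the NVDEC engines, not the compute cores, are the limit.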

I have just conducted testing on my local server. The environment and configuration of my local server are different from the cloud server. The local server environment consists of Ubuntu 18.04, CUDA 11.4, TensorRT 8.4.2-1, DeepStream 6.0, and two 4080 GPUs with driver version 530.41.03. I have enabled 24 pipelines and used the yolov7 model for inference. I monitored the GPU utilization using ‘watch -n 1 nvidia-smi’ and ‘nvidia-smi dmon’ commands. However, the GPU utilization remained low, and the inference time was measured at 244ms.

I have configured the decoding and inference of the pipelines on the same GPU to avoid GPU-to-GPU copying. In this case, the 24 pipelines are evenly distributed across two GPUs, meaning that 12 pipelines are simultaneously performing decoding and inference on one GPU.

So do you have any suggestions?

May I ask if you have any suggestions? I have already done as you advised, but I don’t think decoding is the bottleneck, so are there any other possibilities?

What is your pipeline? What kind of sources are you testing on, RTSP streams or local files? What is the sink of the pipeline? You mentioned that inference is slow; how did you measure the inference speed, and how slow is it?

There are some instructions for performance measurement: Performance — DeepStream 6.2 Release documentation

My pipeline is shown in the diagram; all pipelines are independent. I am testing on local H264 video files at 1080p resolution and 25 FPS. The pipeline outputs both the inference results and some post-processing results. To measure the inference speed, I calculate the processing time of one frame through the detectorbin. In a recent test on a cloud server with four T4 GPUs, the average inference time across 40 pipelines was 100ms: most pipelines measured 4-5ms per frame, while some took around one to two thousand milliseconds. The inference model is yolov7. Screenshots of the GPU utilization and decoder utilization are shown in the following image.

What I want is for the inference time of each pipeline to be less than 80ms. It seems the T4 GPUs have not reached their maximum utilization, so I would like to know how to increase GPU utilization.

I have another piece of information to provide. The previous environment of this system was able to run up to 60 pipelines with an inference time of around 5ms. The GPU utilization could reach 70%. However, after the system was reinstalled, it was unable to achieve the same performance. Could it be due to the environment? The environment for T4 has been mentioned earlier.

What is the “GstTEP” in your pipeline?

Please answer the questions one by one.

Your pipeline feeds a single stream into each pipeline, so the inference batch size is 1. That wastes GPU capability.
Any of the components can be the bottleneck that keeps the system from reaching its highest performance. Please read the instructions for performance measurement first: Performance — DeepStream 6.2 Release documentation
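Batching frames from multiple sources into one inference call is how DeepStream fills the GPU. A sketch of the relevant settings, assuming the standard deepstream-app / nvinfer config file format (the value 16 is illustrative and should match the number of sources and the engine's max batch size):

```
# deepstream-app config: nvstreammux batches frames from N sources
[streammux]
batch-size=16
batched-push-timeout=40000   # microseconds to wait before pushing a partial batch

# nvinfer config file (e.g. the config referenced by config-file=...):
[property]
batch-size=16                # should match the engine's max batch size
```

With 50 independent single-stream pipelines, each nvinfer instance runs at batch size 1; muxing streams together before inference lets one engine call process many frames at once.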

First question: I’m using local video sources, H264, 1080p, 25 FPS.
Second question: GstTEP is our post-processing plugin; it post-processes the YOLO inference results and is connected after the inference plugin. Even when we remove this plugin for testing, GPU utilization does not increase.
Third question: batch processing should not be the issue. Previously, we achieved high GPU utilization and short inference times on the T4, but after reinstalling the environment we could not reproduce those results. Do I need to install the corresponding T4 driver and a DeepStream version compatible with the T4?
Fourth question: can you provide me with environment configuration lists for DeepStream 6.0 on the T4, RTX 3090, and RTX 4080?

How about the CPU loading?


Please follow the compatibility matrix in the Quickstart Guide — DeepStream 6.3 Release documentation

DeepStream 6.0 does not support RTX4080

RTX3090 driver: Linux x64 (AMD64/EM64T) Display Driver | 470.129.06 | Linux 64-bit | NVIDIA

  1. CPU loading is here.
  2. Does that mean the environment could be causing the low GPU utilization? If I need to run YOLOv7 on Ubuntu 18.04 using a T4 GPU with DeepStream 6.0, what versions of CUDA, cuDNN, and TensorRT should I use?
  3. Is DeepStream currently not compatible with the RTX 40 series?

There is no update from you for a period, assuming this is not an issue anymore. Hence we are closing this topic. If need further support, please open a new one. Thanks

Looks good.


DeepStream 6.0 version is not compatible with RTX 4090. You need DeepStream 6.2 for RTX 4090. The compatibility is in Quickstart Guide — DeepStream 6.3 Release documentation

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.