YOLOv8-l segmentation causes sudden spikes in GPU load when two streams are inferencing in parallel

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU) A10 or A30
• DeepStream Version 6.2
• JetPack Version (valid for Jetson only)
• TensorRT Version 8.5
• NVIDIA GPU Driver Version (valid for GPU only) 12.3
• Issue Type( questions, new requirements, bugs) Bug
• How to reproduce the issue?
When the DeepStream app with RTSP output is executed and two or more streams are inferenced simultaneously, the average GPU load is about 30% at 12 FPS; as soon as multiple objects are segmented, the GPU load spikes up to 60 to 70%.
• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)
I used GitHub - marcoslucianops/DeepStream-Yolo-Seg: NVIDIA DeepStream SDK 6.3 / 6.2 / 6.1.1 / 6.1 / 6.0.1 / 6.0 implementation for YOLO-Segmentation models to get the parser's custom library, and for the ONNX conversion as well.
What could be the root cause for this? If there is a fix, how can I apply it?

When multiple objects are segmented, it is normal for the GPU load to rise. You can also use the system profiling tool Nsight Systems | NVIDIA Developer to check the load.
User Guide :: Nsight Systems Documentation (nvidia.com)
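
For example, one way to capture a trace of the running pipeline with Nsight Systems (the binary name, config path, trace options, and duration below are placeholders, not required settings):

```bash
# Sketch only: profile the DeepStream app for ~60 s with CUDA/NVTX/OS traces.
nsys profile --trace=cuda,nvtx,osrt --duration=60 -o ds_seg_profile \
    ./deepstream-app -c deepstream_app_config.txt

# Open the generated report in the Nsight Systems GUI to see which kernels and
# memory copies coincide with the GPU load spikes.
```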

Hi @yuweiw, thanks for the reply, and sorry I couldn't reply earlier. I have observed a few possible root causes for the spike; please give your insights on them.

  1. We are getting the segmented mask array at 160x160 resolution, which is the output of the engine model. To fetch the contours of the segmented values, I am resizing and rescaling it as shown in the deepstream_segmentation.py example of the deepstream_python_apps repo (see the sketch after this list). Doing this holds the pipeline in a thread lock because it takes a huge load. If I have to do this resizing in the custom library instead, the bbox values are normalized and standardized to the image resolution; is there a way I can do that?
  2. Even without the functionality in point one, adding the display_mask property in NvOSD causes this spike.
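
For reference, a rough sketch of the per-object mask handling described in point 1, assuming the mask is read through the pyds bindings the way the deepstream_python_apps segmentation samples do (the `mask_params`/`get_mask_array()` accessors, and whether the 160x160 mask maps to the full frame or only to the object's bounding box, depend on your bindings version and custom parser, so treat this as illustrative only):

```python
import cv2
import numpy as np

def mask_to_contours(obj_meta, target_width, target_height, threshold=0.5):
    """Upscale a low-resolution (e.g. 160x160) instance mask and extract contours.
    Reading the mask in Python copies it from GPU to CPU, and the resize below
    runs on the CPU, which is where the per-object cost comes from."""
    mp = obj_meta.mask_params                                # assumed NvOSD_MaskParams
    mask = mp.get_mask_array().reshape(mp.height, mp.width)  # low-resolution float mask
    mask_full = cv2.resize(mask, (target_width, target_height),
                           interpolation=cv2.INTER_LINEAR)   # the expensive CPU step
    binary = (mask_full > threshold).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return contours
```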

I also have one more issue: when I stream two different footages into the pipeline and only one stream has objects, it does not detect at all, but if both streams have objects it detects them, and that causes this spike.

  1. When you get the mask data from Python, this involves a memory copy from the GPU to the CPU.
  2. If you set display_mask, the nvdsosd plugin will draw the mask on the image.
    Both of these factors will lead to an increase in GPU load (see the property sketch below). It's open source now; you can check that in the sources\gst-plugins\gst-nvdsosd\gstnvdsosd.c file.
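
As a concrete illustration of the second point, the drawing work is controlled by nvdsosd properties that you can toggle from the Python app (the property and element names here are assumptions; verify them with `gst-inspect-1.0 nvdsosd` on your DeepStream 6.2 install):

```python
# Sketch: nvdsosd draws whatever these flags enable. Mask drawing is the costly
# part when many objects are segmented; boxes and text are comparatively cheap.
nvosd = pipeline.get_by_name("onscreendisplay")   # hypothetical element name
nvosd.set_property("display-mask", True)          # draw instance masks (raises the load)
nvosd.set_property("display-bbox", True)
nvosd.set_property("display-text", True)
```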

About your new issue, this is weird. Maybe there's something wrong with your config file. Could you attach your project so that we can run it on our platform? We can try that.

  1. Well, is there any difference between a trtexec conversion and DeepStream's built-in TensorRT conversion?
  2. What does qps mean for TensorRT? Could you explain it in layman's terms?
  3. One more finding is that when the .pt model is exported to .engine using the yolo CLI, its inference is pretty awesome. But when it is converted using the GitHub - marcoslucianops/DeepStream-Yolo-Seg: NVIDIA DeepStream SDK 6.3 / 6.2 / 6.1.1 / 6.1 / 6.0.1 / 6.0 implementation for YOLO-Segmentation models repo, its inference is reduced by approximately 30% compared to the yolo CLI.

  1. If they were all ONNX files, they would be the same (see the sketch after this list).
  2. You can simply think of it as the inference speed of the model.
  3. It is possible that precision was lost in the process of converting your .pt file to an ONNX file.
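
To make point 1 concrete, both routes consume the same ONNX file; the difference is only who invokes the TensorRT builder. A hedged sketch (the precision flag and file names are examples, not required settings):

```bash
# Route A: build the engine yourself with trtexec, then point nvinfer at it
# via model-engine-file in the config.
trtexec --onnx=yolov8l-seg.onnx --saveEngine=yolov8l-seg.engine --fp16

# Route B: set onnx-file in the nvinfer config and let DeepStream build and
# cache the engine on the first run. With the same ONNX, builder version, and
# precision, the resulting engines should behave the same.
```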

For re-creation you can take any yolov8 model (one with general weights), convert it with DeepStream-Yolo-Seg, and use the deepstream-seg-mask pipeline structure. Nothing different from what I am using now.

  1. What we find is that any ONNX converted with trtexec performs better with respect to discrete inferencing.
  2. Does it have anything to do with this issue? Because that seems to be solved.

[quote=“neeraj.sj, post:4, topic:284928”]
I also have one more issue: when I stream two different footages into the pipeline and only one stream has objects, it does not detect at all, but if both streams have objects it detects them, and that causes this spike.
[/quote]

  1. Do you have any default way to load engine files converted directly with the yolo CLI? I mean in terms of nvinfer's custom library for yolov8.

Could you describe the following two ways in detail?
For example, the CLI used to generate the engine, how you measured the 30% difference, the detailed operation procedure step by step, etc.

The most likely possibility is the one I mentioned:

It seems that you use the .pt file to generate the engine directly, but use the ONNX file when you use DeepStream.

If you want to load your engine file directly, you only need to configure the model-engine-file parameter.
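
For example, in the [property] section of the nvinfer config (a minimal sketch; the path is a placeholder and the rest of your existing YOLO-Seg config stays unchanged):

```ini
[property]
# Load a prebuilt TensorRT engine directly instead of rebuilding from ONNX.
# Keep your existing custom-lib-path / parse-function / network-type entries as they are.
model-engine-file=/path/to/yolov8l-seg.engine
```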

In the yolov8 CLI from Ultralytics:
yolo export model=yolov8l-seg.pt format=engine or yolo export model=yolov8l-seg.pt format=onnx
To infer on the engine exported this way, libcustomsegparser.so does not work, as it is coded for the conversion from the DeepStream-Yolo-Seg repo,

but when I run it through the yolo CLI I get pretty awesome detections with no GPU spikes, and the load is also decent.
But DeepStream is meant to solve these issues, right? What am I missing here,
while keeping the accuracy intact and improving the performance in terms of speed and load?

Are you saying DeepStream is better? We haven't run yolov8 before. Can you post the detailed steps of the two methods you run, so that we can check them on our side?

What I meant was that DeepStream is implemented for higher and better performance, which includes lower GPU load and better inference handling. So ideally it should have better ways to handle yolov8, like making a yolo CLI based TensorRT engine compatible with libcustombbox&segparser, right? That is comparatively better for a single stream, from what I have seen.

OK, here are the steps:

  1. Use this Yolov8 .pt model.
  2. Follow the steps in these YoloV8 docs to convert it to ONNX.
  3. Compile this folder for nvdsinfer_custom_impl_Yolo_seg.
  4. Copy this config file.
  5. Run the DeepStream code example deepstream-seg-mask.py with the above config file.
    Then you are able to run it (see the condensed shell sketch after this list).
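
A condensed shell sketch of those steps (the exporter script name, make variable, and example arguments are assumptions based on the DeepStream-Yolo-Seg README and may have changed; follow the repo's own instructions):

```bash
# 1-2. Export the .pt weights to ONNX with the repo's exporter (assumed script name/flag).
git clone https://github.com/marcoslucianops/DeepStream-Yolo-Seg.git
cd DeepStream-Yolo-Seg
python3 export_yoloV8_seg.py -w yolov8l-seg.pt

# 3. Build the custom parser library (set CUDA_VER to the CUDA version on your box).
CUDA_VER=11.8 make -C nvdsinfer_custom_impl_Yolo_seg

# 4-5. Run the Python example with the YOLO-Seg config and your stream.
python3 deepstream-seg-mask.py   # plus the config/stream arguments the script expects
```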

For verification, just load a model exported from the yolo CLI rather than via the docs mentioned above,
and, using the same yolo CLI, predict on the TensorRT file.
You can refer to this.
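
For the comparison run, the Ultralytics CLI side would be roughly as follows (the source path is a placeholder; check the Ultralytics docs for the exact arguments):

```bash
# Export the same weights straight to a TensorRT engine with the Ultralytics CLI...
yolo export model=yolov8l-seg.pt format=engine

# ...then predict on the exported engine to compare detections, speed, and GPU load.
yolo predict model=yolov8l-seg.engine source=sample_1080p.mp4
```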

DeepStream makes good use of GPU acceleration while also minimizing memory copies, so it is implemented for higher and better performance, which includes lower GPU load.
As for the accuracy, it may be that DeepStream loses less during the conversion and scaling of the images.

Well then, that's exactly what we need to resolve, right?
And why are we mapping the segmentation masks to the frame in NvOSD in CPU mode, when ideally this should happen on the GPU? Do we have any better solution for that?

Drawing the mask can be configured with nvdsosd->frame_mask_params->mode in our sources\gst-plugins\gst-nvdsosd\gstnvdsosd.c. You can set that to CPU mode or GPU mode.
We cannot provide yolo CLI optimization measures, only DeepStream related support. Thanks.

What I am requesting is a custom bbox parsing library for a yolo CLI converted engine, as an example like the other deepstream_python_apps. And DeepStream's NvOSD not supporting GPU mode for segmentation was one error.

Well, thanks for the inputs. Great product in all its complexity. Thank you so much.

You can just set process_mode to 1 to enable the GPU mode of the nvdsosd plugin.
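
In the Python app that is a one-line property change, sketched below; note that, as described in the next reply, mask drawing itself still requires CPU mode on DeepStream 6.2:

```python
# Sketch: switch nvdsosd processing mode; 0 = CPU, 1 = GPU.
# On DeepStream 6.2, drawing masks reportedly still requires CPU mode,
# so keep 0 if display-mask is enabled.
nvosd.set_property("process-mode", 1)
```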

While doing that for masking, it gives an error saying that DeepStream currently does not support masks on the GPU and asking to set it to CPU mode.

Is it fixed in a newer version of DeepStream?

There is no update from you for a period, assuming this is not an issue anymore. Hence we are closing this topic. If need further support, please open a new one. Thanks

Yes. You can upgrade to the DeepStream 6.4 version.