Model run using nvinferserver occupying high GPU memory-usage

duttaneil16 · June 25, 2020, 5:34am

• Hardware Platform - RTX 2080
• DeepStream Version - 5.0
• NVIDIA GPU Driver Version - 440.33.01

Hi,
I have been trying to run a custom face detection model, which produces an output tensor of shape [25270, 6]. Each row is one ROI, and the 6 values are the bbox (first 4), confidence, and class, respectively. I have written a custom parse function that populates an NvDsInferObjectDetectionInfo object and pushes it into objectList. I have provided the NMS parameters in the inference config, and the setup gives the expected output.
The issue is that when I run the pipeline for a single source with batch size set to 1, the application occupies approx. 8000 MiB of memory as reported by nvidia-smi, which is almost 90% of the GPU memory. None of the example models provided in DeepStream exceeds 1000 MiB.
Is this expected behaviour?
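
For readers following along, the parsing step described above can be sketched roughly as follows. This is a minimal Python illustration, not the poster's C++ parser: it assumes the four bbox values are corner coordinates (x1, y1, x2, y2), which the post does not state, and `Detection` and `parse_rows` are hypothetical names. The fields loosely mirror the left/top/width/height, confidence, and class ID that NvDsInferObjectDetectionInfo carries.

```python
from typing import List, NamedTuple, Sequence


class Detection(NamedTuple):
    """Hypothetical stand-in for one parsed detection."""
    left: float
    top: float
    width: float
    height: float
    confidence: float
    class_id: int


def parse_rows(rows: Sequence[Sequence[float]],
               conf_threshold: float = 0.5) -> List[Detection]:
    """Turn [N, 6] rows into detections, dropping low-confidence rows.

    ASSUMPTION: each row is (x1, y1, x2, y2, confidence, class).
    The real model may use a different bbox layout (e.g. center/size).
    """
    detections = []
    for x1, y1, x2, y2, conf, cls in rows:
        if conf < conf_threshold:
            continue  # skip rows below the confidence threshold
        detections.append(
            Detection(x1, y1, x2 - x1, y2 - y1, conf, int(cls)))
    return detections
```

With 25270 rows per frame, filtering by confidence before any further processing keeps the object list small.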

ersheng · July 1, 2020, 4:48am

@duttaneil16

It seems this issue is about nvinfer, not nvinferserver.
In any case, we need detailed information about your DS and nvinfer setup and your face detection model, so that we can set up a similar environment to reproduce the problem.

Do you mind sharing your working directory including your detection model under /opt/nvidia/deepstream/deepstream-5.0/sources/?
Thanks.

duttaneil16 · July 1, 2020, 8:07am

Hi @ersheng,

I have made a custom pipeline based on deepstream-test1, in which I replaced the nvinfer plugin with nvinferserver. It is inside a folder named fd_tri_v, which contains the config file, the nvdsinfer_custom_impl_Yolo folder with the output parsing function, and the deepstream app file.

The config file is config_tri_fd.txt (custom), which I wrote using the documentation as a reference. I changed the function NvDsInferParseCustomYoloTLT in nvdsinfer_custom_impl_Yolo/nvdsparsebbox_Yolo.cpp to suit my network's output tensor, and I use the resulting .so file.
Contents of config_tri_fd.txt:

infer_config {
  unique_id: 1
  gpu_ids: [0]
  max_batch_size: 1
  backend {
    trt_is {
      model_name: "fd"
      version: -1
      model_repo {
        root: "…/models"
        strict_model_config: true
      }
    }
  }
  preprocess {
    network_format: IMAGE_FORMAT_RGB
    tensor_order: TENSOR_ORDER_NONE
    maintain_aspect_ratio: 1
    frame_scaling_hw: FRAME_SCALING_HW_DEFAULT
    frame_scaling_filter: 1
    normalize {
      scale_factor: 1
      channel_offsets: [0, 0, 0]
    }
  }
  postprocess {
    labelfile_path: "…/models/fd/labels_fd.txt"
    detection {
      num_detected_classes: 1
      custom_parse_bbox_func: "NvDsInferParseCustomYoloTLT"
      nms {
        confidence_threshold: 0.5
        iou_threshold: 0.3
        topk: 20
      }
    }
  }
  extra {
    copy_input_to_host_buffers: false
  }
  custom_lib {
    path: "nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so"
  }
}
input_control {
  process_mode: PROCESS_MODE_FULL_FRAME
  operate_on_gie_id: -1
  interval: 0
}
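
The nms block in the config above sets confidence_threshold, iou_threshold, and topk. As a rough illustration of what those three parameters usually control, here is a minimal greedy NMS sketch in Python. It is not nvinferserver's actual implementation; the corner-coordinate box layout, the function names, and the choice to apply topk after suppression are all assumptions made for the example.

```python
from typing import List, Tuple

# ASSUMPTION: a box is (x1, y1, x2, y2, confidence) in pixel coordinates.
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box],
               confidence_threshold: float = 0.5,
               iou_threshold: float = 0.3,
               topk: int = 20) -> List[Box]:
    """Keep the highest-confidence boxes that do not overlap a kept box."""
    candidates = sorted(
        (b for b in boxes if b[4] >= confidence_threshold),
        key=lambda b: b[4], reverse=True)
    kept: List[Box] = []
    for box in candidates:
        if len(kept) >= topk:
            break  # hypothetical choice: cap the output at topk boxes
        if all(iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept
```

A low iou_threshold such as 0.3 suppresses aggressively, which tends to suit face detection where true boxes rarely overlap much.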

The model I am using is similar to the tiny yolov3-spp model (AlexeyAB/darknet on GitHub), trained for face detection. I have the model as a graphdef.
I have followed the directions as suggested for running graphdef model from the link:
https://developer.nvidia.com/blog/building-iva-apps-using-deepstream-5.0/

My working directory is:
/opt/nvidia/deepstream/deepstream-5.0/sources/apps/sample_apps/fd_tri_v

ersheng · July 9, 2020, 7:34am

@duttaneil16

What format is your Yolo model in? TensorFlow, TensorRT, or Caffe?

duttaneil16 · July 9, 2020, 8:39am

Hi @ersheng,
I am using a tensorflow graphdef format.

ersheng · July 9, 2020, 9:20am

@duttaneil16

Sorry for the long wait.
Since this issue is a little complicated, we had some internal discussions, and here are our conclusions.

When running TensorFlow models using Triton Inference Server, the GPU device memory may fall short. The GPU device memory that TensorFlow models are allowed to allocate can be tuned using the 'tf_gpu_memory_fraction' parameter in nvdsinferserver's config files (config_infer_*). A larger value reserves more GPU memory for TensorFlow per process; this may give better performance, but may also cause Out-Of-Memory errors or even a core dump. The suggested value range is [0.2, 0.6].
To learn more details of each parameter, go to section “Gst-nvinferserver” in
https://docs.nvidia.com/metropolis/deepstream/plugin-manual/index.html

You can set tf_gpu_memory_fraction to a smaller value to force TensorFlow to limit its GPU memory usage. The default of 0 means NO GPU memory limit for the TensorFlow component.

infer_config{ backend { trt_is { model_repo {
    tf_gpu_memory_fraction:0.3
}}} }
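
To get a feel for what a given fraction means in absolute terms, here is a small Python sketch. It assumes the fraction is applied to the GPU's total memory, which is how TensorFlow's per-process memory fraction is generally described, and it uses 8192 MiB purely as an illustrative total for an 8 GB card. The function name is hypothetical, and the actual footprint reported by nvidia-smi will also include memory used outside TensorFlow.

```python
def tf_memory_cap_mib(total_mib: float, fraction: float) -> float:
    """Approximate TensorFlow memory cap for a given tf_gpu_memory_fraction.

    ASSUMPTION: the cap is fraction * total GPU memory; a fraction of 0
    means no limit (as described in the reply above), so the whole
    device is reported as available.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be within [0, 1]")
    if fraction == 0.0:
        return total_mib  # 0 disables the limit
    return total_mib * fraction


# Illustrative total only; read the real value from nvidia-smi.
for f in (0.2, 0.3, 0.6):
    print(f"fraction {f}: about {tf_memory_cap_mib(8192, f):.0f} MiB")
```

Under these assumptions, the suggested range [0.2, 0.6] corresponds to roughly 1.6 GiB to 4.8 GiB on an 8 GiB card.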

Besides that, to improve TF model performance, you can also try online TF-TRT conversion by appending the following block to the model's config.pbtxt for Triton server:

optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
    parameters { key: "precision_mode" value: "FP16" }
    parameters { key: "max_workspace_size_bytes" value: "512000000"}
}]
}}
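
If you prefer to script that edit, the following Python sketch appends the block above to a config.pbtxt once and leaves the file alone on later runs. The helper name and the duplicate check (a plain substring search for gpu_execution_accelerator) are assumptions for illustration; the path you pass is whatever config.pbtxt sits in your model's directory in the Triton model repository.

```python
from pathlib import Path

# The optimization block quoted in the reply above, verbatim.
TFTRT_BLOCK = '''optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
    parameters { key: "precision_mode" value: "FP16" }
    parameters { key: "max_workspace_size_bytes" value: "512000000"}
}]
}}
'''


def append_tftrt_block(config_path: str) -> bool:
    """Append the TF-TRT block to config.pbtxt unless one already exists.

    Returns True if the file was modified, False otherwise.
    """
    path = Path(config_path)
    text = path.read_text()
    if "gpu_execution_accelerator" in text:
        return False  # simple guard against adding the block twice
    if text and not text.endswith("\n"):
        text += "\n"
    path.write_text(text + TFTRT_BLOCK)
    return True
```

Keeping a backup of the original config.pbtxt is worthwhile, since the conversion can fail on unsupported layers.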

duttaneil16 · July 10, 2020, 5:18am

Hi @ersheng
I set the parameter tf_gpu_memory_fraction to smaller fractions and tested the throughput, which was not affected much. I will have to test some other models where this memory issue was happening.
I had tried TF-TRT conversion earlier, but it failed due to an unsupported layer operation.
Could you direct me to links for proper TF-TRT conversion for Triton server? Also, I could not find the online TF-TRT converter suggested above.

Thanks.

ersheng · August 4, 2020, 10:55am

@duttaneil16

Have you tried this?
https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#worflow-with-savedmodel

duttaneil16 · August 5, 2020, 6:37pm

Hi @ersheng,
I tried the suggested TF-TRT guide, but it gives trt_engine_opts as 0. So I tried the changes to config.pbtxt suggested above for optimization, which converted a portion of the graph to trt_engines.
I then had trouble running the model converted on the fly with the changed config, for which I have created a topic. Here is the reference:

Basically, either the parameters given for the TF-TRT conversion or the model itself seems to be the issue.

ersheng · August 12, 2020, 6:47am

@duttaneil16

Please open a new forum topic for your new question so that we can easily trace these topics.
Thanks.
