Model run using nvinferserver occupying high GPU memory-usage

• Hardware Platform - RTX 2080
• DeepStream Version - 5.0
• NVIDIA GPU Driver Version - 440.33.01

I have been trying to run a custom face detection model, which produces an output tensor of shape [25270, 6]. Each row corresponds to one ROI, and the 6 values are the bbox (first 4), the confidence, and the class, in that order. I have written a custom parse function that populates an NvDsInferObjectDetectionInfo object for each detection and pushes it into objectList. I have provided the NMS parameters in the inference config, and the setup gives the expected output.
The issue is that when running the pipeline with a single source and batch size set to 1, the application occupies approx. 8000 MiB of memory as reported by nvidia-smi, which is almost 90% of the GPU memory. None of the example models provided with DeepStream exceeds 1000 MiB.
Is this expected behaviour?
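
For readers unfamiliar with this kind of output parsing, the row layout described above can be sketched as follows. This is only an illustration in Python; the actual DeepStream parser is a C++ function, and the bbox ordering (left, top, width, height) and the helper name parse_rows are assumptions, not details confirmed in this thread. The dictionary keys mirror the field names of NvDsInferObjectDetectionInfo.

```python
from typing import List, Sequence


def parse_rows(rows: Sequence[Sequence[float]],
               conf_threshold: float = 0.5) -> List[dict]:
    """Convert [N, 6] rows into detection records.

    Assumed row layout (hypothetical): left, top, width, height,
    confidence, class id.
    """
    detections = []
    for row in rows:
        if len(row) != 6:
            raise ValueError("expected 6 values per row")
        left, top, width, height, confidence, class_id = row
        # Drop low-confidence rows early so NMS sees fewer candidates.
        if confidence < conf_threshold:
            continue
        # Skip degenerate boxes.
        if width <= 0 or height <= 0:
            continue
        detections.append({
            "classId": int(class_id),
            "left": left,
            "top": top,
            "width": width,
            "height": height,
            "detectionConfidence": confidence,
        })
    return detections
```

Filtering by confidence inside the parser keeps the list handed to NMS short, which matters when the network emits 25270 candidate rows per frame.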


It seems this issue is about nvinfer, not nvinferserver.
In any case, we need detailed information about your DS and nvinfer setups and your face detection model so that we can set up a similar environment and reproduce the problem.

Do you mind sharing your working directory including your detection model under /opt/nvidia/deepstream/deepstream-5.0/sources/?

Hi @ersheng,

I built a custom pipeline based on deepstream-test1 in which I replaced the nvinfer plugin with nvinferserver. It is in a folder named fd_tri_v, which contains the config file, the nvdsinfer_custom_impl_Yolo folder with the output parsing function, and the deepstream app file.

The config file is config_tri_fd.txt (custom), which I wrote using the documentation as a reference. I changed the function NvDsInferParseCustomYoloTLT in nvdsinfer_custom_impl_Yolo/nvdsparsebbox_Yolo.cpp to suit my network's output tensor and used the corresponding .so file.
Contents of config_tri_fd.txt:

infer_config {
  unique_id: 1
  gpu_ids: [0]
  max_batch_size: 1
  backend {
    trt_is {
      model_name: "fd"
      version: -1
      model_repo {
        root: "…/models"
        strict_model_config: true
      }
    }
  }
  preprocess {
    network_format: IMAGE_FORMAT_RGB
    tensor_order: TENSOR_ORDER_NONE
    maintain_aspect_ratio: 1
    frame_scaling_hw: FRAME_SCALING_HW_DEFAULT
    frame_scaling_filter: 1
    normalize {
      scale_factor: 1
      channel_offsets: [0, 0, 0]
    }
  }
  postprocess {
    labelfile_path: "…/models/fd/labels_fd.txt"
    detection {
      num_detected_classes: 1
      nms {
        confidence_threshold: 0.5
        iou_threshold: 0.3
        topk: 20
      }
    }
  }
  extra {
    copy_input_to_host_buffers: false
  }
}
input_control {
  operate_on_gie_id: -1
  interval: 0
}
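
As a rough illustration of how the three nms values in this config (confidence_threshold, iou_threshold, topk) typically interact, here is a minimal greedy NMS sketch in Python. It is not the plugin's implementation; the box layout (left, top, width, height, confidence) and the choice to apply topk after suppression are assumptions made for the example.

```python
from typing import List, Tuple

# Assumed layout: left, top, width, height, confidence.
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two (left, top, width, height) boxes."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box],
        confidence_threshold: float = 0.5,
        iou_threshold: float = 0.3,
        topk: int = 20) -> List[Box]:
    """Greedy NMS: keep the best box, drop overlapping ones, repeat."""
    candidates = sorted(
        (b for b in boxes if b[4] >= confidence_threshold),
        key=lambda b: b[4],
        reverse=True,
    )
    kept: List[Box] = []
    for box in candidates:
        if len(kept) >= topk:
            break  # assumed: topk caps the number of surviving boxes
        # Keep the box only if it does not overlap a higher-scoring kept box.
        if all(iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept
```

With iou_threshold at 0.3, two boxes that share more than 30% of their combined area are treated as the same face, and only the higher-scoring one survives.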

The model I am using is similar to the tiny yolov3-spp model, trained for face detection. I have the model as a graphdef.
I have followed the directions as suggested for running graphdef model from the link:

My working directory is:


Which format is your YOLO model in: TensorFlow, TensorRT, or Caffe?

Hi @ersheng,
I am using the TensorFlow graphdef format.


Sorry for the long wait.
Since this issue is a little complicated, we discussed it internally, and here are our conclusions.

  1. When running TensorFlow models using Triton Inference Server, GPU device memory may run short. The GPU device memory that TensorFlow models are allowed to allocate can be tuned using the ‘tf_gpu_memory_fraction’ parameter in nvdsinferserver’s config files (config_infer_*). A larger value reserves more GPU memory for TensorFlow per process, which may improve performance but may also cause out-of-memory errors or even a core dump. The suggested value range is [0.2, 0.6].
  2. To learn more details of each parameter, go to section “Gst-nvinferserver” in

You can set tf_gpu_memory_fraction to a smaller value to force TensorFlow to limit its GPU memory usage. The default, 0, means there is no GPU memory limit for the TensorFlow component.

infer_config { backend { trt_is { model_repo {
  tf_gpu_memory_fraction: 0.4  # example value only; choose one in the suggested [0.2, 0.6] range
}}} }
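
To get a feel for what a given fraction means in absolute terms, the arithmetic can be sketched as below. This is a rough estimate only: it assumes the fraction applies to the GPU's total memory, the 8192 MiB total is a hypothetical figure for an 8 GB card, and the helper name tf_memory_budget_mib is made up for this example. The number reported by nvidia-smi will also include the CUDA context and DeepStream's own buffers.

```python
def tf_memory_budget_mib(total_mib: float, fraction: float) -> float:
    """Approximate GPU memory TensorFlow may reserve for one process."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be within [0, 1]")
    if fraction == 0.0:
        # Per the explanation above, 0 means no limit is applied.
        return total_mib
    return total_mib * fraction


TOTAL_MIB = 8192  # hypothetical total for an 8 GB card

for fraction in (0.2, 0.4, 0.6):
    budget = tf_memory_budget_mib(TOTAL_MIB, fraction)
    print(f"fraction {fraction}: about {budget:.0f} MiB reserved for TensorFlow")
```

Under these assumptions, leaving the parameter at 0 lets TensorFlow claim nearly the whole card, which is consistent with the usage reported at the top of this thread.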

Besides that, to improve TF model performance, you can also try online TF-TRT conversion by appending the following block to the model's config.pbtxt for Triton server:

optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
    parameters { key: "precision_mode" value: "FP16" }
    parameters { key: "max_workspace_size_bytes" value: "512000000" }
  } ]
}}

Hi @ersheng
I set tf_gpu_memory_fraction to smaller fractions to test the throughput, and it did not affect the throughput much. I still have to test some other models that showed this memory issue.
I had tried TF-TRT conversion earlier, but it failed due to an unsupported layer operation.
Could you point me to links on proper TF-TRT conversion for Triton server? Also, I could not find the online TF-TRT converter suggested above.



Have you tried this?

Hi @ersheng,
I tried the suggested TF-TRT guide, but it reported trt_engine_opts as 0. So I tried the changes to config.pbtxt suggested above for optimization, which converted portions of the graph to TRT engines.
I then had trouble running the model converted on the fly with the changed config, and I have created a separate topic for that. Here is the reference:

Basically, either the parameters given for the TF-TRT conversion or the model itself seems to be the issue.


Please open a new forum topic for your new question so that we can easily trace these topics.