DeepStream 6.0.1 Triton GRPC memory leak

• Hardware Platform (Jetson / GPU) : GPU
• DeepStream Version: 6.0.1 (nvcr.io/nvidia/deepstream:6.0.1-triton)
• TensorRT Version: 8.0.1
• NVIDIA GPU Driver Version: 470.82.00
• Issue Type: Bug, host memory leak
• How to reproduce the issue ?

I have tested the issue with the RTSP in-out Python sample. The backend is TensorRT. Running the sample with the default inference configuration files does not seem to leak memory, but replacing the nvinferserver configuration with one that uses a standalone Triton server instance over GRPC leaks memory consistently, roughly in proportion to the number of input sources and their resolution/bitrate. The Python process is the one that leaks, so it doesn't look like an issue on Triton's server side. The leak being proportional to the input size might indicate that the nvinferserver GRPC client isn't correctly freeing frames, or that it is buffering them indefinitely (does it even buffer/queue?).

I have only tested with the standalone Triton server running on the same machine as the DeepStream app, and I used YOLOv4 for the GRPC test. Updating the Triton server to r22.04 had no effect on the leak.
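For reference, the quick check I use to watch the Python process's resident memory over time is roughly the minimal sketch below (it just polls VmRSS from /proc; the PID of the deepstream_test1_rtsp_in_rtsp_out.py process is passed on the command line):

import sys
import time

def vmrss_kib(pid: int) -> int:
    # Read the VmRSS (resident set size) field from /proc/<pid>/status; the value is in kB
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

if __name__ == "__main__":
    pid = int(sys.argv[1])  # PID of the DeepStream Python process
    while True:
        print(time.strftime("%H:%M:%S"), vmrss_kib(pid), "kB")
        time.sleep(10)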

The Dockerfile and nvinferserver configuration I’ve used are as follows:

FROM nvcr.io/nvidia/deepstream:6.0.1-triton

ENV DEBIAN_FRONTEND=noninteractive
ENV TZ=Asia/Tokyo

WORKDIR /workspace

RUN apt-key del 7fa2af80
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub

RUN apt-get update && apt-get install -y --no-install-recommends \
    zip \
    python3-gi python3-dev python3-gst-1.0 python-gi-dev git python-dev \
    python3 python3-pip python3.8-dev cmake g++ build-essential libglib2.0-dev \
    libglib2.0-dev-bin libtool m4 autoconf automake \
    libgirepository1.0-dev \
    tzdata && \
    rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/NVIDIA-AI-IOT/deepstream_python_apps
WORKDIR /workspace/deepstream_python_apps
RUN git checkout 9bffad1aea802f6be4419712c0a50f05d6a2d490
RUN git submodule update --init
WORKDIR /workspace/deepstream_python_apps/3rdparty/gst-python/
RUN git config --global http.sslverify false
RUN sh autogen.sh
RUN make
RUN make install
WORKDIR /workspace/deepstream_python_apps/bindings
RUN mkdir build
WORKDIR /workspace/deepstream_python_apps/bindings/build
RUN cmake .. -DPYTHON_MAJOR_VERSION=3 -DPYTHON_MINOR_VERSION=8
RUN make
RUN pip3 install pyds-1.1.1-py3-none-linux_x86_64.whl

WORKDIR /workspace
RUN git clone https://github.com/NVIDIA-AI-IOT/yolov4_deepstream.git
RUN cp -r yolov4_deepstream/deepstream_yolov4 /opt/nvidia/deepstream/deepstream/sources/deepstream_yolov4
ENV CUDA_VER=11.4


WORKDIR /opt/nvidia/deepstream/deepstream/sources/deepstream_yolov4/nvdsinfer_custom_impl_Yolo
RUN make

infer_config {
    unique_id: 5
    gpu_ids: [0]
    max_batch_size: 1
    backend {
        triton {

            model_name: "main_model"
            version: -1
            grpc {
                url: "localhost:8001"
            }
        }
    }

    preprocess {
        network_format: IMAGE_FORMAT_RGB
        tensor_order: TENSOR_ORDER_LINEAR
        maintain_aspect_ratio: 0
        normalize {
            scale_factor: 0.003921569
            channel_offsets: [0, 0, 0]
        }
    }

    postprocess {
        labelfile_path: "labels.txt"
        detection {
            num_detected_classes: 80
            custom_parse_bbox_func: "NvDsInferParseCustomYoloV4"
            nms {
                confidence_threshold: 0.3
                iou_threshold: 0.5
                topk: 100
            }
        }
    }

    extra {
        copy_input_to_host_buffers: false
    }

    custom_lib {
        path: "/opt/nvidia/deepstream/deepstream/sources/deepstream_yolov4/nvdsinfer_custom_impl_Yolo/libnvdsinfer_custom_impl_Yolo.so"
    }
}

input_control {
    process_mode: PROCESS_MODE_FULL_FRAME
    interval: 0
}

output_control {
    output_tensor_meta: true
}
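
On the Triton side, main_model is just a regular model repository entry for a TensorRT plan with its own config.pbtxt. As a rough illustration only (the input tensor name and dims below are placeholders rather than my exact values), that config has roughly this shape:

name: "main_model"
platform: "tensorrt_plan"
max_batch_size: 1
input [
  {
    name: "input"          # placeholder tensor name
    data_type: TYPE_FP32
    dims: [ 3, 416, 416 ]  # placeholder network resolution
  }
]
output [
  {
    name: "confs"
    data_type: TYPE_FP32
    dims: [ 16128, 80 ]    # per-box class scores; 80 would correspond to the public COCO YOLOv4
  }
  # (other output tensors omitted)
]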

Hi @g-ogawa,
do you mean you have a DeepStream Python application based on

and the nvinferserver config above?

From your Dockerfile it seems this is not enough to reproduce the setup; could you share a complete setup?

Thanks!

Hello @mchi, thanks for the response.

While my final application is different, I have observed the leak using the linked application (deepstream-rtsp-in-rtsp-out) as is, with no modifications other than switching the nvinferserver config to one that uses GRPC and YOLOv4. And yes, that is also the repository I used to compile the post-processing plugin for YOLOv4.

Right now I only have a converted retrained model that I cannot publish, but if needed I can convert the publicly available YOLOv4 model and tweak the Dockerfile so the leaking experiment is easier to run. Would that help?

Yes, any repro of the memory leak you found is helpful.

Thanks!

I’ve made a simple script that runs the Triton instance and the deepstream-rtsp-in-rtsp-out app, and a Dockerfile to build an image that runs that script. Due to the model size I couldn’t attach it here, so I made it available on this Google Drive link. Please let me know when it’s downloaded so I can take it down.

I've included instructions on how to run it, but it should be enough to just build the image and run a container. An RTSP source needs to be provided, though the app also works with file:// inputs, so a sufficiently long video could be used instead.
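
Roughly, the script does the equivalent of the following simplified sketch (the model repository path and the readiness poll match what shows up in the logs later in this thread; treat it as an illustration, not the exact script):

import subprocess
import time

import requests

# Start a standalone Triton server pointing at the model repository baked into the image
triton = subprocess.Popen(["tritonserver", "--model-repository=/workspace/triton_server"])

# Wait until main_model reports ready over Triton's HTTP endpoint
while True:
    try:
        if requests.get("http://localhost:8000/v2/models/main_model/ready").status_code == 200:
            break
    except requests.ConnectionError as err:
        print(err)
    time.sleep(1)
print("Triton is ready")

# Run the DeepStream sample against the GRPC nvinferserver config
subprocess.run([
    "python3", "deepstream_test1_rtsp_in_rtsp_out.py",
    "-i", "rtsp://<your-stream>",  # or a file:// URI
    "-g", "nvinferserver",
])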

I have included here the container memory usage when running the app for 30 minutes with a 1080p 10fps 4000 kb/s h264 input RTSP stream.

While this level of memory increase is tolerable for a single stream, we'd like to run quite a few streams continuously on a system that won't be restarted very often, so a steadily growing memory footprint would be a problem.
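
(For context, the container-level numbers above were collected with something along these lines; memtest is just the name I give the container when starting it, and the exact logging is only a sketch.)

import subprocess
import time

CONTAINER = "memtest"  # container name passed to `docker run --name=memtest ...`

while True:
    # Ask Docker for the container's current memory usage, e.g. "1.234GiB / 31.35GiB"
    mem = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", CONTAINER],
        capture_output=True, text=True,
    ).stdout.strip()
    print(time.strftime("%Y-%m-%d %H:%M:%S"), mem)
    time.sleep(30)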

We hit an issue when tritonserver loads your model;
can you run the model properly on your side?

Source is file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4
Starting Triton…
I0519 11:24:34.845895 1368 metrics.cc:290] Collecting metrics for GPU 0: Tesla T4
I0519 11:24:34.846343 1368 shared_library.cc:108] OpenLibraryHandle: /opt/tritonserver/backends/pytorch/libtriton_pytorch.so
I0519 11:24:35.197487 1368 libtorch.cc:1029] TRITONBACKEND_Initialize: pytorch
I0519 11:24:35.197515 1368 libtorch.cc:1039] Triton TRITONBACKEND API version: 1.4
I0519 11:24:35.197521 1368 libtorch.cc:1045] ‘pytorch’ TRITONBACKEND API version: 1.4
I0519 11:24:35.197593 1368 shared_library.cc:108] OpenLibraryHandle: /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so
2022-05-19 20:24:35.355798: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0519 11:24:35.396670 1368 tensorflow.cc:2169] TRITONBACKEND_Initialize: tensorflow
I0519 11:24:35.396726 1368 tensorflow.cc:2179] Triton TRITONBACKEND API version: 1.4
I0519 11:24:35.396742 1368 tensorflow.cc:2185] ‘tensorflow’ TRITONBACKEND API version: 1.4
I0519 11:24:35.396763 1368 tensorflow.cc:2209] backend configuration:
{}
I0519 11:24:35.396845 1368 shared_library.cc:108] OpenLibraryHandle: /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so
I0519 11:24:35.399432 1368 onnxruntime.cc:1970] TRITONBACKEND_Initialize: onnxruntime
I0519 11:24:35.399450 1368 onnxruntime.cc:1980] Triton TRITONBACKEND API version: 1.4
I0519 11:24:35.399455 1368 onnxruntime.cc:1986] ‘onnxruntime’ TRITONBACKEND API version: 1.4
I0519 11:24:35.409558 1368 shared_library.cc:108] OpenLibraryHandle: /opt/tritonserver/backends/openvino/libtriton_openvino.so
I0519 11:24:35.423172 1368 openvino.cc:1193] TRITONBACKEND_Initialize: openvino
I0519 11:24:35.423191 1368 openvino.cc:1203] Triton TRITONBACKEND API version: 1.4
I0519 11:24:35.423197 1368 openvino.cc:1209] ‘openvino’ TRITONBACKEND API version: 1.4
I0519 11:24:35.854411 1368 pinned_memory_manager.cc:240] Pinned memory pool is created at ‘0x7ffa70000000’ with size 268435456
I0519 11:24:35.856374 1368 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0519 11:24:35.858400 1368 backend_factory.h:45] Create TritonBackendFactory
I0519 11:24:35.858421 1368 plan_backend_factory.cc:49] Create PlanBackendFactory
I0519 11:24:35.858427 1368 plan_backend_factory.cc:56] Registering TensorRT Plugins
I0519 11:24:35.858476 1368 logging.cc:52] Registered plugin creator - ::GridAnchor_TRT version 1
I0519 11:24:35.858494 1368 logging.cc:52] Registered plugin creator - ::GridAnchorRect_TRT version 1
I0519 11:24:35.858508 1368 logging.cc:52] Registered plugin creator - ::NMS_TRT version 1
I0519 11:24:35.858524 1368 logging.cc:52] Registered plugin creator - ::Reorg_TRT version 1
I0519 11:24:35.858534 1368 logging.cc:52] Registered plugin creator - ::Region_TRT version 1
I0519 11:24:35.858554 1368 logging.cc:52] Registered plugin creator - ::Clip_TRT version 1
I0519 11:24:35.858564 1368 logging.cc:52] Registered plugin creator - ::LReLU_TRT version 1
I0519 11:24:35.858575 1368 logging.cc:52] Registered plugin creator - ::PriorBox_TRT version 1
I0519 11:24:35.858586 1368 logging.cc:52] Registered plugin creator - ::Normalize_TRT version 1
I0519 11:24:35.858611 1368 logging.cc:52] Registered plugin creator - ::ScatterND version 1
I0519 11:24:35.858625 1368 logging.cc:52] Registered plugin creator - ::RPROI_TRT version 1
I0519 11:24:35.858636 1368 logging.cc:52] Registered plugin creator - ::BatchedNMS_TRT version 1
I0519 11:24:35.858645 1368 logging.cc:52] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
I0519 11:24:35.858672 1368 logging.cc:52] Registered plugin creator - ::FlattenConcat_TRT version 1
I0519 11:24:35.858689 1368 logging.cc:52] Registered plugin creator - ::CropAndResize version 1
I0519 11:24:35.858704 1368 logging.cc:52] Registered plugin creator - ::DetectionLayer_TRT version 1
I0519 11:24:35.858715 1368 logging.cc:52] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
I0519 11:24:35.858739 1368 logging.cc:52] Registered plugin creator - ::EfficientNMS_TRT version 1
I0519 11:24:35.858752 1368 logging.cc:52] Registered plugin creator - ::Proposal version 1
I0519 11:24:35.858762 1368 logging.cc:52] Registered plugin creator - ::ProposalLayer_TRT version 1
I0519 11:24:35.858785 1368 logging.cc:52] Registered plugin creator - ::PyramidROIAlign_TRT version 1
I0519 11:24:35.858800 1368 logging.cc:52] Registered plugin creator - ::ResizeNearest_TRT version 1
I0519 11:24:35.858814 1368 logging.cc:52] Registered plugin creator - ::Split version 1
I0519 11:24:35.858831 1368 logging.cc:52] Registered plugin creator - ::SpecialSlice_TRT version 1
I0519 11:24:35.858851 1368 logging.cc:52] Registered plugin creator - ::InstanceNormalization_TRT version 1
I0519 11:24:35.858863 1368 ensemble_backend_factory.cc:47] Create EnsembleBackendFactory
I0519 11:24:35.859050 1368 autofill.cc:138] TensorFlow SavedModel autofill: Internal: unable to autofill for '1' due to no version directories
I0519 11:24:35.859078 1368 autofill.cc:151] TensorFlow GraphDef autofill: Internal: unable to autofill for '1' due to no version directories
I0519 11:24:35.859101 1368 autofill.cc:164] PyTorch autofill: Internal: unable to autofill for '1' due to no version directories
I0519 11:24:35.859129 1368 autofill.cc:196] ONNX autofill: Internal: unable to autofill for '1' due to no version directories
I0519 11:24:35.859153 1368 autofill.cc:209] TensorRT autofill: Internal: unable to autofill for '1' due to no version directories
W0519 11:24:35.859161 1368 autofill.cc:243] Proceeding with simple config for now
I0519 11:24:35.859170 1368 model_config_utils.cc:637] autofilled config: name: "1"

E0519 11:24:35.859913 1368 model_repository_manager.cc:1919] Poll failed for model directory '1': unexpected platform type for 1
I0519 11:24:35.859953 1368 server.cc:504]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0519 11:24:35.860049 1368 server.cc:543]
+-------------+-----------------------------------------------------------------+--------+
| Backend     | Path                                                            | Config |
+-------------+-----------------------------------------------------------------+--------+
| tensorrt    | <built-in>                                                      | {}     |
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so         | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {}     |
| openvino    | /opt/tritonserver/backends/openvino/libtriton_openvino.so       | {}     |
+-------------+-----------------------------------------------------------------+--------+

I0519 11:24:35.860065 1368 model_repository_manager.cc:570] BackendStates()
I0519 11:24:35.860083 1368 server.cc:586]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+

I0519 11:24:35.860201 1368 tritonserver.cc:1718]
+----------------------------------+------------------------------------------+
| Option                           | Value                                    |
+----------------------------------+------------------------------------------+
| server_id                        | triton                                   |
| server_version                   | 2.13.0                                   |
| server_extensions                | classification sequence model_repository |
|                                  | model_repository(unload_dependents)      |
|                                  | schedule_policy model_configuration      |
|                                  | system_shared_memory cuda_shared_memory  |
|                                  | binary_tensor_data statistics            |
| model_repository_path[0]         | /workspace/triton_server/main_model/     |
| model_control_mode               | MODE_NONE                                |
| strict_model_config              | 0                                        |
| pinned_memory_pool_byte_size     | 268435456                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                 |
| min_supported_compute_capability | 6.0                                      |
| strict_readiness                 | 1                                        |
| exit_timeout                     | 30                                       |
+----------------------------------+------------------------------------------+

I0519 11:24:35.860233 1368 server.cc:234] Waiting for in-flight requests to complete.
I0519 11:24:35.860241 1368 model_repository_manager.cc:534] LiveBackendStates()
I0519 11:24:35.860248 1368 server.cc:249] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
I0519 11:24:35.860258 1368 triton_backend_manager.cc:101] unloading backend 'pytorch'
I0519 11:24:35.860266 1368 triton_backend_manager.cc:101] unloading backend 'tensorflow'
I0519 11:24:35.860278 1368 triton_backend_manager.cc:101] unloading backend 'onnxruntime'
I0519 11:24:35.860292 1368 triton_backend_manager.cc:101] unloading backend 'openvino'
error: creating server: Internal - failed to load all models

| model_repository_path[0] | /workspace/triton_server/main_model/ |

It seems you edited the Triton command to be more verbose but, in doing so, accidentally changed the repository path from /workspace/triton_server to /workspace/triton_server/main_model, so Triton is trying to read the version '1' folder as a model folder. Could you verify that?
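
For reference, the model repository inside the image follows the standard Triton layout, so --model-repository should point at the repository root rather than at the model folder:

/workspace/triton_server/
└── main_model/
    ├── config.pbtxt
    └── 1/
        └── model.plan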

My output is as follows:


 docker run --gpus all -it --rm --name=memtest dstest                            


=============================
== Triton Inference Server ==
=============================

NVIDIA Release 21.08 (build 26170506)

Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
find: File system loop detected; '/usr/bin/X11' is part of the same file system loop as '/usr/bin'.

Source is file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4
Starting Triton...
I0520 01:16:41.068271 44 metrics.cc:290] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3080 Laptop GPU
I0520 01:16:41.407833 44 libtorch.cc:1029] TRITONBACKEND_Initialize: pytorch
I0520 01:16:41.407853 44 libtorch.cc:1039] Triton TRITONBACKEND API version: 1.4
I0520 01:16:41.407857 44 libtorch.cc:1045] 'pytorch' TRITONBACKEND API version: 1.4
2022-05-20 10:16:41.573706: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0520 01:16:41.657609 44 tensorflow.cc:2169] TRITONBACKEND_Initialize: tensorflow
I0520 01:16:41.657641 44 tensorflow.cc:2179] Triton TRITONBACKEND API version: 1.4
I0520 01:16:41.657653 44 tensorflow.cc:2185] 'tensorflow' TRITONBACKEND API version: 1.4
I0520 01:16:41.657658 44 tensorflow.cc:2209] backend configuration:
{}
I0520 01:16:41.681202 44 onnxruntime.cc:1970] TRITONBACKEND_Initialize: onnxruntime
I0520 01:16:41.681234 44 onnxruntime.cc:1980] Triton TRITONBACKEND API version: 1.4
I0520 01:16:41.681291 44 onnxruntime.cc:1986] 'onnxruntime' TRITONBACKEND API version: 1.4
I0520 01:16:41.713993 44 openvino.cc:1193] TRITONBACKEND_Initialize: openvino
I0520 01:16:41.714015 44 openvino.cc:1203] Triton TRITONBACKEND API version: 1.4
I0520 01:16:41.714023 44 openvino.cc:1209] 'openvino' TRITONBACKEND API version: 1.4
I0520 01:16:41.869380 44 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fac8a000000' with size 268435456
I0520 01:16:41.869950 44 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0520 01:16:41.872789 44 model_repository_manager.cc:1045] loading: main_model:1
HTTPConnectionPool(host='localhost', port=8000): Max retries exceeded with url: /v2/models/main_model/ready (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f19666cc070>: Failed to establish a new connection: [Errno 111] Connection refused'))
I0520 01:16:42.745015 44 logging.cc:49] [MemUsageChange] Init CUDA: CPU +525, GPU +0, now: CPU 796, GPU 488 (MiB)
I0520 01:16:42.746190 44 logging.cc:49] Loaded engine size: 151 MB
I0520 01:16:42.746309 44 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 796 MiB, GPU 488 MiB
HTTPConnectionPool(host='localhost', port=8000): Max retries exceeded with url: /v2/models/main_model/ready (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f19666cc880>: Failed to establish a new connection: [Errno 111] Connection refused'))
W0520 01:16:43.031363 44 logging.cc:46] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
W0520 01:16:43.071772 44 metrics.cc:395] Unable to get power limit for GPU 0: Success
HTTPConnectionPool(host='localhost', port=8000): Max retries exceeded with url: /v2/models/main_model/ready (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1966630070>: Failed to establish a new connection: [Errno 111] Connection refused'))
I0520 01:16:43.932277 44 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +757, GPU +324, now: CPU 1578, GPU 940 (MiB)
I0520 01:16:44.631479 44 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +322, GPU +322, now: CPU 1900, GPU 1262 (MiB)
I0520 01:16:44.632937 44 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1900, GPU 1244 (MiB)
I0520 01:16:44.633055 44 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1900 MiB, GPU 1244 MiB
I0520 01:16:44.633068 44 plan_backend.cc:456] Creating instance main_model_0_0_gpu0 on GPU 0 (8.6) using model.plan
I0520 01:16:44.634922 44 logging.cc:49] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1900 MiB, GPU 1244 MiB
I0520 01:16:44.635645 44 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1900, GPU 1254 (MiB)
I0520 01:16:44.636879 44 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1900, GPU 1262 (MiB)
I0520 01:16:44.638834 44 logging.cc:49] [MemUsageSnapshot] ExecutionContext creation end: CPU 1901 MiB, GPU 1436 MiB
I0520 01:16:44.639410 44 plan_backend.cc:859] Created instance main_model_0_0_gpu0 on GPU 0 with stream priority 0 and optimization profile default[0];
I0520 01:16:44.644311 44 model_repository_manager.cc:1212] successfully loaded 'main_model' version 1
I0520 01:16:44.644404 44 server.cc:504] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0520 01:16:44.644547 44 server.cc:543] 
+-------------+-----------------------------------------------------------------+--------+
| Backend     | Path                                                            | Config |
+-------------+-----------------------------------------------------------------+--------+
| tensorrt    | <built-in>                                                      | {}     |
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so         | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {}     |
| openvino    | /opt/tritonserver/backends/openvino/libtriton_openvino.so       | {}     |
+-------------+-----------------------------------------------------------------+--------+

I0520 01:16:44.644609 44 server.cc:586] 
+------------+---------+--------+
| Model      | Version | Status |
+------------+---------+--------+
| main_model | 1       | READY  |
+------------+---------+--------+

I0520 01:16:44.644719 44 tritonserver.cc:1718] 
+----------------------------------+----------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                    |
+----------------------------------+----------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                   |
| server_version                   | 2.13.0                                                                                                   |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_confi |
|                                  | guration system_shared_memory cuda_shared_memory binary_tensor_data statistics                           |
| model_repository_path[0]         | /workspace/triton_server                                                                                 |
| model_control_mode               | MODE_NONE                                                                                                |
| strict_model_config              | 1                                                                                                        |
| pinned_memory_pool_byte_size     | 268435456                                                                                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                 |
| min_supported_compute_capability | 6.0                                                                                                      |
| strict_readiness                 | 1                                                                                                        |
| exit_timeout                     | 30                                                                                                       |
+----------------------------------+----------------------------------------------------------------------------------------------------------+

I0520 01:16:44.648267 44 grpc_server.cc:4111] Started GRPCInferenceService at 0.0.0.0:8001
I0520 01:16:44.649076 44 http_server.cc:2803] Started HTTPService at 0.0.0.0:8000
I0520 01:16:44.691495 44 http_server.cc:162] Started Metrics Service at 0.0.0.0:8002
Triton is ready
Starting rtsp-in-rtsp-out app (Ctrl+C to terminate)...
command:
python3 deepstream_test1_rtsp_in_rtsp_out.py -i file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4 -g nvinferserver
deepstream_test1_rtsp_in_rtsp_out.py:205: PyGIDeprecationWarning: Since version 3.11, calling threads_init is no longer needed. See: https://wiki.gnome.org/PyGObject/Threading
  GObject.threads_init()
W0520 01:16:45.073782 44 metrics.cc:395] Unable to get power limit for GPU 0: Success

(gst-plugin-scanner:76): GStreamer-WARNING **: 10:16:45.208: Failed to load plugin '/usr/lib/x86_64-linux-gnu/gstreamer-1.0/deepstream/libnvdsgst_udp.so': librivermax.so.0: cannot open shared object file: No such file or directory

(gst-plugin-scanner:76): GLib-GObject-WARNING **: 10:16:45.295: specified class size for type 'GstCompositor' is smaller than the parent type's 'GstVideoAggregator' class size

(gst-plugin-scanner:76): GLib-GObject-CRITICAL **: 10:16:45.295: g_type_add_interface_static: assertion 'G_TYPE_IS_INSTANTIATABLE (instance_type)' failed

(gst-plugin-scanner:76): GLib-CRITICAL **: 10:16:45.295: g_once_init_leave: assertion 'result != 0' failed

(gst-plugin-scanner:76): GStreamer-CRITICAL **: 10:16:45.295: gst_element_register: assertion 'g_type_is_a (type, GST_TYPE_ELEMENT)' failed
Creating Pipeline 
 
Creating streamux 
 
Creating source_bin  0  
 
Creating source bin
source-bin-00
Creating Pgie 
 
Creating tiler 
 
Creating nvvidconv 
 
Creating nvosd 
 
Creating H264 Encoder
Creating H264 rtppay
WARNING: Overriding infer-config batch-size 0  with number of sources  1  

Adding elements to Pipeline 

deepstream_test1_rtsp_in_rtsp_out.py:360: PyGIDeprecationWarning: GObject.MainLoop is deprecated; use GLib.MainLoop instead
  loop = GObject.MainLoop()

 *** DeepStream: Launched RTSP Streaming at rtsp://localhost:8554/ds-test ***


Starting pipeline 

INFO: infer_grpc_backend.cpp:164 TritonGrpcBackend id:5 initialized for model: main_model
Decodebin child added: source 

Decodebin child added: decodebin0 

Decodebin child added: qtdemux0 

Decodebin child added: multiqueue0 

Decodebin child added: h264parse0 

Decodebin child added: capsfilter0 

Decodebin child added: aacparse0 

Decodebin child added: avdec_aac0 

Decodebin child added: nvv4l2decoder0 

In cb_newpad

gstname= video/x-raw
features= <Gst.CapsFeatures object at 0x7fd58094f700 (GstCapsFeatures at 0x1367e80)>
In cb_newpad

gstname= audio/x-raw
W0520 01:16:47.079687 44 metrics.cc:395] Unable to get power limit for GPU 0: Success
End-of-stream
Exiting
Signal (15) received.
I0520 01:17:34.258507 44 server.cc:234] Waiting for in-flight requests to complete.
I0520 01:17:34.258525 44 model_repository_manager.cc:1078] unloading: main_model:1
I0520 01:17:34.258634 44 server.cc:249] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I0520 01:17:34.263544 44 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1766, GPU 1408 (MiB)

It should not be related to the --verbose-log option;
I verified without this option and observed the same error.
It seems the engine file needs to be provided for our device, since the engine you provided is not applicable for us, hence the error. Can you observe the memory leak issue with the built-in Primary_Detector model (resnet10.caffemodel)?

Source is file:///opt/nvidia/deepstream/deepstream/samples/streams/sample_1080p_h264.mp4
Starting Triton…
I0520 08:41:33.618257 1729 metrics.cc:290] Collecting metrics for GPU 0: Tesla T4
I0520 08:41:33.986848 1729 libtorch.cc:1029] TRITONBACKEND_Initialize: pytorch
I0520 08:41:33.986883 1729 libtorch.cc:1039] Triton TRITONBACKEND API version: 1.4
I0520 08:41:33.986892 1729 libtorch.cc:1045] ‘pytorch’ TRITONBACKEND API version: 1.4
2022-05-20 17:41:34.173451: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0520 08:41:34.226197 1729 tensorflow.cc:2169] TRITONBACKEND_Initialize: tensorflow
I0520 08:41:34.226231 1729 tensorflow.cc:2179] Triton TRITONBACKEND API version: 1.4
I0520 08:41:34.226237 1729 tensorflow.cc:2185] ‘tensorflow’ TRITONBACKEND API version: 1.4
I0520 08:41:34.226243 1729 tensorflow.cc:2209] backend configuration:
{}
I0520 08:41:34.228962 1729 onnxruntime.cc:1970] TRITONBACKEND_Initialize: onnxruntime
I0520 08:41:34.228982 1729 onnxruntime.cc:1980] Triton TRITONBACKEND API version: 1.4
I0520 08:41:34.228988 1729 onnxruntime.cc:1986] ‘onnxruntime’ TRITONBACKEND API version: 1.4
I0520 08:41:34.252625 1729 openvino.cc:1193] TRITONBACKEND_Initialize: openvino
I0520 08:41:34.252642 1729 openvino.cc:1203] Triton TRITONBACKEND API version: 1.4
I0520 08:41:34.252648 1729 openvino.cc:1209] ‘openvino’ TRITONBACKEND API version: 1.4
I0520 08:41:34.671360 1729 pinned_memory_manager.cc:240] Pinned memory pool is created at ‘0x7f4ff8000000’ with size 268435456
I0520 08:41:34.673622 1729 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0520 08:41:34.679040 1729 model_repository_manager.cc:1045] loading: main_model:1
I0520 08:41:35.488369 1729 logging.cc:49] [MemUsageChange] Init CUDA: CPU +320, GPU +0, now: CPU 591, GPU 842 (MiB)
I0520 08:41:35.489708 1729 logging.cc:49] Loaded engine size: 151 MB
I0520 08:41:35.489778 1729 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 591 MiB, GPU 842 MiB
E0520 08:41:35.905863 1729 logging.cc:43] 6: The engine plan file is generated on an incompatible device, expecting compute 7.5 got compute 8.6, please rebuild.
E0520 08:41:35.906119 1729 logging.cc:43] 4: [runtime.cpp::deserializeCudaEngine::75] Error Code 4: Internal Error (Engine deserialization failed.)
E0520 08:41:35.913520 1729 model_repository_manager.cc:1215] failed to load 'main_model' version 1: Internal: unable to create TensorRT engine
I0520 08:41:35.913885 1729 server.cc:504]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0520 08:41:35.914111 1729 server.cc:543]
+-------------+-----------------------------------------------------------------+--------+
| Backend     | Path                                                            | Config |
+-------------+-----------------------------------------------------------------+--------+
| tensorrt    | <built-in>                                                      | {}     |
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so         | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {}     |
| openvino    | /opt/tritonserver/backends/openvino/libtriton_openvino.so       | {}     |
+-------------+-----------------------------------------------------------------+--------+

I0520 08:41:35.914228 1729 server.cc:586]
+------------+---------+----------------------------------------------------------+
| Model      | Version | Status                                                   |
+------------+---------+----------------------------------------------------------+
| main_model | 1       | UNAVAILABLE: Internal: unable to create TensorRT engine  |
+------------+---------+----------------------------------------------------------+

I0520 08:41:35.914592 1729 tritonserver.cc:1718]
+----------------------------------+------------------------------------------+
| Option                           | Value                                    |
+----------------------------------+------------------------------------------+
| server_id                        | triton                                   |
| server_version                   | 2.13.0                                   |
| server_extensions                | classification sequence model_repository |
|                                  | model_repository(unload_dependents)      |
|                                  | schedule_policy model_configuration      |
|                                  | system_shared_memory cuda_shared_memory  |
|                                  | binary_tensor_data statistics            |
| model_repository_path[0]         | /workspace/triton_server/                |
| model_control_mode               | MODE_NONE                                |
| strict_model_config              | 1                                        |
| pinned_memory_pool_byte_size     | 268435456                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                 |
| min_supported_compute_capability | 6.0                                      |
| strict_readiness                 | 1                                        |
| exit_timeout                     | 30                                       |
+----------------------------------+------------------------------------------+

I0520 08:41:35.914664 1729 server.cc:234] Waiting for in-flight requests to complete.
I0520 08:41:35.914689 1729 server.cc:249] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
Failed to start Triton

I have made a compute capability 7.5 engine available here; I could also provide the ONNX model if necessary.
I haven’t tried using other models.

UNAVAILABLE: Invalid argument: model 'main_model_0_0_gpu0', tensor 'confs': the model expects 3 dimensions (shape [1,16128,80]) but the model configuration specifies 3 dimensions (an initial batch dimension because max_batch_size > 0 followed by the explicit tensor shape, making complete shape [-1,16128,2])

If I change
dims: [16128, 2]
to
dims: [16128, 80]
the model loading failure goes away, but I hit another issue:
features= <Gst.CapsFeatures object at 0x7fb59f4af5e0 (GstCapsFeatures at 0x7fb4f802e9a0)>
scores.inferDims.d[1]:80
python3: nvdsparsebbox_Yolo.cpp:145: bool NvDsInferParseCustomYoloV4(const std::vector&, const NvDsInferNetworkInfo&, const NvDsInferParseDetectionParams&, std::vector&): Assertion `detectionParams.numClassesConfigured == scores.inferDims.d[1]' failed.
Aborted (core dumped)

Starting Triton…
I0524 07:48:22.406852 19259 metrics.cc:290] Collecting metrics for GPU 0: Tesla T4
I0524 07:48:22.770141 19259 libtorch.cc:1029] TRITONBACKEND_Initialize: pytorch
I0524 07:48:22.770183 19259 libtorch.cc:1039] Triton TRITONBACKEND API version: 1.4
I0524 07:48:22.770189 19259 libtorch.cc:1045] ‘pytorch’ TRITONBACKEND API version: 1.4
HTTPConnectionPool(host=‘localhost’, port=8000): Max retries exceeded with url: /v2/models/main_model/ready (Caused by NewConnectionError(‘<urllib3.connection.HTTPConnection object at 0x7eff4df86100>: Failed to establish a new connection: [Errno 111] Connection refused’))
2022-05-24 16:48:22.950463: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0524 07:48:23.003441 19259 tensorflow.cc:2169] TRITONBACKEND_Initialize: tensorflow
I0524 07:48:23.003476 19259 tensorflow.cc:2179] Triton TRITONBACKEND API version: 1.4
I0524 07:48:23.003484 19259 tensorflow.cc:2185] ‘tensorflow’ TRITONBACKEND API version: 1.4
I0524 07:48:23.003495 19259 tensorflow.cc:2209] backend configuration:
{}
I0524 07:48:23.005835 19259 onnxruntime.cc:1970] TRITONBACKEND_Initialize: onnxruntime
I0524 07:48:23.005868 19259 onnxruntime.cc:1980] Triton TRITONBACKEND API version: 1.4
I0524 07:48:23.005876 19259 onnxruntime.cc:1986] ‘onnxruntime’ TRITONBACKEND API version: 1.4
I0524 07:48:23.029609 19259 openvino.cc:1193] TRITONBACKEND_Initialize: openvino
I0524 07:48:23.029637 19259 openvino.cc:1203] Triton TRITONBACKEND API version: 1.4
I0524 07:48:23.029643 19259 openvino.cc:1209] ‘openvino’ TRITONBACKEND API version: 1.4
I0524 07:48:23.749998 19259 pinned_memory_manager.cc:240] Pinned memory pool is created at ‘0x7f0eb0000000’ with size 268435456
I0524 07:48:23.752780 19259 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0524 07:48:23.759485 19259 model_repository_manager.cc:1045] loading: main_model:1
HTTPConnectionPool(host=‘localhost’, port=8000): Max retries exceeded with url: /v2/models/main_model/ready (Caused by NewConnectionError(‘<urllib3.connection.HTTPConnection object at 0x7eff4df86910>: Failed to establish a new connection: [Errno 111] Connection refused’))
I0524 07:48:24.910203 19259 logging.cc:49] [MemUsageChange] Init CUDA: CPU +320, GPU +0, now: CPU 577, GPU 5768 (MiB)
I0524 07:48:24.911742 19259 logging.cc:49] Loaded engine size: 137 MB
I0524 07:48:24.911829 19259 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 577 MiB, GPU 5768 MiB
HTTPConnectionPool(host=‘localhost’, port=8000): Max retries exceeded with url: /v2/models/main_model/ready (Caused by NewConnectionError(‘<urllib3.connection.HTTPConnection object at 0x7eff4dee7100>: Failed to establish a new connection: [Errno 111] Connection refused’))
W0524 07:48:25.369252 19259 logging.cc:46] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
HTTPConnectionPool(host=‘localhost’, port=8000): Max retries exceeded with url: /v2/models/main_model/ready (Caused by NewConnectionError(‘<urllib3.connection.HTTPConnection object at 0x7eff4dee78b0>: Failed to establish a new connection: [Errno 111] Connection refused’))
I0524 07:48:26.625039 19259 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +491, GPU +212, now: CPU 1086, GPU 6108 (MiB)
HTTPConnectionPool(host=‘localhost’, port=8000): Max retries exceeded with url: /v2/models/main_model/ready (Caused by NewConnectionError(‘<urllib3.connection.HTTPConnection object at 0x7eff4def10a0>: Failed to establish a new connection: [Errno 111] Connection refused’))
I0524 07:48:27.590636 19259 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +287, GPU +200, now: CPU 1373, GPU 6308 (MiB)
I0524 07:48:27.592647 19259 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1373, GPU 6290 (MiB)
I0524 07:48:27.592764 19259 logging.cc:49] [MemUsageSnapshot] deserializeCudaEngine end: CPU 1373 MiB, GPU 6290 MiB
I0524 07:48:27.592777 19259 plan_backend.cc:456] Creating instance main_model_0_0_gpu0 on GPU 0 (7.5) using model.plan
I0524 07:48:27.602969 19259 logging.cc:49] [MemUsageSnapshot] ExecutionContext creation begin: CPU 1373 MiB, GPU 6290 MiB
I0524 07:48:27.605845 19259 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 1373, GPU 6300 (MiB)
I0524 07:48:27.608389 19259 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1373, GPU 6308 (MiB)
I0524 07:48:27.611137 19259 logging.cc:49] [MemUsageSnapshot] ExecutionContext creation end: CPU 1374 MiB, GPU 6484 MiB
I0524 07:48:27.615528 19259 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1373, GPU 6340 (MiB)
E0524 07:48:27.650387 19259 model_repository_manager.cc:1215] failed to load 'main_model' version 1: Invalid argument: model 'main_model_0_0_gpu0', tensor 'confs': the model expects 3 dimensions (shape [1,16128,80]) but the model configuration specifies 3 dimensions (an initial batch dimension because max_batch_size > 0 followed by the explicit tensor shape, making complete shape [-1,16128,2])
I0524 07:48:27.650732 19259 server.cc:504]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0524 07:48:27.650954 19259 server.cc:543]
+-------------+-----------------------------------------------------------------+--------+
| Backend     | Path                                                            | Config |
+-------------+-----------------------------------------------------------------+--------+
| tensorrt    | <built-in>                                                      | {}     |
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so         | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {}     |
| openvino    | /opt/tritonserver/backends/openvino/libtriton_openvino.so       | {}     |
+-------------+-----------------------------------------------------------------+--------+

I0524 07:48:27.651088 19259 server.cc:586]
+------------+---------+------------------------------------------------------------------------+
| Model      | Version | Status                                                                 |
+------------+---------+------------------------------------------------------------------------+
| main_model | 1       | UNAVAILABLE: Invalid argument: model 'main_model_0_0_gpu0', tensor     |
|            |         | 'confs': the model expects 3 dimensions (shape [1,16128,80]) but the   |
|            |         | model configuration specifies 3 dimensions (an initial batch dimension |
|            |         | because max_batch_size > 0 followed by the explicit tensor shape,      |
|            |         | making complete shape [-1,16128,2])                                    |
+------------+---------+------------------------------------------------------------------------+

I0524 07:48:27.651358 19259 tritonserver.cc:1718]
+----------------------------------+------------------------------------------+
| Option                           | Value                                    |
+----------------------------------+------------------------------------------+
| server_id                        | triton                                   |
| server_version                   | 2.13.0                                   |
| server_extensions                | classification sequence model_repository |
|                                  | model_repository(unload_dependents)      |
|                                  | schedule_policy model_configuration      |
|                                  | system_shared_memory cuda_shared_memory  |
|                                  | binary_tensor_data statistics            |
| model_repository_path[0]         | /workspace/triton_server/                |
| model_control_mode               | MODE_NONE                                |
| strict_model_config              | 1                                        |
| pinned_memory_pool_byte_size     | 268435456                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                 |
| min_supported_compute_capability | 6.0                                      |
| strict_readiness                 | 1                                        |
| exit_timeout                     | 30                                       |
+----------------------------------+------------------------------------------+

I0524 07:48:27.651426 19259 server.cc:234] Waiting for in-flight requests to complete.
I0524 07:48:27.651454 19259 server.cc:249] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

I apologize, it seems I used the incorrect ONNX file to build the engine when using a different environment. I have updated the file at the same link and put up the ONNX model here, just in case.

It finally runs well, but I cannot reproduce the memory leak. I used the script from DeepStream SDK FAQ - Intelligent Video Analytics / DeepStream SDK - NVIDIA Developer Forums to check the HW & SW memory log.
Can you specify how you check memory status?

|PID: 24300 18:43:48|Hardware memory: Total: 0.0000 KiB Free: 0.0000 KiB Client: 0.0000 KiB|VmSize: 31.9844 MiB|VmRSS: 22.1758 MiB|RssFile: 10.2617 MiB RssAnon: 11.9141 MiB|lsof: 0|
|PID: 24300 19:45:47|Hardware memory: Total: 0.0000 KiB Free: 0.0000 KiB Client: 0.0000 KiB|VmSize: 31.9844 MiB|VmRSS: 22.1758 MiB|RssFile: 10.2617 MiB||

I check by measuring how much memory the Docker container as a whole is using, with an RTSP stream as input so I can observe over a long period of time. The results for about one hour of running are attached in stats.txt; memory usage increases over time.

stats.txt (54.2 KB)

The log above was taken with the nvmemstat.py script, which is meant for Jetson; sorry about that. We can ignore the Hardware memory fields and just look at VmRSS, which reflects the SW (host) memory; it stayed constant during the test. I also captured GPU memory while running the test again, as attached; from that log you can see GPU memory increases to 3926 MiB in around 10 seconds and then stays constant at that value.
NOTE: there is another process occupying 2081 MiB, so the GPU memory taken by my run is around 1845 MiB.
log (5.5 MB)

I have tried using nvmemstat with an RTSP input; it's only a few minutes of data, but VmRSS is trending upwards. I have had no issues with GPU memory; it's only the overall host memory usage that keeps increasing. I'm not sure anymore whether the issue is Triton/GRPC or the RTSP input, since you don't seem to observe a memory usage increase with the sample video file, although that might just be too short to notice anything, as the memory usage increase doesn't seem to be constant, i.e. it doesn't happen on every frame.
nvmemstat_out.txt (65.3 KB)

edit: on a separate note, I’ve updated my actual application to DeepStream 6.1 but it didn’t seem to improve things.

I set up one RTSP stream, but I still cannot reproduce the memory leak. However, I only ran it for around 6 minutes; is that enough?
first:
PID: 7832 18:10:18 Hardware memory: Total: 0.0000 KiB Free: 0.0000 KiB Client: 0.0000 KiB VmSize: 31.9844 MiB VmRSS: 22.0859 MiB RssFile: 10.1758 MiB RssAnon: 11.9102 MiB lsof: 0
after around 6 minutes:
PID: 7832 18:16:20 Hardware memory: Total: 0.0000 KiB Free: 0.0000 KiB Client: 0.0000 KiB VmSize: 31.9844 MiB VmRSS: 22.0859 MiB RssFile: 10.1758 MiB RssAnon: 11.9102 MiB lsof: 0

In my tests that should be enough time to see some change in VmRSS. Is that the correct process? The amount of memory used looks very different from my logs: when I ran nvmemstat with -p all there were multiple processes whose command line contained "python3 deepstream_test1_rtsp_in_rtsp_out.py", but only one of them was the actual Python process.

Oh, I was monitoring start_test.py. Sorry, it should be deepstream_test1_rtsp_in_rtsp_out.py.

The host memory leak occurs not only with nvinferserver/tritonserver but also with nvinfer. We need to dig into which component, or whether the app itself, is causing the memory leak.

This is the log after changing nvinferserver to nvinfer:
first:
PID: 246 11:45:29 Hardware memory: Total: 0.0000 KiB Free: 0.0000 KiB Client: 0.0000 KiB VmSize: 11.1250 MiB VmRSS: 4.0469 MiB RssFile: 3.6914 MiB RssAnon: 364.0000 KiB lsof: 0
PID: 14458 11:45:29 Hardware memory: Total: 0.0000 KiB Free: 0.0000 KiB Client: 0.0000 KiB VmSize: 9983.1641 MiB VmRSS: 1750.4805 MiB RssFile: 530.6133 MiB

after around 50 minutes:
PID: 246 12:36:54 Hardware memory: Total: 0.0000 KiB Free: 0.0000 KiB Client: 0.0000 KiB VmSize: 11.1250 MiB VmRSS: 4.0469 MiB RssFile: 3.6914 MiB RssAnon: 364.0000 KiB lsof: 0
PID: 14458 12:36:54 Hardware memory: Total: 0.0000 KiB Free: 0.0000 KiB Client: 0.0000 KiB VmSize: 9983.1641 MiB VmRSS: 1771.2422 MiB RssFile: 530.6133 MiB RssAnon: 0.0000 KiB lsof: 0

Thank you for verifying; at least this means it's not an issue with my environment.

I have also noticed that it's not entirely because of Triton, but switching to nvinfer did lower the leak rate for my custom application: with nvinferserver + GRPC it leaked about 200 MB per 5 minutes across 8 parallel processes, while with nvinfer it dropped to about 200 MB per 2 hours.

In the end I couldn’t pinpoint the exact cause, so I’ll probably end up restarting the application container once a day, but if Nvidia could find a fix or workaround it’d be greatly appreciated.