TensorRT model memory usage in NvInfer vs NvInferServer plugin

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU)
NVIDIA RTX 2060
• DeepStream Version
6.2 (from the DeepStream 6.2 Docker image)
• JetPack Version (valid for Jetson only)
• TensorRT Version
8.5.2.2
• NVIDIA GPU Driver Version (valid for GPU only)
525.105.17
• Issue Type( questions, new requirements, bugs)
We are currently migrating our models from the NvInfer plugin to the NvInferServer plugin (for some specific data processing using the Python backend). We have noticed that the GPU memory consumption of NvInfer and NvInferServer is not consistent, even though we are using the same TensorRT model and a similar configuration (same batch size = 30, same image preprocessing, same custom lib, same input dims = 3x768x768).

To be more specific, when we use the NvInfer plugin to load our model, GPU memory consumption is about 1194 MB. When we switch to NvInferServer for the same model, consumption increases to 2124 MB, almost double, even though our configuration is similar. The image attached below shows the difference in usage between the two plugins.

Is there a difference between how NvInfer loads the engine and how Triton Server loads it through the TensorRT backend that would incur different memory usage?

• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
We are using a typical ONNX-based model that we cannot disclose, but any ONNX-based model should work. We convert the model using NvInfer's built-in conversion when we run the pipeline. After that, we copied the generated engine to the Triton Server model repository and used NvInferServer to call the TensorRT model. The attached configuration file includes all 3 configs needed to reproduce how we ran the model in the pipeline (sorry for putting everything in 1 file, new users only get to upload 1 link):
pgie_configs.txt (2.4 KB)
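
For reference, the Triton model repository we pointed NvInferServer at follows the usual Triton layout. Roughly (the model name and engine filename below are placeholders, since we cannot share the real ones):

triton_model_repo/
  our_model/
    config.pbtxt
    1/
      our_model_b30_gpu0.engine   # engine generated by the NvInfer run, copied here

and the config.pbtxt is a minimal tensorrt_plan entry along these lines:

name: "our_model"
platform: "tensorrt_plan"
max_batch_size: 30
default_model_filename: "our_model_b30_gpu0.engine"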

• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

gst-nvinferserver is based on the Triton Inference Server, while gst-nvinfer is based on TensorRT directly. They are different SDKs with different architectures, so some difference in memory usage is expected.

For your case, please check:

  • Make sure gst-nvinfer and gst-nvinferserver point to the same TRT engine file. gst-nvinfer may rebuild the engine online, while gst-nvinferserver needs the engine pre-built offline. If they are not pointing to the same engine file, the perf/utilization numbers are different and not comparable. Make sure the engine has exactly the same batch-size and precision (the nvinferserver side of this is sketched after this list), e.g.

sample_apps/deepstream-test1/dstest1_pgie_config.yml

model-engine-file: /opt/nvidia/deepstream/deepstream-6.2/samples/triton_model_repo/Primary_Detector/1/resnet10.caffemodel_b30_gpu0_int8.engine

  • Make sure the two plugins' config files specify the same max batch size, since this determines how large a preprocessing buffer pool is allocated, e.g.:

sample_apps/deepstream-test1/dstest1_pgie_config.yml => batch-size: 1
sample_apps/deepstream-test1/dstest1_pgie_nvinferserver_config.txt => max_batch_size: 1

  • Triton also reserves some small memory pools for performance by default. If your model does not need them, you can reduce them or set them to 0, but this might affect your performance, e.g.:

model_repo {
  root: "../../../../samples/triton_model_repo"
  strict_model_config: true
  cuda_device_memory {
    device: 0
    memory_pool_byte_size: 32000000 # 32MB
  }
  pinned_memory_pool_byte_size: 32000000 # 32MB
}
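
For the first point, the nvinferserver side that consumes the same pre-built engine looks roughly like this (based on the deepstream-test1 sample config; field names follow the gst-nvinferserver proto, values here are illustrative). The backend section of dstest1_pgie_nvinferserver_config.txt selects the model from the Triton repository, and that model's config.pbtxt must name the same engine file via default_model_filename:

infer_config {
  unique_id: 1
  gpu_ids: [0]
  max_batch_size: 30
  backend {
    triton {
      model_name: "Primary_Detector"
      version: -1
      model_repo {
        root: "../../../../samples/triton_model_repo"
        strict_model_config: true
      }
    }
  }
}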

Are you still checking this issue?

  1. We can't reproduce this issue with deepstream-test1 using resnet10 or yolov4. Setting batch size to 30 and using the same engine, the GPU memory usage difference is not large, far less than 1 GB.
  2. Can you use deepstream-test1 and a public model to reproduce this issue? If you can still reproduce it, please provide the application, model and whole configuration.

Hey there, sorry for the late reply; we forgot to check back for replies after a while. We have reproduced the problem using the resnet10 model from the DeepStream SDK, and the issue seems to scale with the batch size used in the configuration.

Our steps to reproduce the issue with a public model: we used the ResNet10 Caffe model (Primary_Detector) provided in the DeepStream 6.2 Docker container, together with the deepstream-test1 sample app.

  1. First we set the model batch size to 128, let the nvinfer plugin generate the model engine, and ran the pipeline.
  2. We then copied the engine file into the Primary_Detector folder in triton_model_repo and pointed the configuration at the engine file, with the batch size set to 128 (the resulting repository layout is sketched below). We then ran the pipeline using the nvinferserver configuration.
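
After step 2, the model repository looked roughly like this (only the files relevant here are shown; the layout follows the standard Triton repository convention):

triton_model_repo/
  Primary_Detector/
    config.pbtxt                                  # max_batch_size and default_model_filename updated for the b128 engine
    1/
      resnet10.caffemodel_b128_gpu0_int8.engine   # engine generated by the nvinfer run, copied here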

The image attached below shows the difference between using nvinfer and nvinferserver: the GPU memory usage difference is almost 1 GB.

We only modified the batch size in the configuration for the model:

  • NvInfer:
property:
  gpu-id: 0
  model-file: ../../../../samples/models/Primary_Detector/resnet10.caffemodel
  proto-file: ../../../../samples/models/Primary_Detector/resnet10.prototxt
  model-engine-file: ../../../../samples/models/Primary_Detector/resnet10.caffemodel_b128_gpu0_int8.engine
  labelfile-path: ../../../../samples/models/Primary_Detector/labels.txt
  int8-calib-file: ../../../../samples/models/Primary_Detector/cal_trt.bin
  batch-size: 128
...
  • NvInferServer:
# dstest1_pgie_nvinferserver_config.txt in deepstream-test1
infer_config {
  unique_id: 1
  gpu_ids: [0]
  max_batch_size: 128
...
# config.pbtxt in triton_model_repo/Primary_Detector
name: "Primary_Detector"
platform: "tensorrt_plan"
max_batch_size: 128
default_model_filename: "resnet10.caffemodel_b128_gpu0_int8.engine"
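
For completeness, applying the memory-pool suggestion from earlier in the thread to this nvinferserver config would look roughly like the following (illustrative values only; as noted above, shrinking the pools may affect performance):

infer_config {
  ...
  backend {
    triton {
      model_name: "Primary_Detector"
      model_repo {
        root: "../../../../samples/triton_model_repo"
        strict_model_config: true
        cuda_device_memory {
          device: 0
          memory_pool_byte_size: 0   # shrink Triton's per-device CUDA memory pool
        }
        pinned_memory_pool_byte_size: 0   # shrink the pinned host memory pool
      }
    }
  }
}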

Since we need to run models with high batch sizes but are constrained by memory limits, we will stick to the NvInfer plugin for now. Thank you for the instructions.
