TensorRT model memory usage in NvInfer vs NvInferServer plugin

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU)
NVIDIA RTX 2060
• DeepStream Version
6.2 (from the DeepStream 6.2 Docker image)
• JetPack Version (valid for Jetson only)
• TensorRT Version
8.5.2.2
• NVIDIA GPU Driver Version (valid for GPU only)
525.105.17
• Issue Type( questions, new requirements, bugs)
We are currently migrating our models from the NvInfer plugin to the NvInferServer plugin (for some specific data processing using the Python backend). We have noticed that the GPU memory consumption of NvInfer and NvInferServer is not consistent, even though we are using the same TensorRT model and a similar configuration (same batch size = 30, same image preprocessing, same custom lib, same input dims = 3x768x768).

To be more specific, when we use the NvInfer plugin to load our model, GPU memory consumption is about 1194 MB. When we switch to NvInferServer for the same model, consumption increases to 2124 MB, almost double, even though our configuration is similar. The image attached below shows the difference in usage between the two plugins.

Is there a difference between how NvInfer loads the engine and how Triton Server loads it through the TensorRT backend that would incur different memory usage?

• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)
We are using a typical ONNX-based model that we cannot disclose, but any ONNX-based model should work. We convert the model using NvInfer's built-in conversion when we run the pipeline. After that, we copied the generated engine to the Triton Server model repository and used NvInferServer to call the TensorRT model. The attached configuration file includes all 3 configs needed to reproduce how we ran the model in the pipeline (sorry for putting everything in 1 file, new users only get to upload 1 link):
pgie_configs.txt (2.4 KB)
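
For reference, the Triton model repository we pointed NvInferServer at follows the usual Triton layout. Roughly (the model name and engine filename below are placeholders, since we cannot share the real ones):

triton_model_repo/
  our_model/
    config.pbtxt
    1/
      our_model_b30_gpu0.engine   # engine generated by the NvInfer run, copied here

and the config.pbtxt is a minimal tensorrt_plan entry along these lines:

name: "our_model"
platform: "tensorrt_plan"
max_batch_size: 30
default_model_filename: "our_model_b30_gpu0.engine"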

• Requirement details( This is for new requirement. Including the module name-for which plugin or for which sample application, the function description)

gst-nvinferserver is based on the Triton Inference Server, while gst-nvinfer is based on TensorRT directly. They are different SDKs with different architectures, so some difference in memory usage is expected.

For your case, please check:

  • Make sure gst-nvinfer and gst-nvinferserver point to the same TRT engine file. gst-nvinfer may rebuild the engine online, while gst-nvinferserver needs the engine pre-built offline. If they are not pointing to the same engine file, the perf/utilization numbers are different and not comparable. Make sure the engine has exactly the same batch-size and precision (the nvinferserver side of this is sketched after this list), e.g.

sample_apps/deepstream-test1/dstest1_pgie_config.yml

model-engine-file: /opt/nvidia/deepstream/deepstream-6.2/samples/triton_model_repo/Primary_Detector/1/resnet10.caffemodel_b30_gpu0_int8.engine

  • Make sure the two plugins' config files specify the same max batch size, since this determines how large a preprocessing buffer pool is allocated, e.g.:

sample_apps/deepstream-test1/dstest1_pgie_config.yml => batch-size: 1
sample_apps/deepstream-test1/dstest1_pgie_nvinferserver_config.txt => max_batch_size: 1

  • Triton also reserves some small memory pools for performance by default. If your model does not need them, you can reduce them or set them to 0, but this might affect your performance, e.g.:

model_repo {
  root: "../../../../samples/triton_model_repo"
  strict_model_config: true
  cuda_device_memory {
    device: 0
    memory_pool_byte_size: 32000000 # 32MB
  }
  pinned_memory_pool_byte_size: 32000000 # 32MB
}
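
For the first point, the nvinferserver side that consumes the same pre-built engine looks roughly like this (based on the deepstream-test1 sample config; field names follow the gst-nvinferserver proto, values here are illustrative). The backend section of dstest1_pgie_nvinferserver_config.txt selects the model from the Triton repository, and that model's config.pbtxt must name the same engine file via default_model_filename:

infer_config {
  unique_id: 1
  gpu_ids: [0]
  max_batch_size: 30
  backend {
    triton {
      model_name: "Primary_Detector"
      version: -1
      model_repo {
        root: "../../../../samples/triton_model_repo"
        strict_model_config: true
      }
    }
  }
}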

Are you still checking this issue?

  1. We can't reproduce this issue with deepstream-test1 using resnet10 or yolov4. Setting batch size to 30 and using the same engine, the GPU memory usage difference is not large, far less than 1 GB.
  2. Can you use deepstream-test1 and a public model to reproduce this issue? If you can still reproduce it, please provide the application, model and whole configuration.

Hey there, sorry for the late reply; we forgot to check back for replies after a while. We have reproduced the problem using the resnet10 model from the DeepStream SDK, and the issue seems to scale with the batch size used in the configuration.

Our steps to reproduce the issue with a public model: we used the ResNet10 Caffe model (Primary_Detector) provided in the DeepStream 6.2 Docker container, together with the deepstream-test1 sample app.

  1. First we set the model batch size to 128, let the nvinfer plugin generate the model engine, and ran the pipeline.
  2. We then copied the engine file into the Primary_Detector folder in triton_model_repo and pointed the configuration at the engine file, with the batch size set to 128 (the resulting repository layout is sketched below). We then ran the pipeline using the nvinferserver configuration.
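
After step 2, the model repository looked roughly like this (only the files relevant here are shown; the layout follows the standard Triton repository convention):

triton_model_repo/
  Primary_Detector/
    config.pbtxt                                  # max_batch_size and default_model_filename updated for the b128 engine
    1/
      resnet10.caffemodel_b128_gpu0_int8.engine   # engine generated by the nvinfer run, copied here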

The image attached below shows the difference between using nvinfer and nvinferserver: the GPU memory usage difference is almost 1 GB.

We only modified the batch size in the configuration for the model:

  • NvInfer:
property:
  gpu-id: 0
  model-file: ../../../../samples/models/Primary_Detector/resnet10.caffemodel
  proto-file: ../../../../samples/models/Primary_Detector/resnet10.prototxt
  model-engine-file: ../../../../samples/models/Primary_Detector/resnet10.caffemodel_b128_gpu0_int8.engine
  labelfile-path: ../../../../samples/models/Primary_Detector/labels.txt
  int8-calib-file: ../../../../samples/models/Primary_Detector/cal_trt.bin
  batch-size: 128
...
  • NvInferServer:
# dstest1_pgie_nvinferserver_config.txt in deepstream-test1
infer_config {
  unique_id: 1
  gpu_ids: [0]
  max_batch_size: 128
...
# config.pbtxt in triton_model_repo/Primary_Detector
name: "Primary_Detector"
platform: "tensorrt_plan"
max_batch_size: 128
default_model_filename: "resnet10.caffemodel_b128_gpu0_int8.engine"
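
For completeness, applying the memory-pool suggestion from earlier in the thread to this nvinferserver config would look roughly like the following (illustrative values only; as noted above, shrinking the pools may affect performance):

infer_config {
  ...
  backend {
    triton {
      model_name: "Primary_Detector"
      model_repo {
        root: "../../../../samples/triton_model_repo"
        strict_model_config: true
        cuda_device_memory {
          device: 0
          memory_pool_byte_size: 0   # shrink Triton's per-device CUDA memory pool
        }
        pinned_memory_pool_byte_size: 0   # shrink the pinned host memory pool
      }
    }
  }
}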

Since we need to run models with high batch sizes but are constrained by memory limits, we will stick to the NvInfer plugin for now. Thank you for the instructions.
