Videosink playback has hiccups when using the nvinferserver plugin with a large interval

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU)
Jetson AGX Xavier and Jetson Xavier NX
• DeepStream Version
6.1.1
• JetPack Version (valid for Jetson only)
5.0.2
• Issue Type( questions, new requirements, bugs)
Playback is not smooth when using nvinferserver with a sufficiently large interval.
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)

gst-launch-1.0 -vvv uridecodebin uri=file:///pipeline/rawdata/CALIB-000044-01.MP4 ! \
m.sink_0 uridecodebin uri=file:///pipeline/rawdata/CALIB-000044-02.MP4 ! \
m.sink_1 nvstreammux name=m batch-size=2 config-file-path=/pipeline/configs/nvstreammux-config.txt ! \
nvmultistreamtiler rows=1 columns=2 width=3840 height=1080 ! \
tee name=t ! queue ! nvvideoconvert src-crop=0:0:1920:1080 ! "video/x-raw(memory:NVMM), format=NV12" ! \
m2.sink_0 t. ! queue ! nvvideoconvert src-crop=1920:0:1920:1080 ! "video/x-raw(memory:NVMM), format=NV12" ! \
m2.sink_1 nvstreammux name=m2 batch-size=2 config-file-path=/pipeline/configs/nvstreammux-config.txt ! \
nvvideoconvert output-buffers=30 ! capsfilter ! "video/x-raw(memory:NVMM), format=RGBA" ! \
nvinferserver batch-size=2 config-file-path=/pipeline/configs/triton-detector-config.txt ! \
queue2 ! \
queue min-threshold-buffers=5 ! \
nvmultistreamtiler rows=1 columns=2 width=1920 height=1080 ! \
nvvideoconvert ! nvdsosd ! nvegltransform ! nveglglessink sync=1

The pipeline above mostly runs smoothly but sometimes has hiccups, and if I change output-buffers to 50 it has major hiccups, QoS warnings and frame dropping.

The interval for nvinferserver is set to 15 and inference takes about 200 ms, so there should be more than enough time to buffer the queue for smooth playback. The pipeline runs without issue when the nvinferserver plugin is removed, but as soon as we add nvinferserver with the interval set to 15, playback isn't smooth anymore. There is a sweet spot at output-buffers=30 where it works nicely for a while but ultimately has hiccups.

The timing between nvinferserver input and sink input is consistently around 1 second, and that doesn't change between when it runs smoothly and when it doesn't. I observed that when it runs smoothly the GPU usage is between 30-80%, whereas when it is having hiccups the GPU usage swings to the extremes.

What explains the behavior that the pipeline can run smoothly and then hiccup out of nowhere, and why would simply changing output-buffers to 50 cause the pipeline to become unstable?

It seems 1 second / 200 ms = 5. Why do you say a 1/15 second interval is enough for smooth playback?

What is your Triton backend? Can you monitor the CPU and GPU usage while running the pipeline when the hiccups occur?

The backend is the TensorRT runtime (plan engine file).

With interval=15 the inference rate is about 2 fps, i.e. one inference every ~500 ms, which is more than the time it takes to do the inference. Also, output-buffers=30 allows 30 buffers to fill the last queue, or 1 second worth of buffers (probing buf_lvl confirms 30 buffers are queued); only 15 buffers, or 500 ms worth of buffers, should be needed to absorb the buffers between inference calls.
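For reference, a minimal sketch of one way to poll a queue's fill level with the GStreamer Python bindings (gst-python); the pipeline string and the queue name q_postinfer below are placeholders, not the exact elements from the pipeline above:

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# Placeholder pipeline; in practice the queue after nvinferserver would be
# given a name (e.g. "q_postinfer") so it can be looked up here.
pipeline = Gst.parse_launch(
    "videotestsrc is-live=true ! queue name=q_postinfer ! autovideosink sync=true"
)
queue = pipeline.get_by_name("q_postinfer")

def log_queue_level():
    # current-level-buffers / current-level-time are standard queue properties.
    buffers = queue.get_property("current-level-buffers")
    level_time = queue.get_property("current-level-time")
    print(f"queue level: {buffers} buffers ({level_time / Gst.SECOND:.3f} s)")
    return True  # keep the periodic callback alive

GLib.timeout_add(100, log_queue_level)  # poll every 100 ms
pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()

Logging the level with a timestamp makes it easy to see whether the queue actually drains during a hiccup or stays full the whole time.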

The CPU usage is about 30%; the jtop program was used to monitor the GPU. As mentioned, the GPU usage is about 40-80% when it runs smoothly, whereas when it has hiccups the GPU usage is mostly 0% but jumps to 100%.
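For longer runs it can also help to log the GPU load over time instead of watching jtop interactively; jetson-stats (the package that provides jtop) exposes a Python API. A rough sketch, assuming jetson-stats is installed (the exact keys in jetson.stats may differ between versions):

import time
from jtop import jtop  # jetson-stats Python API

with jtop() as jetson:
    # jetson.ok() waits until the next sample is available.
    while jetson.ok():
        stats = jetson.stats
        # 'GPU' holds the GPU utilization percentage in recent jetson-stats
        # releases; adjust the key name if your version differs.
        print(time.strftime("%H:%M:%S"), "GPU:", stats.get("GPU"))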

The configuration file is:

infer_config {
  unique_id: 5
  gpu_ids: [0]
  max_batch_size: 2
  backend {
    trt_is {
      model_name: "detector"
      version: -1
      model_repo {
        root: "/pipeline/model_repo"
        log_level: 2
      }
    }
  }

  preprocess {
    network_format: IMAGE_FORMAT_RGB
    tensor_order: TENSOR_ORDER_NONE
    maintain_aspect_ratio: 1
    frame_scaling_hw: FRAME_SCALING_HW_VIC
    normalize {
      scale_factor: 0.013998617031238794 #[0.01358644677154735, 0.01430803109413305, 0.014101373228035981]
      channel_offsets: [104.0136177, 114.0342201, 119.91659325]
    }
  }

  postprocess {
    labelfile_path: "/pipeline/model_repo/detector/labels.txt"
    other {}
  }

  extra {
    copy_input_to_host_buffers: false
  }

  custom_lib {
    path: "/opt/nvidia/deepstream/deepstream/lib/libnvds_infercustomparser.so"
  }
}
input_control {
  process_mode: PROCESS_MODE_FULL_FRAME
  interval: 15
}
output_control {
  output_tensor_meta: true
}

The model is generated this way:

trtexec --onnx=detector.onnx --fp16 --minShapes=input:1x3x448x800 --optShapes=input:2x3x448x800 --maxShapes=input:2x3x448x800 --workspace=4096 --saveEngine=detector.trt

Below is a matrix showing on which platforms this issue is reproduced.

Platform                             | DeepStream Version     | Issue Reproduced?
Jetson Xavier NX                     | 6.1.1 (JetPack 5.0.2)  | Yes
Jetson AGX Xavier                    | 6.1.1 (JetPack 5.0.2)  | Yes
Jetson Xavier NX                     | 6.0.1 (JetPack 4.6.2)  | Yes
RTX A4000 / i7-10700 CPU @ 2.90GHz   | 6.1.1                  | No
Jetson AGX Orin*                     | 6.1.1 (JetPack 5.0.2)  | No

It seems the issue happens when the inference takes much longer than a frame duration, even though it takes less than an interval duration. Are there considerations for this case, where we have:

frame duration << inference time < interval time

Are there limits we can set on how much of the GPU the network can use? I am assuming the stuttering happens because the GPU is also being contended for by upstream tasks.
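Plugging in the numbers from this thread (30 fps input, so roughly a 33 ms frame duration; about 180-200 ms batch-2 inference; interval=15, i.e. one inference roughly every 500 ms), the relation is approximately:

33 ms (frame duration) << 180-200 ms (inference time) < ~500 ms (interval time)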

Can you upload your nvstreammux configuration file? Also, we currently don't support cascading nvstreammux instances.

Have you tried nvinfer in your pipeline instead of nvinferserver?

We have updated the pipeline to keep only one nvstreammux:

gst-launch-1.0 -vvv uridecodebin uri=file:///rawdata/CALIB-000048-01.MP4 ! \
 nvvideoconvert ! nvcomp.sink_0 uridecodebin uri=file:///rawdata/CALIB-000048-02.MP4 ! \
 nvvideoconvert ! nvcomp.sink_1 nvcompositor name=nvcomp sink_0::xpos=0 sink_0::ypos=0 \
 sink_0::width=1920 sink_0::height=1080 sink_1::xpos=1920 sink_1::ypos=0 sink_1::width=1920 sink_1::height=1080 ! \
 tee name=t ! queue ! nvvideoconvert src-crop=0:0:1920:1080 ! "video/x-raw(memory:NVMM), format=NV12" ! \
 m2.sink_0 t. ! queue ! nvvideoconvert src-crop=1920:0:1920:1080 ! "video/x-raw(memory:NVMM), format=NV12" ! \
 m2.sink_1 nvstreammux name=m2 batch-size=2 config-file-path=config/nvstreammux-config.txt ! \
 nvvideoconvert output-buffers=30 ! capsfilter ! "video/x-raw(memory:NVMM), format=RGBA" ! \
 nvinferserver batch-size=2 config-file-path=config/triton-detector-config.txt ! \
 queue min-threshold-buffers=5 ! \
 nvmultistreamtiler rows=1 columns=2 width=1920 height=1080 ! \
 nvvideoconvert ! nvdsosd ! nvegltransform ! nveglglessink sync=0

The nvstreammux-config.txt


[property]
algorithm-type=1
adaptive-batching=1
enable-source-rate-control=0
max-same-source-frames=1
overall-max-fps-n=30
overall-max-fps-d=1
overall-min-fps-n=25
overall-min-fps-d=1
max-fps-control=0

We will try with nvinfer and will post back here on the results.

The only change made here is replacing the nvinferserver plugin with nvinfer.

gst-launch-1.0 -vvv uridecodebin uri=file:///rawdata/CALIB-000044-01.MP4 ! \
    nvvideoconvert ! nvcomp.sink_0 uridecodebin uri=file:///rawdata/CALIB-000044-02.MP4 ! \
    nvvideoconvert ! nvcomp.sink_1 nvcompositor name=nvcomp sink_0::xpos=0 sink_0::ypos=0 \
    sink_0::width=1920 sink_0::height=1080 sink_1::xpos=1920 sink_1::ypos=0 sink_1::width=1920 sink_1::height=1080 ! \
    tee name=t ! queue ! nvvideoconvert src-crop=0:0:1920:1080 ! "video/x-raw(memory:NVMM), format=NV12" ! \
    m2.sink_0 t. ! queue ! nvvideoconvert src-crop=1920:0:1920:1080 ! "video/x-raw(memory:NVMM), format=NV12" ! \
    m2.sink_1 nvstreammux name=m2 batch-size=2 config-file-path=/config/nvstreammux-config.txt ! \
    nvvideoconvert output-buffers=30 ! capsfilter ! "video/x-raw(memory:NVMM), format=RGBA" ! \
    nvinfer batch-size=2 config-file-path=/config/detector-config.txt \
    model-engine-file=/model_repo/detector/1/detector-fp16.trt ! \
    queue min-threshold-buffers=5 ! \
    nvmultistreamtiler rows=1 columns=2 width=1920 height=1080 ! \
    nvvideoconvert ! nvdsosd ! nvegltransform ! nveglglessink sync=1

Aside from the previous improvements made since the beginning of this thread, using nvinfer improved things further. For the AGX Xavier we are able to drop the interval down to 9 and still produce smooth output.

Interval | Issue reproduced? (nvinferserver) | Issue reproduced? (nvinfer)
10       | No                                | No
9        | Yes                               | No
8        | Yes                               | Yes

We are still wondering why it would still produce hiccups with interval=8; when we load and run the model using trtexec, the execution time is about 90 ms. Note that we are running the pipeline with batch-size 2. Assuming the other plugins do not utilize the GPU too heavily, dropping the interval further to 6 or 7 shouldn't be an issue. Is this a valid assumption?

# load and evaluate performance

/usr/src/tensorrt/bin/trtexec --loadEngine=detector-fp16.trt

[10/30/2022-05:57:04] [I] Trace averages of 10 runs:
[10/30/2022-05:57:04] [I] Average on 10 runs - GPU latency: 88.5116 ms - Host latency: 88.8958 ms (enqueue 3.07762 ms)
[10/30/2022-05:57:04] [I] Average on 10 runs - GPU latency: 88.7181 ms - Host latency: 89.0907 ms (enqueue 2.68256 ms)
[10/30/2022-05:57:04] [I] Average on 10 runs - GPU latency: 88.4477 ms - Host latency: 88.8301 ms (enqueue 2.589 ms)
[10/30/2022-05:57:04] [I]
[10/30/2022-05:57:04] [I] === Performance summary ===
[10/30/2022-05:57:04] [I] Throughput: 10.9439 qps
[10/30/2022-05:57:04] [I] Latency: min = 88.509 ms, max = 91.2951 ms, mean = 88.9378 ms, median = 88.606 ms, percentile(99%) = 91.2951 ms
[10/30/2022-05:57:04] [I] Enqueue Time: min = 2.29785 ms, max = 3.82034 ms, mean = 2.79621 ms, median = 2.75397 ms, percentile(99%) = 3.82034 ms
[10/30/2022-05:57:04] [I] H2D Latency: min = 0.269836 ms, max = 0.340118 ms, mean = 0.305797 ms, median = 0.30658 ms, percentile(99%) = 0.340118 ms
[10/30/2022-05:57:04] [I] GPU Compute Time: min = 88.1184 ms, max = 90.9471 ms, mean = 88.5569 ms, median = 88.2213 ms, percentile(99%) = 90.9471 ms
[10/30/2022-05:57:04] [I] D2H Latency: min = 0.0588379 ms, max = 0.0788574 ms, mean = 0.0751589 ms, median = 0.0753174 ms, percentile(99%) = 0.0788574 ms
[10/30/2022-05:57:04] [I] Total Host Walltime: 3.28952 s
[10/30/2022-05:57:04] [I] Total GPU Compute Time: 3.18805 s
[10/30/2022-05:57:04] [I] Explanations of the performance metrics are printed in the verbose logs.

I think we are narrowing this issue down: running with batch-size 2 takes about 180 ms, which would ideally allow us to run inference at about 5 fps, i.e. every 6 frames (interval=5).
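Checking the headroom on that estimate: at 30 fps a frame is about 33 ms, so interval=5 means one inference every 6 frames, i.e. a budget of about 6 x 33 ms ≈ 200 ms per inference against the ~180 ms of measured GPU compute time, which leaves very little slack for the rest of the GPU work in the pipeline.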

I will provide further updates on what the performance would be when taking into account other GPU workloads.

/usr/src/tensorrt/bin/trtexec --loadEngine=detector-fp16.trt --shapes=input:2x3x448x800 --fp16

[10/30/2022-20:09:38] [I] === Trace details ===
[10/30/2022-20:09:38] [I] Trace averages of 10 runs:
[10/30/2022-20:09:38] [I] Average on 10 runs - GPU latency: 173.852 ms - Host latency: 174.623 ms (enqueue 2.72653 ms)
[10/30/2022-20:09:38] [I] Average on 10 runs - GPU latency: 173.377 ms - Host latency: 174.061 ms (enqueue 2.31315 ms)
[10/30/2022-20:09:38] [I]
[10/30/2022-20:09:38] [I] === Performance summary ===
[10/30/2022-20:09:38] [I] Throughput: 5.42193 qps
[10/30/2022-20:09:38] [I] Latency: min = 172.029 ms, max = 182.471 ms, mean = 174.342 ms, median = 172.323 ms, percentile(99%) = 182.471 ms
[10/30/2022-20:09:38] [I] Enqueue Time: min = 1.98364 ms, max = 3.2677 ms, mean = 2.51984 ms, median = 2.39636 ms, percentile(99%) = 3.2677 ms
[10/30/2022-20:09:38] [I] H2D Latency: min = 0.520996 ms, max = 1.22608 ms, mean = 0.591442 ms, median = 0.560669 ms, percentile(99%) = 1.22608 ms
[10/30/2022-20:09:38] [I] GPU Compute Time: min = 171.318 ms, max = 181.78 ms, mean = 173.615 ms, median = 171.628 ms, percentile(99%) = 181.78 ms
[10/30/2022-20:09:38] [I] D2H Latency: min = 0.0991211 ms, max = 0.154785 ms, mean = 0.136017 ms, median = 0.136719 ms, percentile(99%) = 0.154785 ms
[10/30/2022-20:09:38] [I] Total Host Walltime: 3.68872 s
[10/30/2022-20:09:38] [I] Total GPU Compute Time: 3.47229 s
[10/30/2022-20:09:38] [W] * GPU compute time is unstable, with coefficient of variance = 1.76848%.
[10/30/2022-20:09:38] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[10/30/2022-20:09:38] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/30/2022-20:09:38] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --loadEngine=detector-fp16.trt --shapes=input:2x3x448x800 --fp16