Model runs much slower in the DeepStream pipeline

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): Jetson AGX Xavier
• DeepStream Version: 5.0 GA
• JetPack Version (valid for Jetson only): 4.4
• TensorRT Version: 7.1.3
• Issue Type( questions, new requirements, bugs): question

Measure model performance using trtexec:

#boost the clock
sudo nvpmodel -m 0
sudo jetson_clocks
#measure performance
/usr/src/tensorrt/bin/trtexec --loadEngine=./yolo_resnet18_epoch_008.etlt_b1_gpu0_fp16.engine

Output:


[10/14/2020-14:40:09] [I] === Inference Options ===
[10/14/2020-14:40:09] [I] Batch: 1
[10/14/2020-14:40:09] [I] Input inference shapes: model
[10/14/2020-14:40:09] [I] Iterations: 10
[10/14/2020-14:40:09] [I] Duration: 3s (+ 200ms warm up)
[10/14/2020-14:40:09] [I] Sleep time: 0ms
[10/14/2020-14:40:09] [I] Streams: 1
[10/14/2020-14:40:09] [I] ExposeDMA: Disabled
[10/14/2020-14:40:09] [I] Spin-wait: Disabled
[10/14/2020-14:40:09] [I] Multithreading: Disabled
[10/14/2020-14:40:09] [I] CUDA Graph: Disabled
[10/14/2020-14:40:09] [I] Skip inference: Disabled
[10/14/2020-14:40:09] [I] Inputs:
[10/14/2020-14:40:09] [I] === Reporting Options ===
[10/14/2020-14:40:09] [I] Verbose: Disabled
[10/14/2020-14:40:09] [I] Averages: 10 inferences
[10/14/2020-14:40:09] [I] Percentile: 99
[10/14/2020-14:40:09] [I] Dump output: Disabled
[10/14/2020-14:40:09] [I] Profile: Disabled
[10/14/2020-14:40:09] [I] Export timing to JSON file:
[10/14/2020-14:40:09] [I] Export output to JSON file:
[10/14/2020-14:40:09] [I] Export profile to JSON file:
[10/14/2020-14:40:09] [I]
[10/14/2020-14:40:11] [I] Starting inference threads
[10/14/2020-14:40:15] [I] Warmup completed 8 queries over 200 ms
[10/14/2020-14:40:15] [I] Timing trace has 297 queries over 3.01784 s
[10/14/2020-14:40:15] [I] Trace averages of 10 runs:
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 11.3305 ms - Host latency: 11.6071 ms (end to end 11.616 ms, enqueue 1.59527 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.77782 ms - Host latency: 10.001 ms (end to end 10.01 ms, enqueue 1.72518 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.84791 ms - Host latency: 10.0732 ms (end to end 10.0852 ms, enqueue 1.59473 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.87346 ms - Host latency: 10.099 ms (end to end 10.1084 ms, enqueue 1.52335 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.83589 ms - Host latency: 10.061 ms (end to end 10.0693 ms, enqueue 1.57678 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.85862 ms - Host latency: 10.0847 ms (end to end 10.0931 ms, enqueue 1.56196 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.89319 ms - Host latency: 10.1204 ms (end to end 10.1283 ms, enqueue 1.48432 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.92749 ms - Host latency: 10.154 ms (end to end 10.1646 ms, enqueue 1.4692 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.90415 ms - Host latency: 10.13 ms (end to end 10.1401 ms, enqueue 1.56622 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.85276 ms - Host latency: 10.0779 ms (end to end 10.0886 ms, enqueue 1.49984 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.90584 ms - Host latency: 10.1321 ms (end to end 10.1435 ms, enqueue 1.4979 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.90905 ms - Host latency: 10.1363 ms (end to end 10.1471 ms, enqueue 1.46542 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.9425 ms - Host latency: 10.1701 ms (end to end 10.1799 ms, enqueue 1.47678 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.98343 ms - Host latency: 10.2133 ms (end to end 10.2208 ms, enqueue 1.59906 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.90254 ms - Host latency: 10.1302 ms (end to end 10.14 ms, enqueue 1.58828 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.84609 ms - Host latency: 10.0714 ms (end to end 10.0783 ms, enqueue 1.5175 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.84589 ms - Host latency: 10.0698 ms (end to end 10.0801 ms, enqueue 1.47031 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.79431 ms - Host latency: 10.0169 ms (end to end 10.0267 ms, enqueue 1.52322 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.76464 ms - Host latency: 9.98668 ms (end to end 9.99467 ms, enqueue 1.4767 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.77678 ms - Host latency: 9.999 ms (end to end 10.0082 ms, enqueue 1.45371 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.77576 ms - Host latency: 9.99846 ms (end to end 10.0083 ms, enqueue 1.50857 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.82295 ms - Host latency: 10.047 ms (end to end 10.0573 ms, enqueue 1.46895 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.84458 ms - Host latency: 10.0704 ms (end to end 10.0778 ms, enqueue 1.42139 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.89853 ms - Host latency: 10.126 ms (end to end 10.135 ms, enqueue 1.46755 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.90371 ms - Host latency: 10.1299 ms (end to end 10.1386 ms, enqueue 1.45813 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.89175 ms - Host latency: 10.1183 ms (end to end 10.128 ms, enqueue 1.40037 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.93264 ms - Host latency: 10.1595 ms (end to end 10.1672 ms, enqueue 1.5532 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.97732 ms - Host latency: 10.2068 ms (end to end 10.2163 ms, enqueue 1.35178 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.97825 ms - Host latency: 10.2085 ms (end to end 10.2168 ms, enqueue 1.41948 ms)
[10/14/2020-14:40:15] [I] Host Latency
[10/14/2020-14:40:15] [I] min: 9.85403 ms (end to end 9.86771 ms)
[10/14/2020-14:40:15] [I] max: 15.4106 ms (end to end 15.4235 ms)
[10/14/2020-14:40:15] [I] mean: 10.1517 ms (end to end 10.161 ms)
[10/14/2020-14:40:15] [I] median: 10.1066 ms (end to end 10.1161 ms)
[10/14/2020-14:40:15] [I] percentile: 15.1415 ms at 99% (end to end 15.151 ms at 99%)
[10/14/2020-14:40:15] [I] throughput: 98.4147 qps
[10/14/2020-14:40:15] [I] walltime: 3.01784 s
[10/14/2020-14:40:15] [I] Enqueue Time
[10/14/2020-14:40:15] [I] min: 1.28564 ms
[10/14/2020-14:40:15] [I] max: 2.9407 ms
[10/14/2020-14:40:15] [I] median: 1.47015 ms
[10/14/2020-14:40:15] [I] GPU Compute
[10/14/2020-14:40:15] [I] min: 9.63342 ms
[10/14/2020-14:40:15] [I] max: 15.0558 ms
[10/14/2020-14:40:15] [I] mean: 9.92404 ms
[10/14/2020-14:40:15] [I] median: 9.88 ms
[10/14/2020-14:40:15] [I] percentile: 14.7866 ms at 99%
[10/14/2020-14:40:15] [I] total compute time: 2.94744 s
&&&& PASSED …

My understanding is: since the batch size is 1, the model takes ~10 ms per frame on average.
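This per-frame figure is consistent with the summary numbers reported above:

1 / 98.4147 qps ≈ 10.16 ms per query
mean GPU compute: 9.92 ms, mean host latency: 10.15 ms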

However, when I measure how long nvinfer takes to process a buffer of batch size 1, I get ~18 ms instead of something closer to 10 ms.

Pipeline: camera-source → video converter(s) → cap-filter → streammux → nvinfer → fakesink
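For reference, this is roughly the equivalent launch line (the actual pipeline is built programmatically in the app; element choices, caps, resolutions, and the config filename below are just placeholders for illustration):

gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),format=NV12,width=1920,height=1080,framerate=30/1' ! \
  nvvideoconvert ! 'video/x-raw(memory:NVMM),format=NV12' ! mux.sink_0 \
  nvstreammux name=mux batch-size=1 width=1920 height=1080 ! \
  nvinfer config-file-path=./pgie_yolov3_tlt_config.txt ! fakesink sync=false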
Dot file:

Yolo-config:

[property]
gpu-id=0
net-scale-factor=1.0
offsets=103.939;116.779;123.68
model-color-format=1
labelfile-path=./models/yolov3_custom/yolov3_labels.txt
tlt-encoded-model=./models/yolov3_custom/yolo_resnet18_epoch_008.etlt
tlt-model-key=key
model-engine-file=./models/yolov3_custom/yolo_resnet18_epoch_008.etlt_b1_gpu0_fp16.engine
#workspace-size is in MB, 1000MB = 1GB
workspace-size=5000
#channel;height;width;input-order, for input-order: 0=NCHW, 1=NHWC
#uff-input-dims DEPRECATED
#uff-input-dims=3;704;960;0
infer-dims=3;704;960
uff-input-order=0
uff-input-blob-name=Input
batch-size=1
#0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
num-detected-classes=2
interval=0
gie-unique-id=1
#is-classifier=0
#0: Detector, 1: Classifier, 2: Segmentation, 3: Instance Segmentation
network-type=0
#0: OpenCV groupRectangles(), 1: DBSCAN, 2: Non Maximum Suppression, 3: DBSCAN + NMS Hybrid, 4:No clustering
cluster-mode=2
output-blob-names=BatchedNMS
#use custom parser for yolov3, instead of the default custom parser
parse-bbox-func-name=NvDsInferParseCustomYOLOV3TLT
custom-lib-path=./nvdsinfer_customparser_yolov3_tlt/libnvds_infercustomparser_yolov3_tlt.so
#custom-lib-path=/opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_infercustomparser.so
[class-attrs-all]
pre-cluster-threshold=0.5
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0

I use pad probes to measure the processing time that nvinfer takes (i.e., the time from its sink pad to its src pad), also with the clocks boosted. Please see below for the code I use for the measurement; it is based on this answer:

typedef struct _latency_measure
{
  GstClockTime t1;
  GstClockTime total_time;
  guint count;
} latency_measure;

static GstPadProbeReturn start_buffer_probe(GstPad *pad, GstPadProbeInfo *info, gpointer u_data)
{
  latency_measure *latency = (latency_measure *)(u_data);
  latency->t1 = g_get_monotonic_time(); //microseconds
  return GST_PAD_PROBE_OK;
}

static GstPadProbeReturn end_buffer_probe(GstPad *pad, GstPadProbeInfo *info, gpointer u_data)
{
  latency_measure *latency = (latency_measure *)(u_data);
  latency->total_time += (g_get_monotonic_time() - latency->t1); //microseconds
  latency->count++;
  return GST_PAD_PROBE_OK;
}

int main(int argc, char *argv[])
{ 
  [...]

  latency_measure latency_measure = {.t1 = GST_CLOCK_TIME_NONE, .total_time = 0, .count = 0};
  GstPad *start_pad, *end_pad;
  start_pad = gst_element_get_static_pad(pgie, "sink");
  end_pad = gst_element_get_static_pad(pgie, "src");

  if(!start_pad){
    g_print("Unable to get start pad\n");
  } else {
    gst_pad_add_probe(start_pad, GST_PAD_PROBE_TYPE_BUFFER, start_buffer_probe, &latency_measure, NULL);
  }

  if (!end_pad){
    g_print("Unable to get end pad\n");
  } else {
    gst_pad_add_probe(end_pad, GST_PAD_PROBE_TYPE_BUFFER, end_buffer_probe, &latency_measure, NULL);
  }

  gst_object_unref(start_pad);
  gst_object_unref(end_pad);

  [...]
  
  // find the average processing time and convert from microseconds to milliseconds
  g_print("Average proctime: %.2f ms\n", latency_measure.total_time / (latency_measure.count * 1000.0));
  
  [...]
}
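For a closer comparison with trtexec's min/max figures, the same probe can also track the per-buffer spread. A minimal sketch, assuming the struct above is extended with two extra fields (latency_measure_ext, end_buffer_probe_ext, min_time, and max_time are my names, not part of the code above; initialize all fields to 0 before use):

typedef struct _latency_measure_ext
{
  GstClockTime t1;
  GstClockTime total_time;
  gint64 min_time;   // microseconds
  gint64 max_time;   // microseconds
  guint count;
} latency_measure_ext;

static GstPadProbeReturn end_buffer_probe_ext(GstPad *pad, GstPadProbeInfo *info, gpointer u_data)
{
  latency_measure_ext *latency = (latency_measure_ext *)(u_data);
  gint64 dt = g_get_monotonic_time() - latency->t1; // per-buffer time in microseconds
  latency->total_time += dt;
  latency->count++;
  // track the spread as well, so min/max can be compared with the trtexec report
  if (latency->count == 1 || dt < latency->min_time)
    latency->min_time = dt;
  if (dt > latency->max_time)
    latency->max_time = dt;
  return GST_PAD_PROBE_OK;
}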

My question is: why is there such a big difference between the model's performance outside the pipeline (~10 ms) and inside it (~18 ms)?

Hi @hyperlight,
In the GIE (nvinfer), besides the TensorRT inference itself, there is also some pre-processing (resizing & conversion to RGB planar at the network resolution) and post-processing, which consume some time.

Thanks!