Model runs much slower in the DeepStream pipeline

Please provide complete information as applicable to your setup.

• Hardware Platform (Jetson / GPU): Jetson AGX Xavier
• DeepStream Version: 5.0 GA
• JetPack Version (valid for Jetson only): 4.4
• TensorRT Version: 7.1.3
• Issue Type( questions, new requirements, bugs): question

Measure model performance using trtexec:

#boost the clock
sudo nvpmodel -m 0
sudo jetson_clocks
#measure performance
/usr/src/tensorrt/bin/trtexec --loadEngine=./yolo_resnet18_epoch_008.etlt_b1_gpu0_fp16.engine

Output:


[10/14/2020-14:40:09] [I] === Inference Options ===
[10/14/2020-14:40:09] [I] Batch: 1
[10/14/2020-14:40:09] [I] Input inference shapes: model
[10/14/2020-14:40:09] [I] Iterations: 10
[10/14/2020-14:40:09] [I] Duration: 3s (+ 200ms warm up)
[10/14/2020-14:40:09] [I] Sleep time: 0ms
[10/14/2020-14:40:09] [I] Streams: 1
[10/14/2020-14:40:09] [I] ExposeDMA: Disabled
[10/14/2020-14:40:09] [I] Spin-wait: Disabled
[10/14/2020-14:40:09] [I] Multithreading: Disabled
[10/14/2020-14:40:09] [I] CUDA Graph: Disabled
[10/14/2020-14:40:09] [I] Skip inference: Disabled
[10/14/2020-14:40:09] [I] Inputs:
[10/14/2020-14:40:09] [I] === Reporting Options ===
[10/14/2020-14:40:09] [I] Verbose: Disabled
[10/14/2020-14:40:09] [I] Averages: 10 inferences
[10/14/2020-14:40:09] [I] Percentile: 99
[10/14/2020-14:40:09] [I] Dump output: Disabled
[10/14/2020-14:40:09] [I] Profile: Disabled
[10/14/2020-14:40:09] [I] Export timing to JSON file:
[10/14/2020-14:40:09] [I] Export output to JSON file:
[10/14/2020-14:40:09] [I] Export profile to JSON file:
[10/14/2020-14:40:09] [I]
[10/14/2020-14:40:11] [I] Starting inference threads
[10/14/2020-14:40:15] [I] Warmup completed 8 queries over 200 ms
[10/14/2020-14:40:15] [I] Timing trace has 297 queries over 3.01784 s
[10/14/2020-14:40:15] [I] Trace averages of 10 runs:
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 11.3305 ms - Host latency: 11.6071 ms (end to end 11.616 ms, enqueue 1.59527 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.77782 ms - Host latency: 10.001 ms (end to end 10.01 ms, enqueue 1.72518 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.84791 ms - Host latency: 10.0732 ms (end to end 10.0852 ms, enqueue 1.59473 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.87346 ms - Host latency: 10.099 ms (end to end 10.1084 ms, enqueue 1.52335 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.83589 ms - Host latency: 10.061 ms (end to end 10.0693 ms, enqueue 1.57678 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.85862 ms - Host latency: 10.0847 ms (end to end 10.0931 ms, enqueue 1.56196 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.89319 ms - Host latency: 10.1204 ms (end to end 10.1283 ms, enqueue 1.48432 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.92749 ms - Host latency: 10.154 ms (end to end 10.1646 ms, enqueue 1.4692 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.90415 ms - Host latency: 10.13 ms (end to end 10.1401 ms, enqueue 1.56622 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.85276 ms - Host latency: 10.0779 ms (end to end 10.0886 ms, enqueue 1.49984 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.90584 ms - Host latency: 10.1321 ms (end to end 10.1435 ms, enqueue 1.4979 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.90905 ms - Host latency: 10.1363 ms (end to end 10.1471 ms, enqueue 1.46542 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.9425 ms - Host latency: 10.1701 ms (end to end 10.1799 ms, enqueue 1.47678 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.98343 ms - Host latency: 10.2133 ms (end to end 10.2208 ms, enqueue 1.59906 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.90254 ms - Host latency: 10.1302 ms (end to end 10.14 ms, enqueue 1.58828 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.84609 ms - Host latency: 10.0714 ms (end to end 10.0783 ms, enqueue 1.5175 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.84589 ms - Host latency: 10.0698 ms (end to end 10.0801 ms, enqueue 1.47031 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.79431 ms - Host latency: 10.0169 ms (end to end 10.0267 ms, enqueue 1.52322 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.76464 ms - Host latency: 9.98668 ms (end to end 9.99467 ms, enqueue 1.4767 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.77678 ms - Host latency: 9.999 ms (end to end 10.0082 ms, enqueue 1.45371 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.77576 ms - Host latency: 9.99846 ms (end to end 10.0083 ms, enqueue 1.50857 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.82295 ms - Host latency: 10.047 ms (end to end 10.0573 ms, enqueue 1.46895 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.84458 ms - Host latency: 10.0704 ms (end to end 10.0778 ms, enqueue 1.42139 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.89853 ms - Host latency: 10.126 ms (end to end 10.135 ms, enqueue 1.46755 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.90371 ms - Host latency: 10.1299 ms (end to end 10.1386 ms, enqueue 1.45813 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.89175 ms - Host latency: 10.1183 ms (end to end 10.128 ms, enqueue 1.40037 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.93264 ms - Host latency: 10.1595 ms (end to end 10.1672 ms, enqueue 1.5532 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.97732 ms - Host latency: 10.2068 ms (end to end 10.2163 ms, enqueue 1.35178 ms)
[10/14/2020-14:40:15] [I] Average on 10 runs - GPU latency: 9.97825 ms - Host latency: 10.2085 ms (end to end 10.2168 ms, enqueue 1.41948 ms)
[10/14/2020-14:40:15] [I] Host Latency
[10/14/2020-14:40:15] [I] min: 9.85403 ms (end to end 9.86771 ms)
[10/14/2020-14:40:15] [I] max: 15.4106 ms (end to end 15.4235 ms)
[10/14/2020-14:40:15] [I] mean: 10.1517 ms (end to end 10.161 ms)
[10/14/2020-14:40:15] [I] median: 10.1066 ms (end to end 10.1161 ms)
[10/14/2020-14:40:15] [I] percentile: 15.1415 ms at 99% (end to end 15.151 ms at 99%)
[10/14/2020-14:40:15] [I] throughput: 98.4147 qps
[10/14/2020-14:40:15] [I] walltime: 3.01784 s
[10/14/2020-14:40:15] [I] Enqueue Time
[10/14/2020-14:40:15] [I] min: 1.28564 ms
[10/14/2020-14:40:15] [I] max: 2.9407 ms
[10/14/2020-14:40:15] [I] median: 1.47015 ms
[10/14/2020-14:40:15] [I] GPU Compute
[10/14/2020-14:40:15] [I] min: 9.63342 ms
[10/14/2020-14:40:15] [I] max: 15.0558 ms
[10/14/2020-14:40:15] [I] mean: 9.92404 ms
[10/14/2020-14:40:15] [I] median: 9.88 ms
[10/14/2020-14:40:15] [I] percentile: 14.7866 ms at 99%
[10/14/2020-14:40:15] [I] total compute time: 2.94744 s
&&&& PASSED …

My understanding is: since the batch size is 1, the model takes ~10 ms per frame on average.
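This per-frame figure is consistent with the summary numbers reported above:

1 / 98.4147 qps ≈ 10.16 ms per query
mean GPU compute: 9.92 ms, mean host latency: 10.15 ms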

However, when I measure how long nvinfer takes to process a buffer of batch size 1, I get ~18 ms instead of something closer to 10 ms.

Pipeline: camera-source → video converter(s) → cap-filter → streammux → nvinfer → fakesink
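For reference, this is roughly the equivalent launch line (the actual pipeline is built programmatically in the app; element choices, caps, resolutions, and the config filename below are just placeholders for illustration):

gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM),format=NV12,width=1920,height=1080,framerate=30/1' ! \
  nvvideoconvert ! 'video/x-raw(memory:NVMM),format=NV12' ! mux.sink_0 \
  nvstreammux name=mux batch-size=1 width=1920 height=1080 ! \
  nvinfer config-file-path=./pgie_yolov3_tlt_config.txt ! fakesink sync=false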
Dot file:

Yolo-config:

[property]
gpu-id=0
net-scale-factor=1.0
offsets=103.939;116.779;123.68
model-color-format=1
labelfile-path=./models/yolov3_custom/yolov3_labels.txt
tlt-encoded-model=./models/yolov3_custom/yolo_resnet18_epoch_008.etlt
tlt-model-key=key
model-engine-file=./models/yolov3_custom/yolo_resnet18_epoch_008.etlt_b1_gpu0_fp16.engine
#workspace-size is in MB, 1000MB = 1GB
workspace-size=5000
#channel;height;width;input-order, for input-order: 0=NCHW, 1=NHWC
#uff-input-dims DEPRECATED
#uff-input-dims=3;704;960;0
infer-dims=3;704;960
uff-input-order=0
uff-input-blob-name=Input
batch-size=1
#0=FP32, 1=INT8, 2=FP16 mode
network-mode=2
num-detected-classes=2
interval=0
gie-unique-id=1
#is-classifier=0
#0: Detector, 1: Classifier, 2: Segmentation, 3: Instance Segmentation
network-type=0
#0: OpenCV groupRectangles(), 1: DBSCAN, 2: Non Maximum Suppression, 3: DBSCAN + NMS Hybrid, 4:No clustering
cluster-mode=2
output-blob-names=BatchedNMS
#use custom parser for yolov3, instead of the default custom parser
parse-bbox-func-name=NvDsInferParseCustomYOLOV3TLT
custom-lib-path=./nvdsinfer_customparser_yolov3_tlt/libnvds_infercustomparser_yolov3_tlt.so
#custom-lib-path=/opt/nvidia/deepstream/deepstream-5.0/lib/libnvds_infercustomparser.so
[class-attrs-all]
pre-cluster-threshold=0.5
roi-top-offset=0
roi-bottom-offset=0
detected-min-w=0
detected-min-h=0
detected-max-w=0
detected-max-h=0

I use pad probes to measure the processing time that nvinfer takes (i.e., the time from its sink pad to its src pad), also with the clocks boosted. Please see below for the code I use for the measurement; it is based on this answer:

typedef struct _latency_measure
{
  GstClockTime t1;
  GstClockTime total_time;
  guint count;
} latency_measure;

static GstPadProbeReturn start_buffer_probe(GstPad *pad, GstPadProbeInfo *info, gpointer u_data)
{
  latency_measure *latency = (latency_measure *)(u_data);
  latency->t1 = g_get_monotonic_time(); //microseconds
  return GST_PAD_PROBE_OK;
}

static GstPadProbeReturn end_buffer_probe(GstPad *pad, GstPadProbeInfo *info, gpointer u_data)
{
  latency_measure *latency = (latency_measure *)(u_data);
  latency->total_time += (g_get_monotonic_time() - latency->t1); //microseconds
  latency->count++;
  return GST_PAD_PROBE_OK;
}

int main(int argc, char *argv[])
{ 
  [...]

  latency_measure latency_measure = {.t1 = GST_CLOCK_TIME_NONE, .total_time = 0, .count = 0};
  GstPad *start_pad, *end_pad;
  start_pad = gst_element_get_static_pad(pgie, "sink");
  end_pad = gst_element_get_static_pad(pgie, "src");

  if(!start_pad){
    g_print("Unable to get start pad\n");
  } else {
    gst_pad_add_probe(start_pad, GST_PAD_PROBE_TYPE_BUFFER, start_buffer_probe, &latency_measure, NULL);
  }

  if (!end_pad){
    g_print("Unable to get end pad\n");
  } else {
    gst_pad_add_probe(end_pad, GST_PAD_PROBE_TYPE_BUFFER, end_buffer_probe, &latency_measure, NULL);
  }

  gst_object_unref(start_pad);
  gst_object_unref(end_pad);

  [...]
  
  // find the average processing time and convert from microseconds to milliseconds
  g_print("Average proctime: %.2f ms\n", latency_measure.total_time / (latency_measure.count * 1000.0));
  
  [...]
}
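For a closer comparison with trtexec's min/max figures, the same probe can also track the per-buffer spread. A minimal sketch, assuming the struct above is extended with two extra fields (latency_measure_ext, end_buffer_probe_ext, min_time, and max_time are my names, not part of the code above; initialize all fields to 0 before use):

typedef struct _latency_measure_ext
{
  GstClockTime t1;
  GstClockTime total_time;
  gint64 min_time;   // microseconds
  gint64 max_time;   // microseconds
  guint count;
} latency_measure_ext;

static GstPadProbeReturn end_buffer_probe_ext(GstPad *pad, GstPadProbeInfo *info, gpointer u_data)
{
  latency_measure_ext *latency = (latency_measure_ext *)(u_data);
  gint64 dt = g_get_monotonic_time() - latency->t1; // per-buffer time in microseconds
  latency->total_time += dt;
  latency->count++;
  // track the spread as well, so min/max can be compared with the trtexec report
  if (latency->count == 1 || dt < latency->min_time)
    latency->min_time = dt;
  if (dt > latency->max_time)
    latency->max_time = dt;
  return GST_PAD_PROBE_OK;
}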

My question is: why is there such a big difference between the model's performance outside the pipeline (~10 ms) and inside it (~18 ms)?

Hi @hyperlight,
In the GIE (nvinfer), besides the TensorRT inference itself, there is also some pre-processing (resizing & conversion to RGB planar at the network resolution) and post-processing, which consume some time.

Thanks!