[urgent] FasterRCNN example - running slowly

• Hardware Platform (Jetson / GPU) Jetson Xavier NX
• DeepStream Version nvcr.io/nvidia/deepstream-l4t 5.0-20.07-samples cbc3f5fb67a5
• JetPack Version (valid for Jetson only) Jetpack 4.4 [L4T 32.4.3]
• TensorRT Version

  • CUDA: 10.2.89
  • cuDNN: 8.0.0.180
  • TensorRT: 7.1.3.0
  • Visionworks: 1.6.0.501
  • OpenCV: 4.1.1 compiled CUDA: NO
  • VPI: 0.3.7
  • Vulkan: 1.2.70

• How to reproduce the issue?

  1. Pull the deepstream-l4t:5.0 docker image and run it:
docker pull nvcr.io/nvidia/deepstream-l4t:5.0-20.07-samples
xhost +
sudo docker run -it --rm --net=host --runtime nvidia  -e DISPLAY=$DISPLAY -w /opt/nvidia/deepstream/deepstream-5.0 -v /tmp/.X11-unix/:/tmp/.X11-unix nvcr.io/nvidia/deepstream-l4t:5.0-20.07-samples
  2. Go to /opt/nvidia/deepstream/deepstream-5.0/sources/objectDetector_FasterRCNN.
  3. Follow the instructions described in the README file (sketched below).
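
For reference, the README steps amount to roughly the following (a hedged sketch; the exact make targets and file names are in the README inside the container, and the VGG16 caffemodel/prototxt must already be present in the sample directory):

export CUDA_VER=10.2                                      # CUDA version shipped with JetPack 4.4
make -C nvdsinfer_custom_impl_fasterRCNN                  # build the custom FasterRCNN output parser library
deepstream-app -c deepstream_app_config_fasterRCNN.txt    # run the sample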

• Issue Type
The FPS is very low. Please see the log below.

/opt/nvidia/deepstream/deepstream-5.0/deeptray/objectDetector_FasterRCNN# deepstream-app -c deepstream_app_config_fasterRCNN.txt
2020-10-06 23:59:34.277560: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
Warn: 'threshold' parameter has been deprecated. Use 'pre-cluster-threshold' instead.
Warn: 'threshold' parameter has been deprecated. Use 'pre-cluster-threshold' instead.

Using winsys: x11 
0:00:18.722548301    13     0x22ded890 INFO                 nvinfer gstnvinfer.cpp:619:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:1701> [UID = 1]: deserialized trt engine from :/opt/nvidia/deepstream/deepstream-5.0/deeptray/objectDetector_FasterRCNN/VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine
INFO: [Implicit Engine Info]: layers num: 5
0   INPUT  kFLOAT data            3x375x500       
1   INPUT  kFLOAT im_info         1x1x3           
2   OUTPUT kFLOAT rois            1x300x4         
3   OUTPUT kFLOAT bbox_pred       300x84x1x1      
4   OUTPUT kFLOAT cls_prob        300x21x1x1      

0:00:18.723044752    13     0x22ded890 INFO                 nvinfer gstnvinfer.cpp:619:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:1805> [UID = 1]: Use deserialized engine model: /opt/nvidia/deepstream/deepstream-5.0/deeptray/objectDetector_FasterRCNN/VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine
0:00:18.973798997    13     0x22ded890 INFO                 nvinfer gstnvinfer_impl.cpp:313:notifyLoadModelStatus:<primary_gie> [UID 1]: Load new model:/opt/nvidia/deepstream/deepstream-5.0/deeptray/objectDetector_FasterRCNN/config_infer_primary_fasterRCNN.txt sucessfully

Runtime commands:
	h: Print this help
	q: Quit

	p: Pause
	r: Resume


**PERF:  FPS 0 (Avg)	
**PERF:  0.00 (0.00)	
** INFO: <bus_callback:181>: Pipeline ready

Opening in BLOCKING MODE 
NvMMLiteOpen : Block : BlockType = 261 
NVMEDIA: Reading vendor.tegra.display-size : status: 6 
NvMMLiteBlockCreate : Block : BlockType = 261 
** INFO: <bus_callback:167>: Pipeline running

**PERF:  0.00 (0.00)	
**PERF:  4.45 (4.07)	
**PERF:  4.43 (4.04)	
**PERF:  4.50 (4.33)	
**PERF:  4.51 (4.25)	
**PERF:  4.36 (4.40)	
**PERF:  4.48 (4.33)	
**PERF:  4.52 (4.42)	
**PERF:  4.51 (4.37)	
**PERF:  4.48 (4.44)	
**PERF:  4.44 (4.40)	
**PERF:  4.51 (4.45)	
**PERF:  4.50 (4.41)	
**PERF:  4.51 (4.46)	
**PERF:  4.49 (4.43)	
**PERF:  4.48 (4.46)	
**PERF:  4.50 (4.43)	
**PERF:  4.49 (4.47)	
**PERF:  4.52 (4.44)	

**PERF:  FPS 0 (Avg)	
**PERF:  4.51 (4.47)	
**PERF:  4.51 (4.45)	
**PERF:  4.49 (4.47)	
**PERF:  4.49 (4.45)	
**PERF:  4.48 (4.48)	
**PERF:  4.51 (4.46)	
**PERF:  4.50 (4.48)	
**PERF:  4.50 (4.46)	
**PERF:  4.49 (4.48)	
**PERF:  4.49 (4.46)	
**PERF:  4.49 (4.48)	
**PERF:  4.48 (4.46)	
**PERF:  4.50 (4.48)	
**PERF:  4.49 (4.47)	
**PERF:  4.51 (4.48)	
**PERF:  4.49 (4.47)	
**PERF:  4.50 (4.48)	
**PERF:  4.51 (4.47)	
**PERF:  4.49 (4.48)	
**PERF:  4.49 (4.47)	

**PERF:  FPS 0 (Avg)	
**PERF:  4.50 (4.44)	
**PERF:  4.46 (4.45)	
**PERF:  4.48 (4.44)	
**PERF:  4.49 (4.45)	
**PERF:  4.46 (4.44)	
**PERF:  4.48 (4.44)	
**PERF:  4.42 (4.44)	
^C** ERROR: <_intr_handler:140>: User Interrupted.. 

**PERF:  4.48 (4.44)	
Quitting
App run successful

I need your help finding the issue; it’s quite urgent for us. Thanks.

I am also getting mean: 220.275 ms (end to end 220.341 ms) when I run:
/usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine --int8 --batch=1 --useSpinWait

/opt/nvidia/deepstream/deepstream-5.0/deeptray/objectDetector_FasterRCNN# /usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine --int8 --batch=1 --useSpinWait
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine --int8 --batch=1 --useSpinWait
[10/07/2020-00:32:15] [I] === Model Options ===
[10/07/2020-00:32:15] [I] Format: *
[10/07/2020-00:32:15] [I] Model: 
[10/07/2020-00:32:15] [I] Output:
[10/07/2020-00:32:15] [I] === Build Options ===
[10/07/2020-00:32:15] [I] Max batch: 1
[10/07/2020-00:32:15] [I] Workspace: 16 MB
[10/07/2020-00:32:15] [I] minTiming: 1
[10/07/2020-00:32:15] [I] avgTiming: 8
[10/07/2020-00:32:15] [I] Precision: FP32+INT8
[10/07/2020-00:32:15] [I] Calibration: Dynamic
[10/07/2020-00:32:15] [I] Safe mode: Disabled
[10/07/2020-00:32:15] [I] Save engine: 
[10/07/2020-00:32:15] [I] Load engine: VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine
[10/07/2020-00:32:15] [I] Builder Cache: Enabled
[10/07/2020-00:32:15] [I] NVTX verbosity: 0
[10/07/2020-00:32:15] [I] Inputs format: fp32:CHW
[10/07/2020-00:32:15] [I] Outputs format: fp32:CHW
[10/07/2020-00:32:15] [I] Input build shapes: model
[10/07/2020-00:32:15] [I] Input calibration shapes: model
[10/07/2020-00:32:15] [I] === System Options ===
[10/07/2020-00:32:15] [I] Device: 0
[10/07/2020-00:32:15] [I] DLACore: 
[10/07/2020-00:32:15] [I] Plugins:
[10/07/2020-00:32:15] [I] === Inference Options ===
[10/07/2020-00:32:15] [I] Batch: 1
[10/07/2020-00:32:15] [I] Input inference shapes: model
[10/07/2020-00:32:15] [I] Iterations: 10
[10/07/2020-00:32:15] [I] Duration: 3s (+ 200ms warm up)
[10/07/2020-00:32:15] [I] Sleep time: 0ms
[10/07/2020-00:32:15] [I] Streams: 1
[10/07/2020-00:32:15] [I] ExposeDMA: Disabled
[10/07/2020-00:32:15] [I] Spin-wait: Enabled
[10/07/2020-00:32:15] [I] Multithreading: Disabled
[10/07/2020-00:32:15] [I] CUDA Graph: Disabled
[10/07/2020-00:32:15] [I] Skip inference: Disabled
[10/07/2020-00:32:15] [I] Inputs:
[10/07/2020-00:32:15] [I] === Reporting Options ===
[10/07/2020-00:32:15] [I] Verbose: Disabled
[10/07/2020-00:32:15] [I] Averages: 10 inferences
[10/07/2020-00:32:15] [I] Percentile: 99
[10/07/2020-00:32:15] [I] Dump output: Disabled
[10/07/2020-00:32:15] [I] Profile: Disabled
[10/07/2020-00:32:15] [I] Export timing to JSON file: 
[10/07/2020-00:32:15] [I] Export output to JSON file: 
[10/07/2020-00:32:15] [I] Export profile to JSON file: 
[10/07/2020-00:32:15] [I] 
[10/07/2020-00:32:24] [I] Starting inference threads
[10/07/2020-00:32:28] [I] Warmup completed 1 queries over 200 ms
[10/07/2020-00:32:28] [I] Timing trace has 16 queries over 3.52547 s
[10/07/2020-00:32:28] [I] Trace averages of 10 runs:
[10/07/2020-00:32:28] [I] Average on 10 runs - GPU latency: 218.618 ms - Host latency: 218.739 ms (end to end 218.838 ms, enqueue 1.46477 ms)
[10/07/2020-00:32:28] [I] Host Latency
[10/07/2020-00:32:28] [I] min: 212.803 ms (end to end 212.809 ms)
[10/07/2020-00:32:28] [I] max: 236.222 ms (end to end 236.231 ms)
[10/07/2020-00:32:28] [I] mean: 220.275 ms (end to end 220.341 ms)
[10/07/2020-00:32:28] [I] median: 219.229 ms (end to end 219.358 ms)
[10/07/2020-00:32:28] [I] percentile: 236.222 ms at 99% (end to end 236.231 ms at 99%)
[10/07/2020-00:32:28] [I] throughput: 4.5384 qps
[10/07/2020-00:32:28] [I] walltime: 3.52547 s
[10/07/2020-00:32:28] [I] Enqueue Time
[10/07/2020-00:32:28] [I] min: 1.19866 ms
[10/07/2020-00:32:28] [I] max: 2.07568 ms
[10/07/2020-00:32:28] [I] median: 1.56665 ms
[10/07/2020-00:32:28] [I] GPU Compute
[10/07/2020-00:32:28] [I] min: 212.686 ms
[10/07/2020-00:32:28] [I] max: 236.108 ms
[10/07/2020-00:32:28] [I] mean: 220.156 ms
[10/07/2020-00:32:28] [I] median: 219.107 ms
[10/07/2020-00:32:28] [I] percentile: 236.108 ms at 99%
[10/07/2020-00:32:28] [I] total compute time: 3.52249 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine --int8 --batch=1 --useSpinWait

So is what I am getting the normal speed for this model?

Hi,

We are going to check this.

In case you haven’t done so already, have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
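
To confirm the settings took effect, you can query the current mode and clocks (standard JetPack tools; please double-check availability on your image):

$ sudo nvpmodel -q            # prints the active power mode
$ sudo jetson_clocks --show   # prints the current clock configuration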

Thanks.

Do you mean

sudo nvpmodel -m 2    # for Xavier NX

right?

Hi,

For maximal GPU/CPU clock rate, it should be mode 0:
https://docs.nvidia.com/jetson/l4t/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide%2Fpower_management_jetson_xavier.html%23wwpID0E0NO0HA

Mode 2 is intended for CPU-intensive jobs, since it enables all the CPU cores.

After testing, we recommend using INT8 for inference on Xavier NX.
We can get around 15 fps for objectDetector_FasterRCNN.

Please remember to update config_infer_primary_fasterRCNN.txt with the following:

[property]
...
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=1
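
Since nvinfer serializes a separate engine per precision, it also helps to delete the previously generated *_fp32.engine (or point model-engine-file at the INT8 engine) so the old FP32 engine is not deserialized again. A hedged sketch: the _int8 file name below follows the usual nvinfer naming convention and only exists after the first INT8 build, and int8-calib-file is only needed if your setup requires a calibration table:

[property]
...
network-mode=1
# hypothetical file names, adjust to what nvinfer actually generates:
model-engine-file=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_int8.engine
# int8-calib-file=cal_trt.bin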

Please note that the inference precision is decided at engine build time.
In your trtexec test, the model runs inference in FP32 mode rather than INT8, because the loaded engine was built with FP32:

/usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine --int8 --batch=1 --useSpinWait
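
Note that when --loadEngine is used, the precision baked into the engine applies and the --int8 flag has no effect, so the numbers above measure the FP32 engine. To benchmark the INT8 path, load an INT8-built engine instead, for example the one DeepStream serializes after switching network-mode (the file name below is only the expected naming convention, not a file that exists yet):

/usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_int8.engine --batch=1 --useSpinWait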

Thanks.