[urgent] FasterRCNN example - running slowly

• Hardware Platform (Jetson / GPU) Jetson Xavier NX
• DeepStream Version nvcr.io/nvidia/deepstream-l4t 5.0-20.07-samples cbc3f5fb67a5
• JetPack Version (valid for Jetson only) Jetpack 4.4 [L4T 32.4.3]
• TensorRT Version

  • CUDA: 10.2.89
  • cuDNN: 8.0.0.180
  • TensorRT: 7.1.3.0
  • Visionworks: 1.6.0.501
  • OpenCV: 4.1.1 compiled CUDA: NO
  • VPI: 0.3.7
  • Vulkan: 1.2.70

• How to reproduce the issue?

  1. Pull the deepstream-l4t:5.0 docker image and run it:
docker pull nvcr.io/nvidia/deepstream-l4t:5.0-20.07-samples
xhost +
sudo docker run -it --rm --net=host --runtime nvidia  -e DISPLAY=$DISPLAY -w /opt/nvidia/deepstream/deepstream-5.0 -v /tmp/.X11-unix/:/tmp/.X11-unix nvcr.io/nvidia/deepstream-l4t:5.0-20.07-samples
  2. Go to /opt/nvidia/deepstream/deepstream-5.0/sources/objectDetector_FasterRCNN.
  3. Follow the instructions described in the README file (sketched below).
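
For reference, the README steps amount to roughly the following (a hedged sketch; the exact make targets and file names are in the README inside the container, and the VGG16 caffemodel/prototxt must already be present in the sample directory):

export CUDA_VER=10.2                                      # CUDA version shipped with JetPack 4.4
make -C nvdsinfer_custom_impl_fasterRCNN                  # build the custom FasterRCNN output parser library
deepstream-app -c deepstream_app_config_fasterRCNN.txt    # run the sample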

• Issue Type
The FPS is very low. Please see the log below.

/opt/nvidia/deepstream/deepstream-5.0/deeptray/objectDetector_FasterRCNN# deepstream-app -c deepstream_app_config_fasterRCNN.txt
2020-10-06 23:59:34.277560: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
Warn: 'threshold' parameter has been deprecated. Use 'pre-cluster-threshold' instead.
Warn: 'threshold' parameter has been deprecated. Use 'pre-cluster-threshold' instead.

Using winsys: x11 
0:00:18.722548301    13     0x22ded890 INFO                 nvinfer gstnvinfer.cpp:619:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::deserializeEngineAndBackend() <nvdsinfer_context_impl.cpp:1701> [UID = 1]: deserialized trt engine from :/opt/nvidia/deepstream/deepstream-5.0/deeptray/objectDetector_FasterRCNN/VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine
INFO: [Implicit Engine Info]: layers num: 5
0   INPUT  kFLOAT data            3x375x500       
1   INPUT  kFLOAT im_info         1x1x3           
2   OUTPUT kFLOAT rois            1x300x4         
3   OUTPUT kFLOAT bbox_pred       300x84x1x1      
4   OUTPUT kFLOAT cls_prob        300x21x1x1      

0:00:18.723044752    13     0x22ded890 INFO                 nvinfer gstnvinfer.cpp:619:gst_nvinfer_logger:<primary_gie> NvDsInferContext[UID 1]: Info from NvDsInferContextImpl::generateBackendContext() <nvdsinfer_context_impl.cpp:1805> [UID = 1]: Use deserialized engine model: /opt/nvidia/deepstream/deepstream-5.0/deeptray/objectDetector_FasterRCNN/VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine
0:00:18.973798997    13     0x22ded890 INFO                 nvinfer gstnvinfer_impl.cpp:313:notifyLoadModelStatus:<primary_gie> [UID 1]: Load new model:/opt/nvidia/deepstream/deepstream-5.0/deeptray/objectDetector_FasterRCNN/config_infer_primary_fasterRCNN.txt sucessfully

Runtime commands:
	h: Print this help
	q: Quit

	p: Pause
	r: Resume


**PERF:  FPS 0 (Avg)	
**PERF:  0.00 (0.00)	
** INFO: <bus_callback:181>: Pipeline ready

Opening in BLOCKING MODE 
NvMMLiteOpen : Block : BlockType = 261 
NVMEDIA: Reading vendor.tegra.display-size : status: 6 
NvMMLiteBlockCreate : Block : BlockType = 261 
** INFO: <bus_callback:167>: Pipeline running

**PERF:  0.00 (0.00)	
**PERF:  4.45 (4.07)	
**PERF:  4.43 (4.04)	
**PERF:  4.50 (4.33)	
**PERF:  4.51 (4.25)	
**PERF:  4.36 (4.40)	
**PERF:  4.48 (4.33)	
**PERF:  4.52 (4.42)	
**PERF:  4.51 (4.37)	
**PERF:  4.48 (4.44)	
**PERF:  4.44 (4.40)	
**PERF:  4.51 (4.45)	
**PERF:  4.50 (4.41)	
**PERF:  4.51 (4.46)	
**PERF:  4.49 (4.43)	
**PERF:  4.48 (4.46)	
**PERF:  4.50 (4.43)	
**PERF:  4.49 (4.47)	
**PERF:  4.52 (4.44)	

**PERF:  FPS 0 (Avg)	
**PERF:  4.51 (4.47)	
**PERF:  4.51 (4.45)	
**PERF:  4.49 (4.47)	
**PERF:  4.49 (4.45)	
**PERF:  4.48 (4.48)	
**PERF:  4.51 (4.46)	
**PERF:  4.50 (4.48)	
**PERF:  4.50 (4.46)	
**PERF:  4.49 (4.48)	
**PERF:  4.49 (4.46)	
**PERF:  4.49 (4.48)	
**PERF:  4.48 (4.46)	
**PERF:  4.50 (4.48)	
**PERF:  4.49 (4.47)	
**PERF:  4.51 (4.48)	
**PERF:  4.49 (4.47)	
**PERF:  4.50 (4.48)	
**PERF:  4.51 (4.47)	
**PERF:  4.49 (4.48)	
**PERF:  4.49 (4.47)	

**PERF:  FPS 0 (Avg)	
**PERF:  4.50 (4.44)	
**PERF:  4.46 (4.45)	
**PERF:  4.48 (4.44)	
**PERF:  4.49 (4.45)	
**PERF:  4.46 (4.44)	
**PERF:  4.48 (4.44)	
**PERF:  4.42 (4.44)	
^C** ERROR: <_intr_handler:140>: User Interrupted.. 

**PERF:  4.48 (4.44)	
Quitting
App run successful

I need your help finding the issue; it’s quite urgent for us. Thanks.

I am also getting mean: 220.275 ms (end to end 220.341 ms) when I run:
/usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine --int8 --batch=1 --useSpinWait

/opt/nvidia/deepstream/deepstream-5.0/deeptray/objectDetector_FasterRCNN# /usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine --int8 --batch=1 --useSpinWait
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine --int8 --batch=1 --useSpinWait
[10/07/2020-00:32:15] [I] === Model Options ===
[10/07/2020-00:32:15] [I] Format: *
[10/07/2020-00:32:15] [I] Model: 
[10/07/2020-00:32:15] [I] Output:
[10/07/2020-00:32:15] [I] === Build Options ===
[10/07/2020-00:32:15] [I] Max batch: 1
[10/07/2020-00:32:15] [I] Workspace: 16 MB
[10/07/2020-00:32:15] [I] minTiming: 1
[10/07/2020-00:32:15] [I] avgTiming: 8
[10/07/2020-00:32:15] [I] Precision: FP32+INT8
[10/07/2020-00:32:15] [I] Calibration: Dynamic
[10/07/2020-00:32:15] [I] Safe mode: Disabled
[10/07/2020-00:32:15] [I] Save engine: 
[10/07/2020-00:32:15] [I] Load engine: VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine
[10/07/2020-00:32:15] [I] Builder Cache: Enabled
[10/07/2020-00:32:15] [I] NVTX verbosity: 0
[10/07/2020-00:32:15] [I] Inputs format: fp32:CHW
[10/07/2020-00:32:15] [I] Outputs format: fp32:CHW
[10/07/2020-00:32:15] [I] Input build shapes: model
[10/07/2020-00:32:15] [I] Input calibration shapes: model
[10/07/2020-00:32:15] [I] === System Options ===
[10/07/2020-00:32:15] [I] Device: 0
[10/07/2020-00:32:15] [I] DLACore: 
[10/07/2020-00:32:15] [I] Plugins:
[10/07/2020-00:32:15] [I] === Inference Options ===
[10/07/2020-00:32:15] [I] Batch: 1
[10/07/2020-00:32:15] [I] Input inference shapes: model
[10/07/2020-00:32:15] [I] Iterations: 10
[10/07/2020-00:32:15] [I] Duration: 3s (+ 200ms warm up)
[10/07/2020-00:32:15] [I] Sleep time: 0ms
[10/07/2020-00:32:15] [I] Streams: 1
[10/07/2020-00:32:15] [I] ExposeDMA: Disabled
[10/07/2020-00:32:15] [I] Spin-wait: Enabled
[10/07/2020-00:32:15] [I] Multithreading: Disabled
[10/07/2020-00:32:15] [I] CUDA Graph: Disabled
[10/07/2020-00:32:15] [I] Skip inference: Disabled
[10/07/2020-00:32:15] [I] Inputs:
[10/07/2020-00:32:15] [I] === Reporting Options ===
[10/07/2020-00:32:15] [I] Verbose: Disabled
[10/07/2020-00:32:15] [I] Averages: 10 inferences
[10/07/2020-00:32:15] [I] Percentile: 99
[10/07/2020-00:32:15] [I] Dump output: Disabled
[10/07/2020-00:32:15] [I] Profile: Disabled
[10/07/2020-00:32:15] [I] Export timing to JSON file: 
[10/07/2020-00:32:15] [I] Export output to JSON file: 
[10/07/2020-00:32:15] [I] Export profile to JSON file: 
[10/07/2020-00:32:15] [I] 
[10/07/2020-00:32:24] [I] Starting inference threads
[10/07/2020-00:32:28] [I] Warmup completed 1 queries over 200 ms
[10/07/2020-00:32:28] [I] Timing trace has 16 queries over 3.52547 s
[10/07/2020-00:32:28] [I] Trace averages of 10 runs:
[10/07/2020-00:32:28] [I] Average on 10 runs - GPU latency: 218.618 ms - Host latency: 218.739 ms (end to end 218.838 ms, enqueue 1.46477 ms)
[10/07/2020-00:32:28] [I] Host Latency
[10/07/2020-00:32:28] [I] min: 212.803 ms (end to end 212.809 ms)
[10/07/2020-00:32:28] [I] max: 236.222 ms (end to end 236.231 ms)
[10/07/2020-00:32:28] [I] mean: 220.275 ms (end to end 220.341 ms)
[10/07/2020-00:32:28] [I] median: 219.229 ms (end to end 219.358 ms)
[10/07/2020-00:32:28] [I] percentile: 236.222 ms at 99% (end to end 236.231 ms at 99%)
[10/07/2020-00:32:28] [I] throughput: 4.5384 qps
[10/07/2020-00:32:28] [I] walltime: 3.52547 s
[10/07/2020-00:32:28] [I] Enqueue Time
[10/07/2020-00:32:28] [I] min: 1.19866 ms
[10/07/2020-00:32:28] [I] max: 2.07568 ms
[10/07/2020-00:32:28] [I] median: 1.56665 ms
[10/07/2020-00:32:28] [I] GPU Compute
[10/07/2020-00:32:28] [I] min: 212.686 ms
[10/07/2020-00:32:28] [I] max: 236.108 ms
[10/07/2020-00:32:28] [I] mean: 220.156 ms
[10/07/2020-00:32:28] [I] median: 219.107 ms
[10/07/2020-00:32:28] [I] percentile: 236.108 ms at 99%
[10/07/2020-00:32:28] [I] total compute time: 3.52249 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine --int8 --batch=1 --useSpinWait

So is what I am getting the normal speed for this model?

Hi,

We are going to check this.

In case you haven’t done so already, have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
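
To confirm the settings took effect, you can query the current mode and clocks (standard JetPack tools; please double-check availability on your image):

$ sudo nvpmodel -q            # prints the active power mode
$ sudo jetson_clocks --show   # prints the current clock configuration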

Thanks.

Do you mean

sudo nvpmodel -m 2    # for Xavier NX

right?

Hi,

For maximal GPU/CPU clock rate, it should be mode 0:
https://docs.nvidia.com/jetson/l4t/index.html#page/Tegra%20Linux%20Driver%20Package%20Development%20Guide%2Fpower_management_jetson_xavier.html%23wwpID0E0NO0HA

Mode 2 is intended for CPU-intensive jobs, since it enables all the CPU cores.

After testing, we recommend using INT8 for inference on Xavier NX.
We can get around 15 fps for objectDetector_FasterRCNN.

Please remember to update config_infer_primary_fasterRCNN.txt with the following:

[property]
...
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode=1
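
Since nvinfer serializes a separate engine per precision, it also helps to delete the previously generated *_fp32.engine (or point model-engine-file at the INT8 engine) so the old FP32 engine is not deserialized again. A hedged sketch: the _int8 file name below follows the usual nvinfer naming convention and only exists after the first INT8 build, and int8-calib-file is only needed if your setup requires a calibration table:

[property]
...
network-mode=1
# hypothetical file names, adjust to what nvinfer actually generates:
model-engine-file=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_int8.engine
# int8-calib-file=cal_trt.bin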

Please note that the inference precision is decided at engine build time.
In your trtexec test, the model runs inference in FP32 mode rather than INT8, because the loaded engine was built with FP32:

/usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_fp32.engine --int8 --batch=1 --useSpinWait
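
Note that when --loadEngine is used, the precision baked into the engine applies and the --int8 flag has no effect, so the numbers above measure the FP32 engine. To benchmark the INT8 path, load an INT8-built engine instead, for example the one DeepStream serializes after switching network-mode (the file name below is only the expected naming convention, not a file that exists yet):

/usr/src/tensorrt/bin/trtexec --loadEngine=VGG16_faster_rcnn_final.caffemodel_b1_gpu0_int8.engine --batch=1 --useSpinWait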

Thanks.