ONNX model with Jetson-Inference using GPU

foreverneilyoung · April 15, 2021, 7:20pm

Holy sh… If this was like so in your tutorial, then yes… let me check

foreverneilyoung · April 15, 2021, 7:21pm

I didn’t provide a batch-size, so it has been 1 for sure… Will check that.

foreverneilyoung · April 15, 2021, 7:32pm

Yepp. This was the reason. The engine was re-created once I had re-exported the ONNX model with batch-size=3. But this wasn’t the reason for the slow inference. The inference rate has only increased by one frame per camera, so all 3 cams are now running at 15 fps. And that with an MJPEG capture at 640x480.

Unfortunately this is a disappointing result after all these efforts. I would have bet, it would go through the ceiling…

EDIT: @dusty_nv Well, I had really hoped it was the parser that made it so slow. But even if I return nothing from it, the inference rate is 14.8 per cam… This pretty much contradicts everything I was trying to achieve…

What a pity…

dusty_nv · April 15, 2021, 7:55pm

What is the rate the cameras run simultaneously without inference?

Offhand I recall that the 9-class model has higher performance than that.

foreverneilyoung · April 15, 2021, 8:00pm

I’m pretty sure they will show up with a plain 30 fps each, which is the capture rate. I cannot check it right now, since I would have to refactor a good part of my code. But I just switched the model from ONNX to the default resnet10 caffemodel: full inference for 4 classes (person, car, bicycle, roadsign) runs at exactly 28.4 fps on each camera with 3 cameras attached.

EDIT: I will re-train the model tomorrow to 1 or 2 fruits only. I suppose it will be faster then… For now I’m too disappointed…

EDIT2: I could give your 100 epoch a shot…

EDIT3: No, this has for sure been created for batch-size=1, I guess… Will not work for comparison.

EDIT4: Also resnet34-peoplenet runs at 25.4 each.

foreverneilyoung · April 15, 2021, 8:12pm

The strange thing is also that I now HAVE to use 3 cameras. Each attempt to go back to just 1 or 2 ends in an error. Below is the output when using a “batch-size=3” engine with just 1 camera. Switching back and forth works fine with the other models, without any need to use another engine.

ERROR: [TRT]: Transpose_186: reshaping failed for tensor: 432
ERROR: [TRT]: shapeMachine.cpp (160) - Shape Error in executeReshape: reshape would change volume
ERROR: [TRT]: Instruction: RESHAPE{1 24 1 1} {3 1 1 24}
ERROR: Failed to enqueue trt inference batch
ERROR: Infer context enqueue buffer failed, nvinfer error:NVDSINFER_TENSORRT_ERROR
0:00:05.001359766 29388      0x354cb70 WARN                 nvinfer gstnvinfer.cpp:1225:gst_nvinfer_input_queue_loop:<primary-inference> error: Failed to queue input batch for inferencing
2021-04-15 22:10:08,503 inference.py    DEBUG   : Stream 0, FPS: 0.0
Error: gst-stream-error-quark: Failed to queue input batch for inferencing (1): /dvs/git/dirty/git-master_linux/deepstream/sdk/src/gst-plugins/gst-nvinfer/gstnvinfer.cpp(1225): gst_nvinfer_input_queue_loop (): /GstPipeline:pipeline0/GstNvInfer:primary-inference
ERROR: [TRT]: Transpose_186: reshaping failed for tensor: 432
ERROR: [TRT]: shapeMachine.cpp (160) - Shape Error in executeReshape: reshape would change volume
ERROR: [TRT]: Instruction: RESHAPE{1 24 1 1} {3 1 1 24}
ERROR: Failed to enqueue trt inference batch
ERROR: Infer context enqueue buffer failed, nvinfer error:NVDSINFER_TENSORRT_ERROR
0:00:05.019086167 29388      0x354cb70 WARN                 nvinfer gstnvinfer.cpp:1225:gst_nvinfer_input_queue_loop:<primary-inference> error: Failed to queue input batch for inferencing
ERROR: [TRT]: Transpose_186: reshaping failed for tensor: 432
ERROR: [TRT]: shapeMachine.cpp (160) - Shape Error in executeReshape: reshape would change volume
ERROR: [TRT]: Instruction: RESHAPE{1 24 1 1} {3 1 1 24}
ERROR: Failed to enqueue trt inference batch
ERROR: Infer context enqueue buffer failed, nvinfer error:NVDSINFER_TENSORRT_ERROR
0:00:05.057258980 29388      0x354cb70 WARN                 nvinfer gstnvinfer.cpp:1225:gst_nvinfer_input_queue_loop:<primary-inference> error: Failed to queue input batch for inferencing
ERROR: [TRT]: Transpose_186: reshaping failed for tensor: 432
ERROR: [TRT]: shapeMachine.cpp (160) - Shape Error in executeReshape: reshape would change volume
ERROR: [TRT]: Instruction: RESHAPE{1 24 1 1} {3 1 1 24}
ERROR: Failed to enqueue trt inference batch
ERROR: Infer context enqueue buffer failed, nvinfer error:NVDSINFER_TENSORRT_ERROR
0:00:05.087307282 29388      0x354cb70 WARN                 nvinfer gstnvinfer.cpp:1225:gst_nvinfer_input_queue_loop:<primary-inference> error: Failed to queue input batch for inferencing
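The `RESHAPE{1 24 1 1} {3 1 1 24}` line can be read as an element-count mismatch: the tensor arriving from a single camera holds 24 values, while the target shape (presumably fixed when the model was exported with batch-size=3) needs 72. A minimal sketch of that check in plain Python; the helper names are made up for illustration and this is not TensorRT code:

```python
from functools import reduce
from operator import mul

def volume(shape):
    """Number of elements in a tensor of the given shape."""
    return reduce(mul, shape, 1)

def reshape_is_valid(src_shape, dst_shape):
    """A reshape is only legal if both shapes hold the same number of elements."""
    return volume(src_shape) == volume(dst_shape)

# Shapes copied from the TensorRT error message above.
actual = (1, 24, 1, 1)    # what one camera delivers
expected = (3, 1, 1, 24)  # what the batch-size=3 engine wants to reshape to

print(volume(actual), volume(expected))           # 24 72
print(reshape_is_valid(actual, expected))         # False
print(reshape_is_valid((3, 24, 1, 1), expected))  # True
```

With three cameras attached the source tensor would carry 3 x 24 values and the volumes match, which fits the observation that the engine only runs with exactly 3 cameras.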

dusty_nv · April 15, 2021, 8:22pm

Those may have been using dynamic axes for the batch dimension, so they can change the batch size dynamically. I thought the batch size could be changed as long as it was less than the maximum the engine was built with, but that appears not to work in this case. Without digging into the dynamic axes in PyTorch, the easiest would probably be to export the model three times, once for each batch size you want.
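If the model is exported once per batch size, the application can pick the matching file from the number of attached cameras. A small sketch of that lookup, assuming hypothetical file names: only `ssd-mobilenet-b1.onnx` appears in this thread, the b2/b3 names just follow the same pattern.

```python
# Hypothetical file names; adjust to whatever the exports are actually called.
MODEL_BY_BATCH = {
    1: "ssd-mobilenet-b1.onnx",
    2: "ssd-mobilenet-b2.onnx",
    3: "ssd-mobilenet-b3.onnx",
}

def model_for_cameras(num_cameras):
    """Return the ONNX file exported with a batch size equal to the camera count."""
    try:
        return MODEL_BY_BATCH[num_cameras]
    except KeyError:
        raise ValueError("no model exported for batch size %d" % num_cameras)

print(model_for_cameras(1))
print(model_for_cameras(3))
```

Each file gets its own TensorRT engine, so the first run after switching the camera count will take a while if that engine has not been built yet.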

foreverneilyoung · April 15, 2021, 8:25pm

I see. This should be possible. But it is just a minor issue. I would love to have the full rate as with the other models…

dusty_nv · April 15, 2021, 8:55pm

I believe those models had been pruned with Transfer Learning Toolkit - which you may want to look into for training higher-performance detection models.

I quickly checked the performance again of my fruits model with trtexec on Xavier NX, and the mean latency for the DNN was 2.89ms, so I’m not sure where in the code you are seeing the reduced performance.

foreverneilyoung · April 15, 2021, 9:01pm

Is this “trtexec” also available on the Nano? I’m running on a nano

foreverneilyoung · April 15, 2021, 9:08pm

I’m also getting these warnings while creating the engine:

Not sure if this means something

WARNING: [TRT]: onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
WARNING: [TRT]: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
WARNING: [TRT]: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
WARNING: [TRT]: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
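For context on what "clamped" means here: values that do not fit into 32 bits are replaced by the nearest INT32 limit. A common source of such values is an exporter writing "slice to the end" as the largest INT64 number, in which case the clamp is harmless. Whether that is what happens in this particular model is an assumption; the sketch below only illustrates the clamping itself.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def clamp_to_int32(value):
    """Limit an integer to the range representable as a signed 32-bit value."""
    return max(INT32_MIN, min(INT32_MAX, value))

# Assumed example: an "open-ended" slice stored as the INT64 maximum.
slice_end = 2**63 - 1

print(clamp_to_int32(slice_end))  # 2147483647
print(clamp_to_int32(24))         # 24, small values are unchanged
```

A clamped "end of slice" still points past the end of any realistic tensor, so the result stays the same. A real weight outside the INT32 range would be a different matter.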

foreverneilyoung · April 15, 2021, 9:09pm

But the results are consistent over here:

One camera, 30 fps
Two cameras: 21.8 per cam
Three cameras: 14.9 per cam

And the latency is also visibly higher. Not much, but visible
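Multiplying the per-camera rate by the number of cameras gives the total number of frames inferred per second. A quick sketch with the three measurements above (nothing else is assumed):

```python
# Per-camera frame rates from the measurements above.
observed_fps_per_camera = {1: 30.0, 2: 21.8, 3: 14.9}

def aggregate_fps(num_cameras, fps_per_camera):
    """Total frames per second processed across all cameras."""
    return num_cameras * fps_per_camera

for cams in sorted(observed_fps_per_camera):
    total = aggregate_fps(cams, observed_fps_per_camera[cams])
    print("%d camera(s): %.1f frames/s in total" % (cams, total))
```

This prints 30.0, 43.6 and 44.7 frames/s. With one camera the total is limited by the 30 fps capture rate; with two and three cameras it levels off at around 44 frames/s, which looks like a fixed budget being shared between the cameras.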

dusty_nv · April 15, 2021, 9:16pm

Yes, it is under /usr/src/tensorrt/bin

Run it as: trtexec --onnx=/path/to/your/ssd-mobilenet.onnx --fp16

My testing was with the batch-size 1 model BTW
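To profile several exports in a row, the same command line can be assembled from Python. A minimal sketch that only uses the two flags shown above (`--onnx` and `--fp16`); the function name is made up, and the actual run is left commented out because it only makes sense on the Jetson itself.

```python
import shlex

TRTEXEC = "/usr/src/tensorrt/bin/trtexec"

def trtexec_onnx_command(onnx_path, fp16=True):
    """Build the trtexec argument list for profiling an ONNX model."""
    cmd = [TRTEXEC, "--onnx=" + onnx_path]
    if fp16:
        cmd.append("--fp16")
    return cmd

cmd = trtexec_onnx_command("/path/to/your/ssd-mobilenet.onnx")
print(" ".join(shlex.quote(part) for part in cmd))

# On the Jetson, the command could then be executed with:
# import subprocess
# subprocess.run(cmd, check=True)
```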

foreverneilyoung · April 15, 2021, 9:26pm

Tested with the batch-size=2 model. Where can I see the results? Is it in the “Average on 10 runs” lines? Then I mostly have a GPU latency of 22.3 ms

EDIT: That was with the batch-size=1 model

foreverneilyoung · April 15, 2021, 9:28pm

Here the full result:

/usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-b1.onnx --fp16
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-b1.onnx --fp16
[04/15/2021-23:19:10] [I] === Model Options ===
[04/15/2021-23:19:10] [I] Format: ONNX
[04/15/2021-23:19:10] [I] Model: ssd-mobilenet-b1.onnx
[04/15/2021-23:19:10] [I] Output:
[04/15/2021-23:19:10] [I] === Build Options ===
[04/15/2021-23:19:10] [I] Max batch: 1
[04/15/2021-23:19:10] [I] Workspace: 16 MB
[04/15/2021-23:19:10] [I] minTiming: 1
[04/15/2021-23:19:10] [I] avgTiming: 8
[04/15/2021-23:19:10] [I] Precision: FP32+FP16
[04/15/2021-23:19:10] [I] Calibration:
[04/15/2021-23:19:10] [I] Safe mode: Disabled
[04/15/2021-23:19:10] [I] Save engine:
[04/15/2021-23:19:10] [I] Load engine:
[04/15/2021-23:19:10] [I] Builder Cache: Enabled
[04/15/2021-23:19:10] [I] NVTX verbosity: 0
[04/15/2021-23:19:10] [I] Inputs format: fp32:CHW
[04/15/2021-23:19:10] [I] Outputs format: fp32:CHW
[04/15/2021-23:19:10] [I] Input build shapes: model
[04/15/2021-23:19:10] [I] Input calibration shapes: model
[04/15/2021-23:19:10] [I] === System Options ===
[04/15/2021-23:19:10] [I] Device: 0
[04/15/2021-23:19:10] [I] DLACore:
[04/15/2021-23:19:10] [I] Plugins:
[04/15/2021-23:19:10] [I] === Inference Options ===
[04/15/2021-23:19:10] [I] Batch: 1
[04/15/2021-23:19:10] [I] Input inference shapes: model
[04/15/2021-23:19:10] [I] Iterations: 10
[04/15/2021-23:19:10] [I] Duration: 3s (+ 200ms warm up)
[04/15/2021-23:19:10] [I] Sleep time: 0ms
[04/15/2021-23:19:10] [I] Streams: 1
[04/15/2021-23:19:10] [I] ExposeDMA: Disabled
[04/15/2021-23:19:10] [I] Spin-wait: Disabled
[04/15/2021-23:19:10] [I] Multithreading: Disabled
[04/15/2021-23:19:10] [I] CUDA Graph: Disabled
[04/15/2021-23:19:10] [I] Skip inference: Disabled
[04/15/2021-23:19:10] [I] Inputs:
[04/15/2021-23:19:10] [I] === Reporting Options ===
[04/15/2021-23:19:10] [I] Verbose: Disabled
[04/15/2021-23:19:10] [I] Averages: 10 inferences
[04/15/2021-23:19:10] [I] Percentile: 99
[04/15/2021-23:19:10] [I] Dump output: Disabled
[04/15/2021-23:19:10] [I] Profile: Disabled
[04/15/2021-23:19:10] [I] Export timing to JSON file:
[04/15/2021-23:19:10] [I] Export output to JSON file:
[04/15/2021-23:19:10] [I] Export profile to JSON file:
[04/15/2021-23:19:10] [I]
----------------------------------------------------------------
Input filename: ssd-mobilenet-b1.onnx
ONNX IR version: 0.0.6
Opset version: 9
Producer name: pytorch
Producer version: 1.6
Domain:
Model version: 0
Doc string:
----------------------------------------------------------------
[04/15/2021-23:19:12] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[04/15/2021-23:19:12] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[04/15/2021-23:19:12] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[04/15/2021-23:19:12] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[04/15/2021-23:19:47] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[04/15/2021-23:24:20] [I] [TRT] Detected 1 inputs and 4 output network tensors.
[04/15/2021-23:24:20] [I] Starting inference threads
[04/15/2021-23:24:23] [I] Warmup completed 9 queries over 200 ms
[04/15/2021-23:24:23] [I] Timing trace has 135 queries over 3.03499 s
[04/15/2021-23:24:23] [I] Trace averages of 10 runs:
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.4366 ms - Host latency: 22.5678 ms (end to end 22.6137 ms, enqueue 3.63834 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.2926 ms - Host latency: 22.4237 ms (end to end 22.437 ms, enqueue 3.72401 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.2817 ms - Host latency: 22.4117 ms (end to end 22.425 ms, enqueue 3.66209 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.3675 ms - Host latency: 22.4974 ms (end to end 22.5107 ms, enqueue 3.62407 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.2945 ms - Host latency: 22.4258 ms (end to end 22.4391 ms, enqueue 3.75123 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.358 ms - Host latency: 22.4907 ms (end to end 22.5039 ms, enqueue 4.12003 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.3997 ms - Host latency: 22.5304 ms (end to end 22.5439 ms, enqueue 3.58501 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.2978 ms - Host latency: 22.4282 ms (end to end 22.4414 ms, enqueue 3.46478 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.3844 ms - Host latency: 22.5147 ms (end to end 22.5287 ms, enqueue 3.64071 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.2788 ms - Host latency: 22.4082 ms (end to end 22.4213 ms, enqueue 3.64285 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.285 ms - Host latency: 22.4155 ms (end to end 22.428 ms, enqueue 3.71719 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.3495 ms - Host latency: 22.4802 ms (end to end 22.528 ms, enqueue 3.74258 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.3079 ms - Host latency: 22.4384 ms (end to end 22.4515 ms, enqueue 3.50591 ms)
[04/15/2021-23:24:23] [I] Host Latency
[04/15/2021-23:24:23] [I] min: 22.3263 ms (end to end 22.339 ms)
[04/15/2021-23:24:23] [I] max: 23.9305 ms (end to end 24.2707 ms)
[04/15/2021-23:24:23] [I] mean: 22.4626 ms (end to end 22.4808 ms)
[04/15/2021-23:24:23] [I] median: 22.4178 ms (end to end 22.4315 ms)
[04/15/2021-23:24:23] [I] percentile: 23.4402 ms at 99% (end to end 23.4531 ms at 99%)
[04/15/2021-23:24:23] [I] throughput: 44.4812 qps
[04/15/2021-23:24:23] [I] walltime: 3.03499 s
[04/15/2021-23:24:23] [I] Enqueue Time
[04/15/2021-23:24:23] [I] min: 3.34546 ms
[04/15/2021-23:24:23] [I] max: 4.59851 ms
[04/15/2021-23:24:23] [I] median: 3.61893 ms
[04/15/2021-23:24:23] [I] GPU Compute
[04/15/2021-23:24:23] [I] min: 22.1954 ms
[04/15/2021-23:24:23] [I] max: 23.7953 ms
[04/15/2021-23:24:23] [I] mean: 22.3319 ms
[04/15/2021-23:24:23] [I] median: 22.2871 ms
[04/15/2021-23:24:23] [I] percentile: 23.3093 ms at 99%
[04/15/2021-23:24:23] [I] total compute time: 3.01481 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-b1.onnx --fp16
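The two numbers that matter most in this output are the `throughput` line and the `mean` under `GPU Compute`. A rough sketch for pulling them out of a saved log, assuming the layout shown above (other TensorRT versions may print it differently); the sample lines are copied from this run and the function name is made up.

```python
import re

SAMPLE_LOG = """\
[04/15/2021-23:24:23] [I] Host Latency
[04/15/2021-23:24:23] [I] mean: 22.4626 ms (end to end 22.4808 ms)
[04/15/2021-23:24:23] [I] throughput: 44.4812 qps
[04/15/2021-23:24:23] [I] GPU Compute
[04/15/2021-23:24:23] [I] min: 22.1954 ms
[04/15/2021-23:24:23] [I] max: 23.7953 ms
[04/15/2021-23:24:23] [I] mean: 22.3319 ms
"""

def parse_trtexec_summary(log_text):
    """Extract throughput (qps) and mean GPU compute time (ms) from trtexec output."""
    result = {}
    in_gpu_section = False
    for line in log_text.splitlines():
        # Drop the "[date-time] [I] " prefix in front of each message.
        text = re.sub(r"^\[[^\]]*\]\s*\[\w\]\s*", "", line).strip()
        if text == "GPU Compute":
            in_gpu_section = True
            continue
        match = re.match(r"throughput:\s*([\d.]+)\s*qps", text)
        if match:
            result["throughput_qps"] = float(match.group(1))
        match = re.match(r"mean:\s*([\d.]+)\s*ms", text)
        if match and in_gpu_section:
            # Only the mean inside the "GPU Compute" section is wanted,
            # not the one listed under "Host Latency".
            result["gpu_mean_ms"] = float(match.group(1))
            in_gpu_section = False
    return result

print(parse_trtexec_summary(SAMPLE_LOG))
```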

foreverneilyoung · April 15, 2021, 9:41pm

43 ms with batch-size=2

EDIT: Could I cross-check the other models too with this tool? resnet10-caffemodel and the resnet34-peoplenet?

EDIT2: 63.6 ms for b3

dusty_nv · April 16, 2021, 1:02am

If you have the models or TRT engine file for these, then yes you can use trtexec to profile it.

Your results on Nano would appear consistent with the runtime performance you are getting in DeepStream, so you can try further reducing the number of classes, or using TLT to train your model (which can prune it for faster performance)
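The consistency can be checked with a little arithmetic: one batch contains one frame per camera, so the per-camera frame rate cannot exceed 1000 divided by the batch latency in milliseconds, and never the 30 fps capture rate. A sketch using only numbers posted in this thread, assuming the batch-size=3 figure of 63.6 is also in milliseconds:

```python
CAPTURE_FPS = 30.0

# GPU latency per batch in ms, as measured with trtexec on the Nano (from this thread).
LATENCY_MS = {1: 22.3, 2: 43.0, 3: 63.6}
# Per-camera frame rates seen in DeepStream (from this thread).
OBSERVED_FPS = {1: 30.0, 2: 21.8, 3: 14.9}

def predicted_fps_per_camera(latency_ms, capture_fps=CAPTURE_FPS):
    """Upper bound on the per-camera rate when every batch holds one frame per camera."""
    return min(capture_fps, 1000.0 / latency_ms)

for batch in sorted(LATENCY_MS):
    predicted = predicted_fps_per_camera(LATENCY_MS[batch])
    print("%d camera(s): predicted %.1f fps, observed %.1f fps"
          % (batch, predicted, OBSERVED_FPS[batch]))
```

The predictions come out at 30.0, 23.3 and 15.7 fps, each slightly above the observed rate, which is what one would expect if the network itself is the bottleneck and the rest of the pipeline adds only a small overhead.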

foreverneilyoung · April 16, 2021, 5:31am

dusty_nv wrote: “If you have the models or TRT engine file for these, then yes you can use trtexec to profile it.”

Yes, I think so. I’m just unsure about the parameters for this tool. For the resnet10 caffemodel I have two files: the model file resnet10.caffemodel and a proto file resnet10.prototxt. For resnet34-peoplenet I have resnet34_peoplenet_pruned.etlt.

dusty_nv wrote: “Your results on Nano would appear consistent with the runtime performance you are getting in DeepStream, so you can try further reducing the number of classes, or using TLT to train your model (which can prune it for faster performance)”

OK, thanks for the confirmation. That means that you see no major flaw in my training (?).

a) Would the fact that I only used 1600 images (there were no more to download) and 40 epochs be a reason for the poor results, which are an order of magnitude worse than yours?

or

b) Do you think this is really the gap between Nano and Xavier?

And, if I may ask an additional question: for a further reduction of objects, what is decisive? The “labels.txt” file or the number of items found in data/fruit? It looked like download.py doesn’t really do anything if it is just called with fewer --class-names, even though it reports that it is downloading something.
