Holy sh… If this was like so in your tutorial, then yes… let me check
I didn’t provide a batch-size, so it has been 1 for sure… Will check that.
Yepp. This was the reason. The engine was re-created after I re-created the ONNX model with batch-size=3. But this wasn’t the reason for the slow inference: the rate only went up by about one frame per camera, so all 3 cams are now running at 15 fps. And this with an MJPEG capture of 640x480.
Unfortunately this is a disappointing result after all these efforts. I would have bet it would go through the roof…
EDIT: @dusty_nv Well, I had really hoped it was the parser that was making it that slow. But even if I return nothing from it - the inference rate is 14.8 per cam… This pretty much contradicts everything I was trying to achieve…
What a pity…
What is the rate the cameras run simultaneously without inference?
Offhand I recall that the 9-class model has higher performance than that.
I’m pretty sure they will each run at the plain 30 fps capture rate. I cannot check it right now, since I would have to refactor a good part of my code. But I just switched the model from the ONNX one to the default resnet10 caffemodel: full inference for 4 classes (person, car, bicycle, roadsign) runs at exactly 28.4 fps on each camera with 3 cameras attached.
EDIT: I will re-train the model tomorrow to 1 or 2 fruits only. I suppose it will be faster then… For now I’m too disappointed…
EDIT2: I could give your 100 epoch a shot…
EDIT3: No, this has for sure been created for batch-size=1, I guess… Will not work for comparison.
EDIT4: Also resnet34-peoplenet runs at 25.4 each.
The other strange thing is that I now HAVE to use 3 cameras. Each attempt to go back to just 1 or 2 ends in an error. Below is what happens when using a “batch-size=3” engine with just 1 camera. Switching back and forth works very well with the other models, without any need to use another engine.
ERROR: [TRT]: Transpose_186: reshaping failed for tensor: 432
ERROR: [TRT]: shapeMachine.cpp (160) - Shape Error in executeReshape: reshape would change volume
ERROR: [TRT]: Instruction: RESHAPE{1 24 1 1} {3 1 1 24}
ERROR: Failed to enqueue trt inference batch
ERROR: Infer context enqueue buffer failed, nvinfer error:NVDSINFER_TENSORRT_ERROR
0:00:05.001359766 29388 0x354cb70 WARN nvinfer gstnvinfer.cpp:1225:gst_nvinfer_input_queue_loop:<primary-inference> error: Failed to queue input batch for inferencing
2021-04-15 22:10:08,503 inference.py DEBUG : Stream 0, FPS: 0.0
Error: gst-stream-error-quark: Failed to queue input batch for inferencing (1): /dvs/git/dirty/git-master_linux/deepstream/sdk/src/gst-plugins/gst-nvinfer/gstnvinfer.cpp(1225): gst_nvinfer_input_queue_loop (): /GstPipeline:pipeline0/GstNvInfer:primary-inference
ERROR: [TRT]: Transpose_186: reshaping failed for tensor: 432
ERROR: [TRT]: shapeMachine.cpp (160) - Shape Error in executeReshape: reshape would change volume
ERROR: [TRT]: Instruction: RESHAPE{1 24 1 1} {3 1 1 24}
ERROR: Failed to enqueue trt inference batch
ERROR: Infer context enqueue buffer failed, nvinfer error:NVDSINFER_TENSORRT_ERROR
0:00:05.019086167 29388 0x354cb70 WARN nvinfer gstnvinfer.cpp:1225:gst_nvinfer_input_queue_loop:<primary-inference> error: Failed to queue input batch for inferencing
ERROR: [TRT]: Transpose_186: reshaping failed for tensor: 432
ERROR: [TRT]: shapeMachine.cpp (160) - Shape Error in executeReshape: reshape would change volume
ERROR: [TRT]: Instruction: RESHAPE{1 24 1 1} {3 1 1 24}
ERROR: Failed to enqueue trt inference batch
ERROR: Infer context enqueue buffer failed, nvinfer error:NVDSINFER_TENSORRT_ERROR
0:00:05.057258980 29388 0x354cb70 WARN nvinfer gstnvinfer.cpp:1225:gst_nvinfer_input_queue_loop:<primary-inference> error: Failed to queue input batch for inferencing
ERROR: [TRT]: Transpose_186: reshaping failed for tensor: 432
ERROR: [TRT]: shapeMachine.cpp (160) - Shape Error in executeReshape: reshape would change volume
ERROR: [TRT]: Instruction: RESHAPE{1 24 1 1} {3 1 1 24}
ERROR: Failed to enqueue trt inference batch
ERROR: Infer context enqueue buffer failed, nvinfer error:NVDSINFER_TENSORRT_ERROR
0:00:05.087307282 29388 0x354cb70 WARN nvinfer gstnvinfer.cpp:1225:gst_nvinfer_input_queue_loop:<primary-inference> error: Failed to queue input batch for inferencing
Those other models may have been exported with dynamic axes for the batch dimension, so they can change the batch size at runtime. I thought the batch size could be changed as long as it stayed below the maximum the engine was built for, but that appears not to work in this case. Without digging into the dynamic axes in PyTorch, the easiest approach would probably be to export the model three times, once for each batch size you want.
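The shape error in the log above can be read as a volume mismatch, which would explain why the batch-size=3 engine rejects a single-camera batch. A reshape has to preserve the number of elements, and the two shapes in the RESHAPE line do not. A minimal sketch of that check in plain Python (the helper `volume` is made up for illustration, this is not TensorRT's own code, and the reading of which shape is the input and which is the target is an assumption):

```python
from math import prod

def volume(shape):
    """Number of elements in a tensor with the given dimensions."""
    return prod(shape)

# Shapes copied from the log line: RESHAPE{1 24 1 1} {3 1 1 24}
runtime_shape = (1, 24, 1, 1)  # presumably what a single camera delivers
engine_shape = (3, 1, 1, 24)   # presumably fixed by the batch-size=3 export

# A reshape is only legal when both shapes hold the same element count.
can_reshape = volume(runtime_shape) == volume(engine_shape)
print(volume(runtime_shape), volume(engine_shape), can_reshape)
```

With 24 elements on one side and 72 on the other, the reshape cannot succeed, which matches "reshape would change volume". Exporting one ONNX file per batch size sidesteps this because each export carries its own fixed shape.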
I see. This should be possible. But it is just a minor issue. I would love to have the full rate as with the other models…
I believe those models had been pruned with Transfer Learning Toolkit - which you may want to look into for training higher-performance detection models.
I quickly checked the performance again of my fruits model with trtexec on Xavier NX, and the mean latency for the DNN was 2.89ms, so I’m not sure where in the code you are seeing the reduced performance.
Is this “trtexec” also available on the Nano? I’m running on a nano
I’m also getting these warnings while creating the engine. Not sure if they mean anything:
WARNING: [TRT]: onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
WARNING: [TRT]: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
WARNING: [TRT]: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
WARNING: [TRT]: onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
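For context on the "clamped" warnings: clamping means any 64-bit integer constant that does not fit into 32 bits is replaced by the nearest representable 32-bit value. One plausible source (an assumption, the log does not confirm it) is an exporter writing a very large sentinel such as the INT64 maximum to mean "up to the end" in a slicing operation, in which case the clamped value still means the same thing. A small sketch of the arithmetic, with a hypothetical helper name:

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1

def clamp_to_int32(value):
    """Clamp an integer into the signed 32-bit range."""
    return max(INT32_MIN, min(INT32_MAX, value))

# A sentinel "end" index is pulled down to the INT32 maximum ...
print(clamp_to_int32(INT64_MAX))
# ... while ordinary small constants pass through unchanged.
print(clamp_to_int32(24))
```

If the clamped constants are sentinels of this kind, the warning is harmless; it would only matter if a real weight or index larger than INT32 were needed.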
But the results are consistent over here:
One camera, 30 fps
Two cameras: 21.8 per cam
Three cameras: 14.9 per cam
And the latency is also visibly higher. Not much, but visible
Yes, it is under /usr/src/tensorrt/bin
Run it as: trtexec --onnx=/path/to/your/ssd-mobilenet.onnx --fp16
My testing was with the batch-size 1 model BTW
Tested with batch-size=2 model. Where can I see the results? Is it in the “Average on 10 runs” line? Then I mostly have a GPU latency of 22.3 ms
EDIT: With the batch-size =1 model
Here the full result:
/usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-b1.onnx --fp16
&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-b1.onnx --fp16
[04/15/2021-23:19:10] [I] === Model Options ===
[04/15/2021-23:19:10] [I] Format: ONNX
[04/15/2021-23:19:10] [I] Model: ssd-mobilenet-b1.onnx
[04/15/2021-23:19:10] [I] Output:
[04/15/2021-23:19:10] [I] === Build Options ===
[04/15/2021-23:19:10] [I] Max batch: 1
[04/15/2021-23:19:10] [I] Workspace: 16 MB
[04/15/2021-23:19:10] [I] minTiming: 1
[04/15/2021-23:19:10] [I] avgTiming: 8
[04/15/2021-23:19:10] [I] Precision: FP32+FP16
[04/15/2021-23:19:10] [I] Calibration:
[04/15/2021-23:19:10] [I] Safe mode: Disabled
[04/15/2021-23:19:10] [I] Save engine:
[04/15/2021-23:19:10] [I] Load engine:
[04/15/2021-23:19:10] [I] Builder Cache: Enabled
[04/15/2021-23:19:10] [I] NVTX verbosity: 0
[04/15/2021-23:19:10] [I] Inputs format: fp32:CHW
[04/15/2021-23:19:10] [I] Outputs format: fp32:CHW
[04/15/2021-23:19:10] [I] Input build shapes: model
[04/15/2021-23:19:10] [I] Input calibration shapes: model
[04/15/2021-23:19:10] [I] === System Options ===
[04/15/2021-23:19:10] [I] Device: 0
[04/15/2021-23:19:10] [I] DLACore:
[04/15/2021-23:19:10] [I] Plugins:
[04/15/2021-23:19:10] [I] === Inference Options ===
[04/15/2021-23:19:10] [I] Batch: 1
[04/15/2021-23:19:10] [I] Input inference shapes: model
[04/15/2021-23:19:10] [I] Iterations: 10
[04/15/2021-23:19:10] [I] Duration: 3s (+ 200ms warm up)
[04/15/2021-23:19:10] [I] Sleep time: 0ms
[04/15/2021-23:19:10] [I] Streams: 1
[04/15/2021-23:19:10] [I] ExposeDMA: Disabled
[04/15/2021-23:19:10] [I] Spin-wait: Disabled
[04/15/2021-23:19:10] [I] Multithreading: Disabled
[04/15/2021-23:19:10] [I] CUDA Graph: Disabled
[04/15/2021-23:19:10] [I] Skip inference: Disabled
[04/15/2021-23:19:10] [I] Inputs:
[04/15/2021-23:19:10] [I] === Reporting Options ===
[04/15/2021-23:19:10] [I] Verbose: Disabled
[04/15/2021-23:19:10] [I] Averages: 10 inferences
[04/15/2021-23:19:10] [I] Percentile: 99
[04/15/2021-23:19:10] [I] Dump output: Disabled
[04/15/2021-23:19:10] [I] Profile: Disabled
[04/15/2021-23:19:10] [I] Export timing to JSON file:
[04/15/2021-23:19:10] [I] Export output to JSON file:
[04/15/2021-23:19:10] [I] Export profile to JSON file:
[04/15/2021-23:19:10] [I]
----------------------------------------------------------------
Input filename: ssd-mobilenet-b1.onnx
ONNX IR version: 0.0.6
Opset version: 9
Producer name: pytorch
Producer version: 1.6
Domain:
Model version: 0
Doc string:
----------------------------------------------------------------
[04/15/2021-23:19:12] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[04/15/2021-23:19:12] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[04/15/2021-23:19:12] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[04/15/2021-23:19:12] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[04/15/2021-23:19:47] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[04/15/2021-23:24:20] [I] [TRT] Detected 1 inputs and 4 output network tensors.
[04/15/2021-23:24:20] [I] Starting inference threads
[04/15/2021-23:24:23] [I] Warmup completed 9 queries over 200 ms
[04/15/2021-23:24:23] [I] Timing trace has 135 queries over 3.03499 s
[04/15/2021-23:24:23] [I] Trace averages of 10 runs:
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.4366 ms - Host latency: 22.5678 ms (end to end 22.6137 ms, enqueue 3.63834 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.2926 ms - Host latency: 22.4237 ms (end to end 22.437 ms, enqueue 3.72401 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.2817 ms - Host latency: 22.4117 ms (end to end 22.425 ms, enqueue 3.66209 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.3675 ms - Host latency: 22.4974 ms (end to end 22.5107 ms, enqueue 3.62407 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.2945 ms - Host latency: 22.4258 ms (end to end 22.4391 ms, enqueue 3.75123 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.358 ms - Host latency: 22.4907 ms (end to end 22.5039 ms, enqueue 4.12003 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.3997 ms - Host latency: 22.5304 ms (end to end 22.5439 ms, enqueue 3.58501 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.2978 ms - Host latency: 22.4282 ms (end to end 22.4414 ms, enqueue 3.46478 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.3844 ms - Host latency: 22.5147 ms (end to end 22.5287 ms, enqueue 3.64071 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.2788 ms - Host latency: 22.4082 ms (end to end 22.4213 ms, enqueue 3.64285 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.285 ms - Host latency: 22.4155 ms (end to end 22.428 ms, enqueue 3.71719 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.3495 ms - Host latency: 22.4802 ms (end to end 22.528 ms, enqueue 3.74258 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.3079 ms - Host latency: 22.4384 ms (end to end 22.4515 ms, enqueue 3.50591 ms)
[04/15/2021-23:24:23] [I] Host Latency
[04/15/2021-23:24:23] [I] min: 22.3263 ms (end to end 22.339 ms)
[04/15/2021-23:24:23] [I] max: 23.9305 ms (end to end 24.2707 ms)
[04/15/2021-23:24:23] [I] mean: 22.4626 ms (end to end 22.4808 ms)
[04/15/2021-23:24:23] [I] median: 22.4178 ms (end to end 22.4315 ms)
[04/15/2021-23:24:23] [I] percentile: 23.4402 ms at 99% (end to end 23.4531 ms at 99%)
[04/15/2021-23:24:23] [I] throughput: 44.4812 qps
[04/15/2021-23:24:23] [I] walltime: 3.03499 s
[04/15/2021-23:24:23] [I] Enqueue Time
[04/15/2021-23:24:23] [I] min: 3.34546 ms
[04/15/2021-23:24:23] [I] max: 4.59851 ms
[04/15/2021-23:24:23] [I] median: 3.61893 ms
[04/15/2021-23:24:23] [I] GPU Compute
[04/15/2021-23:24:23] [I] min: 22.1954 ms
[04/15/2021-23:24:23] [I] max: 23.7953 ms
[04/15/2021-23:24:23] [I] mean: 22.3319 ms
[04/15/2021-23:24:23] [I] median: 22.2871 ms
[04/15/2021-23:24:23] [I] percentile: 23.3093 ms at 99%
[04/15/2021-23:24:23] [I] total compute time: 3.01481 s
&&&& PASSED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=ssd-mobilenet-b1.onnx --fp16
43 ms with batch-size=2
EDIT: Could I cross-check the other models too with this tool? resnet10-caffemodel and the resnet34-peoplenet?
EDIT2: 63.6 for b3
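On where to read the result: the "Average on 10 runs" lines are per-window averages, and the overall figure is the "mean" under "GPU Compute" near the end of the log (22.3319 ms in the run above). If you want to pull the window values out of a saved log, a rough sketch could look like this (the function name and the regular expression are my own, written against the log format pasted above and not guaranteed for other trtexec versions):

```python
import re
from statistics import mean

# Matches the GPU latency figure in the "Average on 10 runs" lines.
GPU_LATENCY = re.compile(r"GPU latency: ([0-9.]+) ms")

def gpu_latencies_ms(log_text):
    """Return every per-window GPU latency found in the log, in ms."""
    return [float(m) for m in GPU_LATENCY.findall(log_text)]

# Two lines copied from the trtexec output above.
sample = """\
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.4366 ms - Host latency: 22.5678 ms (end to end 22.6137 ms, enqueue 3.63834 ms)
[04/15/2021-23:24:23] [I] Average on 10 runs - GPU latency: 22.2926 ms - Host latency: 22.4237 ms (end to end 22.437 ms, enqueue 3.72401 ms)
"""

values = gpu_latencies_ms(sample)
avg = mean(values)
print(f"{len(values)} windows, mean GPU latency {avg:.2f} ms, "
      f"about {1000 / avg:.1f} inferences/s")
```

Dividing 1000 by the latency in milliseconds gives inferences per second, which is close to the "throughput" line trtexec prints itself.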
If you have the models or TRT engine file for these, then yes you can use trtexec to profile it.
Your results on Nano would appear consistent with the runtime performance you are getting in DeepStream, so you can try further reducing the number of classes, or using TLT to train your model (which can prune it for faster performance)
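The consistency can be checked with a bit of arithmetic: if the GPU needs about 22.3 ms per frame, it can process roughly 45 frames per second in total, and that budget is shared between the cameras. A minimal sketch using the latencies reported in this thread (the function and its simple model are an illustration only; they ignore capture, decoding and other pipeline overhead):

```python
def per_camera_fps(latency_ms, batch_size, num_cameras, capture_fps=30.0):
    """Rough upper bound on FPS per camera when all cameras share one GPU.

    latency_ms  -- time for one inference call (one batch)
    batch_size  -- frames processed per inference call
    """
    frames_per_second = batch_size * 1000.0 / latency_ms
    # A camera can never deliver more than its capture rate.
    return min(capture_fps, frames_per_second / num_cameras)

# Batch-size 1 engine, mean GPU compute 22.3319 ms (trtexec run above).
for cams in (1, 2, 3):
    print(cams, "camera(s):", round(per_camera_fps(22.3319, 1, cams), 1))

# Batch-size 3 engine, 63.6 ms per batch, three cameras.
print("b3, 3 cameras:", round(per_camera_fps(63.6, 3, 3), 1))
```

This gives about 30, 22.4 and 14.9 fps for one, two and three cameras, against the measured 30, 21.8 and 14.9. Batching barely helps here because the batch-3 latency (63.6 ms) is nearly three times the batch-1 latency, so the model itself is the bottleneck on the Nano.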
If you have the models or TRT engine file for these, then yes you can use trtexec to profile it.
Yes, I think so. I’m just unsure about the parameters for this tool. For the resnet10 caffemodel I have two files: the model file resnet10.caffemodel and a proto file resnet10.prototxt. For the resnet34-peoplenet one, I have resnet34_peoplenet_pruned.etlt.
Your results on Nano would appear consistent with the runtime performance you are getting in DeepStream, so you can try further reducing the number of classes, or using TLT to train your model (which can prune it for faster performance)
OK, thanks for the confirmation. That means you see no major flaw in my training (?).
a) Would the fact that I only used 1600 images (there were no more to download) and 40 epochs be a reason for the poor results, which are an order of magnitude lower than yours?
or
b) Do you think this is really the gap between Nano and Xavier?
And, if I may ask an additional question: for a further reduction of objects, what is the deciding factor? The “labels.txt” file or the number of items found in data/fruit? It looked like download.py does not really do anything if it is just called with fewer --class-names, even though it reports that it is downloading something.