I’m trying to run my custom ResNet-based model with jetson-inference. The model was trained in PyTorch 1.7.0 and then exported to ONNX at opset version 11. I’m able to benchmark it with trtexec, but when I use ./segnet-console or ./segnet-console.py with the appropriate arguments for the model, input_blob, output_blob, labels, and colors, I get the following error:
[TRT] binding to input 0 image.1 binding index: 0
[TRT] binding to input 0 image.1 dims (b=1 c=3 h=1024 w=2048) size=25165824
[TRT] binding to output 0 391 binding index: 8
[TRT] binding to output 0 391 dims (b=1 c=12 h=1024 w=2048) size=100663296
[TRT]
[TRT] device GPU, /home/user/models/file_opset11_2048x1024.onnx initialized.
[TRT] segNet outputs -- s_w 2048 s_h 1024 s_c 12
[image] loaded 'images/warehouse.jpg' (2048x1024, 3 channels)
[TRT] ../rtSafe/cuda/cudaConvolutionRunner.cpp (457) - Cudnn Error in execute: 8 (CUDNN_STATUS_EXECUTION_FAILED)
[TRT] FAILED_EXECUTION: std::exception
[TRT] failed to execute TensorRT context on device GPU
segnet: failed to process segmentation
[image] imageLoader -- End of Stream (EOS) has been reached, stream has been closed
segnet: shutting down...
[cuda] an illegal memory access was encountered (error 700) (hex 0x2BC)
[cuda] /home/user/dev/jetson-inference/utils/image/imageLoader.cpp:105
[TRT] ../rtSafe/safeRuntime.cpp (32) - Cuda Error in free: 700 (an illegal memory access was encountered)
terminate called after throwing an instance of 'nvinfer1::CudaError'
what(): std::exception
[1] 18973 abort (core dumped) ./segnet-console --model=/home/user/models/file_opset11_2048x1024.onnx
Could you please advise me on how to resolve this issue? Thanks!
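For reference, the export step looks roughly like the sketch below (the stand-in network and dummy input are placeholders for the real custom ResNet; only the opset version and output file name match what was described above):

import torch
import torch.nn as nn

# Placeholder for the custom ResNet-based segmentation network trained in
# PyTorch 1.7.0; the real model is not shown here.
class TinySegNet(nn.Module):
    def __init__(self, num_classes=12):
        super().__init__()
        self.conv = nn.Conv2d(3, num_classes, kernel_size=3, padding=1)

    def forward(self, image):
        return self.conv(image)

model = TinySegNet().eval()
dummy = torch.randn(1, 3, 1024, 2048)  # fixed 2048x1024 input, as in the logs above

# Export at opset 11. No input_names/output_names are passed, so the graph keeps
# auto-generated tensor names, which is why names like "image.1" and "391" show
# up in the TensorRT bindings.
torch.onnx.export(model, dummy, "file_opset11_2048x1024.onnx", opset_version=11)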
Also, it appears your model is internally performing the deconvolution (because the output dims == the input dims), so when you process the overlay you want to use point filtering (i.e. run the segnet program with the --filter-mode=point flag). That way it doesn’t needlessly perform bilinear interpolation on the output. I remove the deconv layer from my FCN-ResNet models because it’s linear anyway, and the upsampling is faster in my bilinear interpolation kernel.
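If you want to try the same with your model, the sketch below shows the general idea (purely illustrative: the stand-in network and the position of the deconv layer are assumptions, and in your own code you would replace your trained model’s actual upsampling module instead):

import torch
import torch.nn as nn

# Illustrative stand-in: a "backbone + classifier" that downsamples by 32x,
# followed by a learned deconvolution that upsamples back to the input size.
model = nn.Sequential(
    nn.Conv2d(3, 12, kernel_size=7, stride=32, padding=3),              # 12x32x64 score grid for a 1024x2048 input
    nn.ConvTranspose2d(12, 12, kernel_size=64, stride=32, padding=16),  # learned 32x upsample back to 12x1024x2048
)

# Drop the deconv so the exported ONNX returns the low-resolution score grid;
# segnet's own filtering (bilinear or point) then handles the upsampling.
model[-1] = nn.Identity()
torch.onnx.export(model.eval(), torch.randn(1, 3, 1024, 2048),
                  "model_no_deconv.onnx", opset_version=11)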
Also, when I try trtexec with --fp16, it passes, but one of the lines says: “[I] Precision: FP32+FP16” and “[I] Inputs format: fp32:CHW, [I] Outputs format: fp32:CHW”.
Have you tried just specifying the ONNX model to the --model argument? It will select the .engine automatically if it exists, otherwise it will build it. You can put your custom model in a new directory.
With your help I managed to get everything working really well; the inference output looks great (after editing the pre-processing you mentioned earlier). Thanks @dusty_nv! Your help has been invaluable!
One thing remains; maybe you can point me in the right direction. With jetson_clocks enabled, I benchmarked both my model and FCN-ResNet18-Cityscapes-2048x1024 with trtexec and jetson-inference.
segnet.py results:
FCN-ResNet18-Cityscapes-2048x1024 - Total CPU 24.54ms CUDA 24.56ms
CUSTOM-FCN-ResNet18-Model-2048x1024 - Total CPU 926.48ms CUDA 926.67ms
How come the segnet.py results differ so much, when trtexec reports the two models to be much closer in value?
FCN-ResNet18-Cityscapes-2048x1024
[I] Host Latency
[I] min: 22.5043 ms (end to end 22.5135 ms)
[I] max: 23.5806 ms (end to end 23.5916 ms)
[I] mean: 23.118 ms (end to end 23.1271 ms)
[I] median: 23.1215 ms (end to end 23.1316 ms)
[I] percentile: 23.5413 ms at 99% (end to end 23.5557 ms at 99%)
[I] throughput: 0 qps
[I] walltime: 3.05278 s
[I] Enqueue Time
[I] min: 0.233643 ms
[I] max: 0.768311 ms
[I] median: 0.32074 ms
[I] GPU Compute
[I] min: 21.8194 ms
[I] max: 22.8459 ms
[I] mean: 22.4053 ms
[I] median: 22.4059 ms
[I] percentile: 22.8337 ms at 99%
[I] total compute time: 2.9575 s
&&&& PASSED TensorRT.trtexec # trtexec --loadEngine=fcn_resnet18.onnx.1.1.7103.GPU.FP16.engine --fp16 --shapes=image.1:3x1024x2048 --explicitBatch
./segnet.py --network=fcn-resnet18-cityscapes-2048x1024 in.png out.png
...
[TRT] CUDA engine context initialized on device GPU:
[TRT] -- layers 25
[TRT] -- maxBatchSize 1
[TRT] -- workspace 0
[TRT] -- deviceMemory 132862976
[TRT] -- bindings 2
...
[TRT] binding to input 0 input_0 binding index: 0
[TRT] binding to input 0 input_0 dims (b=1 c=3 h=1024 w=2048) size=25165824
[TRT] binding to output 0 output_0 binding index: 1
[TRT] binding to output 0 output_0 dims (b=1 c=21 h=32 w=64) size=172032
...
[TRT] ------------------------------------------------
[TRT] Timing Report networks/FCN-ResNet18-Cityscapes-2048x1024/fcn_resnet18.onnx
[TRT] ------------------------------------------------
[TRT] Pre-Process CPU 0.06714ms CUDA 0.43469ms
[TRT] Network CPU 23.90803ms CUDA 23.33222ms
[TRT] Post-Process CPU 0.54134ms CUDA 0.53978ms
[TRT] Visualize CPU 0.02618ms CUDA 0.25395ms
[TRT] Total CPU 24.54268ms CUDA 24.56064ms
[TRT] ------------------------------------------------
CUSTOM-FCN-ResNet18-Model-2048x1024
[04/13/2021-22:41:26] [I] Host Latency
[04/13/2021-22:41:26] [I] min: 48.5713 ms (end to end 48.5793 ms)
[04/13/2021-22:41:26] [I] max: 49.9407 ms (end to end 49.9504 ms)
[04/13/2021-22:41:26] [I] mean: 49.2631 ms (end to end 49.3496 ms)
[04/13/2021-22:41:26] [I] median: 49.2493 ms (end to end 49.3837 ms)
[04/13/2021-22:41:26] [I] percentile: 49.9407 ms at 99% (end to end 49.9504 ms at 99%)
[04/13/2021-22:41:26] [I] throughput: 0 qps
[04/13/2021-22:41:26] [I] walltime: 2.66488 s
[04/13/2021-22:41:26] [I] Enqueue Time
[04/13/2021-22:41:26] [I] min: 0.815918 ms
[04/13/2021-22:41:26] [I] max: 1.61914 ms
[04/13/2021-22:41:26] [I] median: 0.971466 ms
[04/13/2021-22:41:26] [I] GPU Compute
[04/13/2021-22:41:26] [I] min: 44.8051 ms
[04/13/2021-22:41:26] [I] max: 46.0015 ms
[04/13/2021-22:41:26] [I] mean: 45.3891 ms
[04/13/2021-22:41:26] [I] median: 45.3817 ms
[04/13/2021-22:41:26] [I] percentile: 46.0015 ms at 99%
[04/13/2021-22:41:26] [I] total compute time: 2.45101 s
&&&& PASSED TensorRT.trtexec # trtexec --loadEngine=CUSTOM-FCN-ResNet18-Model-2048x1024.onnx.1.1.7103.GPU.FP16.engine --fp16 --shapes=image.1:3x1024x2048 --explicitBatch
./segnet.py --model=CUSTOM-FCN-ResNet18-Model-2048x1024.onnx --labels=classes.txt --colors=colors.txt --input_blob=image.1 --output_blob=391 in.png out.png --filter-mode=point
...
[TRT] CUDA engine context initialized on device GPU:
[TRT] -- layers 72
[TRT] -- maxBatchSize 1
[TRT] -- workspace 0
[TRT] -- deviceMemory 167380480
[TRT] -- bindings 2
...
[TRT] binding to input 0 image.1 binding index: 0
[TRT] binding to input 0 image.1 dims (b=1 c=3 h=1024 w=2048) size=25165824
[TRT] binding to output 0 391 binding index: 1
[TRT] binding to output 0 391 dims (b=1 c=12 h=1024 w=2048) size=100663296
...
[TRT] ------------------------------------------------
[TRT] Timing Report CUSTOM-FCN-ResNet18-Model-2048x1024.onnx
[TRT] ------------------------------------------------
[TRT] Pre-Process CPU 0.05667ms CUDA 0.44134ms
[TRT] Network CPU 615.24158ms CUDA 614.61536ms
[TRT] Post-Process CPU 311.16263ms CUDA 311.36148ms
[TRT] Visualize CPU 0.01802ms CUDA 0.25498ms
[TRT] Total CPU 926.47894ms CUDA 926.67322ms
[TRT] ------------------------------------------------
Once again, you’re right! Thanks for your answer. On every subsequent frame, the network part drops to 45-ish ms. Post-process remains at 311ms, but that should be easier for me to look into.
That post-processing step is where the class with the highest probability is selected for each pixel. It just runs on the CPU, which is normally fine for my models because they don’t do the deconv/upsample inside the model, so the probability/scores grid is low-resolution. Your probability/scores grid, however, is the same resolution as the input, because you are doing the deconv/upsample inside your model.
You could add more timing around that loop in the C++ code to see if that is indeed what is slowing it down.
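To get a rough feel for how much bigger that per-pixel argmax is at full resolution, here is a small NumPy sketch (purely illustrative; the actual post-processing in jetson-inference is C++ code, not NumPy):

import time
import numpy as np

def time_argmax(c, h, w):
    # Per-pixel argmax over the class scores, analogous to the CPU post-processing step.
    scores = np.random.rand(c, h, w).astype(np.float32)
    start = time.perf_counter()
    np.argmax(scores, axis=0)
    return (time.perf_counter() - start) * 1000.0

# FCN-ResNet18-Cityscapes outputs a 21x32x64 score grid (deconv left out of the model),
# while the custom model outputs a full-resolution 12x1024x2048 grid.
print("low-res grid  : %.2f ms" % time_argmax(21, 32, 64))
print("full-res grid : %.2f ms" % time_argmax(12, 1024, 2048))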