I’m trying to run my custom ResNet-based model with jetson-inference. The model was trained in PyTorch 1.7.0 and then exported to ONNX at opset version 11. I’m able to benchmark it with trtexec, but when I use ./segnet-console or ./segnet-console.py with the appropriate arguments for the model, input_blob, output_blob, labels, and colors, I get the following error:
[TRT] binding to input 0 image.1 binding index: 0
[TRT] binding to input 0 image.1 dims (b=1 c=3 h=1024 w=2048) size=25165824
[TRT] binding to output 0 391 binding index: 8
[TRT] binding to output 0 391 dims (b=1 c=12 h=1024 w=2048) size=100663296
[TRT]
[TRT] device GPU, /home/user/models/file_opset11_2048x1024.onnx initialized.
[TRT] segNet outputs -- s_w 2048 s_h 1024 s_c 12
[image] loaded 'images/warehouse.jpg' (2048x1024, 3 channels)
[TRT] ../rtSafe/cuda/cudaConvolutionRunner.cpp (457) - Cudnn Error in execute: 8 (CUDNN_STATUS_EXECUTION_FAILED)
[TRT] FAILED_EXECUTION: std::exception
[TRT] failed to execute TensorRT context on device GPU
segnet: failed to process segmentation
[image] imageLoader -- End of Stream (EOS) has been reached, stream has been closed
segnet: shutting down...
[cuda] an illegal memory access was encountered (error 700) (hex 0x2BC)
[cuda] /home/user/dev/jetson-inference/utils/image/imageLoader.cpp:105
[TRT] ../rtSafe/safeRuntime.cpp (32) - Cuda Error in free: 700 (an illegal memory access was encountered)
terminate called after throwing an instance of 'nvinfer1::CudaError'
what(): std::exception
[1] 18973 abort (core dumped) ./segnet-console --model=/home/user/models/file_opset11_2048x1024.onnx
Could you please advise me on how to resolve this issue? Thanks!
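For reference, the export step looks roughly like the sketch below (the stand-in network and dummy input are placeholders for the real custom ResNet; only the opset version and output file name match what was described above):

import torch
import torch.nn as nn

# Placeholder for the custom ResNet-based segmentation network trained in
# PyTorch 1.7.0; the real model is not shown here.
class TinySegNet(nn.Module):
    def __init__(self, num_classes=12):
        super().__init__()
        self.conv = nn.Conv2d(3, num_classes, kernel_size=3, padding=1)

    def forward(self, image):
        return self.conv(image)

model = TinySegNet().eval()
dummy = torch.randn(1, 3, 1024, 2048)  # fixed 2048x1024 input, as in the logs above

# Export at opset 11. No input_names/output_names are passed, so the graph keeps
# auto-generated tensor names, which is why names like "image.1" and "391" show
# up in the TensorRT bindings.
torch.onnx.export(model, dummy, "file_opset11_2048x1024.onnx", opset_version=11)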
Also, it appears your model is internally performing the deconvolution (because the output dims == the input dims), so when you process the overlay you want to use point filtering (i.e. run the segnet program with the --filter-mode=point flag). That way it doesn’t needlessly perform bilinear interpolation on the output. I remove the deconv layer from my FCN-ResNet models because it’s linear anyway, and the upsampling is faster in my bilinear interpolation kernel.
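If you want to try the same with your model, the sketch below shows the general idea (purely illustrative: the stand-in network and the position of the deconv layer are assumptions, and in your own code you would replace your trained model’s actual upsampling module instead):

import torch
import torch.nn as nn

# Illustrative stand-in: a "backbone + classifier" that downsamples by 32x,
# followed by a learned deconvolution that upsamples back to the input size.
model = nn.Sequential(
    nn.Conv2d(3, 12, kernel_size=7, stride=32, padding=3),              # 12x32x64 score grid for a 1024x2048 input
    nn.ConvTranspose2d(12, 12, kernel_size=64, stride=32, padding=16),  # learned 32x upsample back to 12x1024x2048
)

# Drop the deconv so the exported ONNX returns the low-resolution score grid;
# segnet's own filtering (bilinear or point) then handles the upsampling.
model[-1] = nn.Identity()
torch.onnx.export(model.eval(), torch.randn(1, 3, 1024, 2048),
                  "model_no_deconv.onnx", opset_version=11)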
Also, when I try trtexec with --fp16, it passes, but one of the lines says: “[I] Precision: FP32+FP16” and “[I] Inputs format: fp32:CHW, [I] Outputs format: fp32:CHW”.
Have you tried just specifying the ONNX model to the --model argument? It will select the .engine automatically if it exists, otherwise it will build it. You can put your custom model in a new directory.
With your help I managed to get everything working really well; the inference output looks great (after editing the pre-processing you mentioned earlier). Thanks @dusty_nv! Your help has been invaluable!
One thing remains; maybe you can point me in the right direction. With jetson_clocks enabled, I benchmarked both my model and FCN-ResNet18-Cityscapes-2048x1024 with trtexec and jetson-inference.
segnet.py results:
FCN-ResNet18-Cityscapes-2048x1024 - Total CPU 24.54ms CUDA 24.56ms
CUSTOM-FCN-ResNet18-Model-2048x1024 - Total CPU 926.48ms CUDA 926.67ms
How come the segnet.py results differ so much, when trtexec reports the two models to be much closer in value?
FCN-ResNet18-Cityscapes-2048x1024
[I] Host Latency
[I] min: 22.5043 ms (end to end 22.5135 ms)
[I] max: 23.5806 ms (end to end 23.5916 ms)
[I] mean: 23.118 ms (end to end 23.1271 ms)
[I] median: 23.1215 ms (end to end 23.1316 ms)
[I] percentile: 23.5413 ms at 99% (end to end 23.5557 ms at 99%)
[I] throughput: 0 qps
[I] walltime: 3.05278 s
[I] Enqueue Time
[I] min: 0.233643 ms
[I] max: 0.768311 ms
[I] median: 0.32074 ms
[I] GPU Compute
[I] min: 21.8194 ms
[I] max: 22.8459 ms
[I] mean: 22.4053 ms
[I] median: 22.4059 ms
[I] percentile: 22.8337 ms at 99%
[I] total compute time: 2.9575 s
&&&& PASSED TensorRT.trtexec # trtexec --loadEngine=fcn_resnet18.onnx.1.1.7103.GPU.FP16.engine --fp16 --shapes=image.1:3x1024x2048 --explicitBatch
./segnet.py --network=fcn-resnet18-cityscapes-2048x1024 in.png out.png
...
[TRT] CUDA engine context initialized on device GPU:
[TRT] -- layers 25
[TRT] -- maxBatchSize 1
[TRT] -- workspace 0
[TRT] -- deviceMemory 132862976
[TRT] -- bindings 2
...
[TRT] binding to input 0 input_0 binding index: 0
[TRT] binding to input 0 input_0 dims (b=1 c=3 h=1024 w=2048) size=25165824
[TRT] binding to output 0 output_0 binding index: 1
[TRT] binding to output 0 output_0 dims (b=1 c=21 h=32 w=64) size=172032
...
[TRT] ------------------------------------------------
[TRT] Timing Report networks/FCN-ResNet18-Cityscapes-2048x1024/fcn_resnet18.onnx
[TRT] ------------------------------------------------
[TRT] Pre-Process CPU 0.06714ms CUDA 0.43469ms
[TRT] Network CPU 23.90803ms CUDA 23.33222ms
[TRT] Post-Process CPU 0.54134ms CUDA 0.53978ms
[TRT] Visualize CPU 0.02618ms CUDA 0.25395ms
[TRT] Total CPU 24.54268ms CUDA 24.56064ms
[TRT] ------------------------------------------------
CUSTOM-FCN-ResNet18-Model-2048x1024
[04/13/2021-22:41:26] [I] Host Latency
[04/13/2021-22:41:26] [I] min: 48.5713 ms (end to end 48.5793 ms)
[04/13/2021-22:41:26] [I] max: 49.9407 ms (end to end 49.9504 ms)
[04/13/2021-22:41:26] [I] mean: 49.2631 ms (end to end 49.3496 ms)
[04/13/2021-22:41:26] [I] median: 49.2493 ms (end to end 49.3837 ms)
[04/13/2021-22:41:26] [I] percentile: 49.9407 ms at 99% (end to end 49.9504 ms at 99%)
[04/13/2021-22:41:26] [I] throughput: 0 qps
[04/13/2021-22:41:26] [I] walltime: 2.66488 s
[04/13/2021-22:41:26] [I] Enqueue Time
[04/13/2021-22:41:26] [I] min: 0.815918 ms
[04/13/2021-22:41:26] [I] max: 1.61914 ms
[04/13/2021-22:41:26] [I] median: 0.971466 ms
[04/13/2021-22:41:26] [I] GPU Compute
[04/13/2021-22:41:26] [I] min: 44.8051 ms
[04/13/2021-22:41:26] [I] max: 46.0015 ms
[04/13/2021-22:41:26] [I] mean: 45.3891 ms
[04/13/2021-22:41:26] [I] median: 45.3817 ms
[04/13/2021-22:41:26] [I] percentile: 46.0015 ms at 99%
[04/13/2021-22:41:26] [I] total compute time: 2.45101 s
&&&& PASSED TensorRT.trtexec # trtexec --loadEngine=CUSTOM-FCN-ResNet18-Model-2048x1024.onnx.1.1.7103.GPU.FP16.engine --fp16 --shapes=image.1:3x1024x2048 --explicitBatch
./segnet.py --model=CUSTOM-FCN-ResNet18-Model-2048x1024.onnx --labels=classes.txt --colors=colors.txt --input_blob=image.1 --output_blob=391 in.png out.png --filter-mode=point
...
[TRT] CUDA engine context initialized on device GPU:
[TRT] -- layers 72
[TRT] -- maxBatchSize 1
[TRT] -- workspace 0
[TRT] -- deviceMemory 167380480
[TRT] -- bindings 2
...
[TRT] binding to input 0 image.1 binding index: 0
[TRT] binding to input 0 image.1 dims (b=1 c=3 h=1024 w=2048) size=25165824
[TRT] binding to output 0 391 binding index: 1
[TRT] binding to output 0 391 dims (b=1 c=12 h=1024 w=2048) size=100663296
...
[TRT] ------------------------------------------------
[TRT] Timing Report CUSTOM-FCN-ResNet18-Model-2048x1024.onnx
[TRT] ------------------------------------------------
[TRT] Pre-Process CPU 0.05667ms CUDA 0.44134ms
[TRT] Network CPU 615.24158ms CUDA 614.61536ms
[TRT] Post-Process CPU 311.16263ms CUDA 311.36148ms
[TRT] Visualize CPU 0.01802ms CUDA 0.25498ms
[TRT] Total CPU 926.47894ms CUDA 926.67322ms
[TRT] ------------------------------------------------
Once again, you’re right! Thanks for your answer. On every subsequent frame, the network part drops to 45-ish ms. Post-process remains at 311ms, but that should be easier for me to look into.
That post-processing step is where the class with the highest probability is selected for each pixel. It just runs on the CPU, which is normally fine for my models because they don’t do the deconv/upsample inside the model, so the probability/scores grid is low-resolution. Your probability/scores grid, however, is the same resolution as the input, because you are doing the deconv/upsample inside your model.
You could add more timing around that loop in the C++ code to see if that is indeed what is slowing it down.
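To get a rough feel for how much bigger that per-pixel argmax is at full resolution, here is a small NumPy sketch (purely illustrative; the actual post-processing in jetson-inference is C++ code, not NumPy):

import time
import numpy as np

def time_argmax(c, h, w):
    # Per-pixel argmax over the class scores, analogous to the CPU post-processing step.
    scores = np.random.rand(c, h, w).astype(np.float32)
    start = time.perf_counter()
    np.argmax(scores, axis=0)
    return (time.perf_counter() - start) * 1000.0

# FCN-ResNet18-Cityscapes outputs a 21x32x64 score grid (deconv left out of the model),
# while the custom model outputs a full-resolution 12x1024x2048 grid.
print("low-res grid  : %.2f ms" % time_argmax(21, 32, 64))
print("full-res grid : %.2f ms" % time_argmax(12, 1024, 2048))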