It seems that a particular ONNX model causes crashes on Jetson Nano (Maxwell). The problem reproduces quite consistently and happens at the model inference stage, after the TensorRT engine has been generated successfully. The failure becomes visible in subsequent CUDA operations, but under gdb an exception thrown during the inference stage can be observed. The TensorRT version is 8.2.1-1+cuda10.2.
The GDB stack trace looks like this:
[MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +8, now: CPU 0, GPU 41 (MiB)
Thread 1 "main" hit Catchpoint 1 (exception thrown), 0x0000007fadbf1f20 in __cxa_throw () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
(gdb)
(gdb) bt
#0 0x0000007fadbf1f20 in __cxa_throw () from /usr/lib/aarch64-linux-gnu/libstdc++.so.6
#1 0x0000007fae7ce82c in nvinfer1::Lobber<nvinfer1::CudaRuntimeError>::operator()(char const*, char const*, int, int, nvinfer1::ErrorCode, char const*) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.8
#2 0x0000007faebbdbc8 in ?? () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.8
#3 0x0000007faebbdc28 in ?? () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.8
#4 0x0000007faebbdc98 in ?? () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.8
#5 0x0000007faed55c1c in ?? () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.8
#6 0x0000007faeda2720 in ?? () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.8
#7 0x0000007faeda38e0 in ?? () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.8
#8 0x0000005555556950 in nvinfer1::IExecutionContext::executeV2 (this=<optimized out>, bindings=0x7fffffec98) at /usr/include/aarch64-linux-gnu/NvInferRuntime.h:2275
(gdb) continue
Continuing.
1: [genericReformat.cu::executeMemcpy::1334] Error Code 1: Cuda Runtime (invalid argument)
app: app.cc:154: int main(): Assertion `ret == cudaSuccess' failed.
Thread 1 "main" received signal SIGABRT, Aborted.
Logs in dmesg:
[18364.512335] nvgpu: 57000000.gpu gk20a_fifo_handle_mmu_fault_locked:1723 [ERR] mmu fault on engine 0, engine subid 0 (gpc), client 1 (t1 0), addr 0x7f87cf3000, type 3 (va limit viol), access_type 0x00000001,inst_ptr 0x7feccf000
[18364.537265] nvgpu: 57000000.gpu gk20a_fifo_set_ctx_mmu_error_tsg:1543 [ERR] TSG 4 generated a mmu fault
[18364.546873] nvgpu: 57000000.gpu gk20a_fifo_set_ctx_mmu_error_ch:1532 [ERR] channel 507 generated a mmu fault
[18364.557045] nvgpu: 57000000.gpu nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 31 for ch 507
[18364.567460] nvgpu: 57000000.gpu gk20a_fifo_set_ctx_mmu_error_ch:1532 [ERR] channel 506 generated a mmu fault
[18364.577660] nvgpu: 57000000.gpu nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 31 for ch 506
[18364.588149] nvgpu: 57000000.gpu gk20a_fifo_set_ctx_mmu_error_ch:1532 [ERR] channel 505 generated a mmu fault
[18364.598270] nvgpu: 57000000.gpu nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 31 for ch 505
[18364.608556] nvgpu: 57000000.gpu gk20a_fifo_set_ctx_mmu_error_ch:1532 [ERR] channel 504 generated a mmu fault
[18364.618665] nvgpu: 57000000.gpu nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 31 for ch 504
trtexec somehow seems to work normally. It does use CUDA streams, however, while the problematic code sample does not, and not using streams is still a valid use of the API.
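For reference, the difference comes down to which execution call is used. A minimal sketch of the two call styles (the helper names below are mine, not from the sample):

#include <NvInferRuntime.h>
#include <cuda_runtime_api.h>

// Asynchronous path, roughly what trtexec does: work is enqueued on a CUDA stream.
void run_with_stream(nvinfer1::IExecutionContext* context, void** bindings) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    context->enqueueV2(bindings, stream, nullptr);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}

// Synchronous path used by the problematic sample: no stream at all,
// executeV2() blocks until inference has finished.
void run_without_stream(nvinfer1::IExecutionContext* context, void** bindings) {
    context->executeV2(bindings);
}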
The code sample can be reduced even further by removing the output host buffer and the corresponding cudaMemcpy:
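Roughly, the reduced version looks like this (a sketch only: it assumes an ICudaEngine has already been created from the ONNX model, that all bindings are float32, and that the shapes are static; error handling mirrors the assert in the original sample):

#include <NvInferRuntime.h>
#include <cuda_runtime_api.h>
#include <cassert>
#include <vector>

// Runs inference with device-only buffers: the input is filled with
// cudaMemset and the output is never copied back to the host.
void infer_device_only(nvinfer1::ICudaEngine* engine) {
    nvinfer1::IExecutionContext* context = engine->createExecutionContext();

    std::vector<void*> bindings(engine->getNbBindings(), nullptr);
    for (int i = 0; i < engine->getNbBindings(); ++i) {
        nvinfer1::Dims dims = engine->getBindingDimensions(i);
        size_t count = 1;
        for (int j = 0; j < dims.nbDims; ++j) count *= dims.d[j];
        cudaError_t ret = cudaMalloc(&bindings[i], count * sizeof(float));
        assert(ret == cudaSuccess);
        if (engine->bindingIsInput(i)) {
            ret = cudaMemset(bindings[i], 0, count * sizeof(float));
            assert(ret == cudaSuccess);
        }
    }

    // Synchronous inference, no CUDA stream involved.
    bool ok = context->executeV2(bindings.data());
    assert(ok);
    cudaError_t ret = cudaDeviceSynchronize();
    assert(ret == cudaSuccess);

    for (void* p : bindings) cudaFree(p);
}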
This way, no CPU buffers are used at all: the input GPU buffer is initialized with cudaMemset, and the inference results are never copied back from the GPU output buffer to host memory. But there are still errors:
1: [defaultAllocator.cpp::deallocate::35] Error Code 1: Cuda Runtime (unspecified launch failure)
1: [defaultAllocator.cpp::deallocate::35] Error Code 1: Cuda Runtime (unspecified launch failure)
1: [cudaResources.cpp::~ScopedCudaStream::47] Error Code 1: Cuda Runtime (unspecified launch failure)
1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (unspecified launch failure)
1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (unspecified launch failure)
@AastaLLL I’ve updated the code sample. It is even shorter now and uses a serialized TensorRT engine instead of the ONNX model. I think it is as straightforward as it can be with regard to TensorRT API usage. The same errors still appear.
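In case it helps, the setup now amounts to roughly the following (a sketch; the engine file name is a placeholder and error handling is omitted):

#include <NvInferRuntime.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

// Minimal logger required by the TensorRT runtime.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cerr << msg << std::endl;
    }
};

int main() {
    // Read the pre-built serialized engine from disk.
    std::ifstream file("model.engine", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                           std::istreambuf_iterator<char>());

    Logger logger;
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(blob.data(), blob.size());

    // ... allocate device buffers and call executeV2() as in the sketch above ...
    (void)engine;
    return 0;
}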
I think I’ve located the problem. The TensorRT ONNX parser for some reason decides that the network has 3 bindings instead of the expected 2:
[0] data
[1] out
[2] out_before_shuffle
In retrospect, this is an obvious cause for the crash: TensorRT expects 3 valid device pointers but gets only 2, so random data is interpreted as a pointer.
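For reference, the list above can be reproduced by enumerating the engine bindings, e.g. with a small helper like this (a sketch):

#include <NvInferRuntime.h>
#include <iostream>

// Prints every binding reported by the engine, marking inputs and outputs.
void print_bindings(const nvinfer1::ICudaEngine& engine) {
    for (int i = 0; i < engine.getNbBindings(); ++i) {
        std::cout << "[" << i << "] " << engine.getBindingName(i)
                  << (engine.bindingIsInput(i) ? " (input)" : " (output)")
                  << std::endl;
    }
}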
The problem is that there is only 1 input and 1 output in the ONNX model, which is confirmed by the onnx Python package:
import onnx
m = onnx.load('model.onnx')
print([e.name for e in m.graph.input])
print([e.name for e in m.graph.output])
['data']
['out']
@AastaLLL So, it seems that there is a bug in the TensorRT ONNX parser.
Well, I need to correct this a bit. INetworkDefinition::getNbInputs() and INetworkDefinition::getNbOutputs() produce the expected results, so the ONNX parser is not the issue. Moreover, there is no tensor named out_before_shuffle in the original model at all.
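For completeness, this is roughly what that check looks like (a sketch; any trivial ILogger implementation will do):

#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cstdint>
#include <iostream>

// Parses the ONNX model and prints the network inputs and outputs
// as seen by INetworkDefinition.
void check_network_io(const char* onnx_path, nvinfer1::ILogger& logger) {
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(logger);
    const uint32_t flags =
        1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(flags);
    nvonnxparser::IParser* parser = nvonnxparser::createParser(*network, logger);
    parser->parseFromFile(onnx_path,
                          static_cast<int>(nvinfer1::ILogger::Severity::kWARNING));

    std::cout << "inputs: " << network->getNbInputs() << std::endl;
    for (int i = 0; i < network->getNbInputs(); ++i)
        std::cout << "  " << network->getInput(i)->getName() << std::endl;

    std::cout << "outputs: " << network->getNbOutputs() << std::endl;
    for (int i = 0; i < network->getNbOutputs(); ++i)
        std::cout << "  " << network->getOutput(i)->getName() << std::endl;
}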
It looks like TensorRT created additional layers for internal purposes and for some reason one of them was (incorrectly) marked as an output. Here is the relevant part of the verbose build log:
After tensor merging: 21 layers
Eliminating concatenation Concat_28
Generating copy for (Unnamed Layer* 29) [Fully Connected]_output to out_before_shuffle because input does not support striding.
Generating copy for (Unnamed Layer* 32) [Fully Connected]_output to out_before_shuffle because input does not support striding.
After concat removal: 22 layers
Graph construction and optimization completed in 0.00764541 seconds.
....
Layer(Reformat): (Unnamed Layer* 29) [Fully Connected]_output copy, Tactic: 0, (Unnamed Layer* 29) [Fully Connected]_output[Half(1,2,1,1)] -> out_before_shuffle[Float(1,2,1,1)]
Layer(Reformat): (Unnamed Layer* 32) [Fully Connected]_output copy, Tactic: 0, (Unnamed Layer* 32) [Fully Connected]_output[Half(1,2,1,1)] -> out_before_shuffle[Float(1,2,1,1)]
[I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +9, now: CPU 0, GPU 42 (MiB)
[I] Using random values for input data
[I] Created input binding for data with dimensions 1x3x56x128
[I] Using random values for output out
[I] Created output binding for out with dimensions 2x2
[I] Using random values for output out_before_shuffle
[I] Created output binding for out_before_shuffle with dimensions 2x2x1x1
[I] Starting inference
It can be seen that there are 3 bindings, including 2 outputs (out and out_before_shuffle). Please tell me if you need the complete log.
If I understand correctly, Jetson Nano (Maxwell) is still supported (and will be available until 2027), but the support is limited to bug fixes and security issues, not new features. So, should we expect a fix for the bug described in this thread?