Issue with TensorRT 7.1.3 on Jetson AGX

Cross-posting from “TRT 7.1.3 - invalid results, but only on Jetpack”

We’re noticing that SSD models (such as the model mentioned here: Release Notes :: NVIDIA Deep Learning TensorRT Documentation) generate incorrect output, but only when using TRT 7.1.3, and only on Jetson. The same model with the same code generates the expected output with TRT 7.1.3 on all x86_64 platforms, as well as with TRT 6.0.1.5 on the same Jetson hardware.

This only seems to apply to SSD (we’ve seen the problem with the publicly available model above, as well as with four of our proprietary models); our YOLO model seems to work fine in this scenario.

Any help? At this point it feels like a JetPack bug. With the model above, ‘fc6’ is the first layer that produces different results in TRT 7 compared to TRT 6. There’s nothing special about that layer or that model.
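
For context, one way to see where the outputs diverge is to mark an intermediate blob such as ‘fc6’ as an extra network output before building the engine and dump it from both TRT versions. The sketch below uses the legacy Caffe-parser/builder API (deprecated but still present in TRT 7); the function name and workspace size are illustrative, and it is not taken from our actual code.

// Sketch: mark 'fc6' as an additional network output so its values can be
// dumped and compared between the TRT 6 and TRT 7 builds of the same model.
#include <NvInfer.h>
#include <NvCaffeParser.h>

nvinfer1::ICudaEngine* buildWithDebugOutput(nvinfer1::ILogger& logger,
                                            const char* prototxt,
                                            const char* caffemodel)
{
    auto* builder = nvinfer1::createInferBuilder(logger);
    auto* network = builder->createNetwork();   // implicit-batch network, as used by the Caffe parser
    auto* parser  = nvcaffeparser1::createCaffeParser();

    const auto* blobs = parser->parse(prototxt, caffemodel, *network,
                                      nvinfer1::DataType::kFLOAT);

    // Regular SSD output blob (name as in the sampleSSD prototxt) plus the
    // suspect intermediate blob, so 'fc6' shows up as an extra binding.
    network->markOutput(*blobs->find("detection_out"));
    network->markOutput(*blobs->find("fc6"));

    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(1 << 28);
    return builder->buildCudaEngine(*network);
}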

Hi,

Do you run TensorRT on the GPU or on DLA?
To check this issue further, could you share the source code and model with us?

Thanks.

GPU. The model, as I’ve mentioned, is the one available from https://drive.google.com/file/d/0BzKzrI_SkD1_WVVTSmQxU0dVRzA/view (see the TRT 5.0.2 release notes link in the OP).

As for the code, I’ve tried to isolate and simplify it as much as possible (for example, by removing async CUDA operations); none of this changes the outcome. I’m attaching the latest version.
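
The simplified path is fully synchronous (no streams, no enqueue) and looks roughly like the pattern below; this is a minimal sketch assuming one input and one output binding, not the attached file itself.

// Sketch of a fully synchronous inference call, assuming an implicit-batch
// engine with the input at binding 0 and the output at binding 1.
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <vector>

std::vector<float> inferSync(nvinfer1::ICudaEngine& engine,
                             const std::vector<float>& input,
                             size_t outputCount)
{
    auto* context = engine.createExecutionContext();

    void* bindings[2] = {nullptr, nullptr};
    cudaMalloc(&bindings[0], input.size() * sizeof(float));
    cudaMalloc(&bindings[1], outputCount * sizeof(float));

    cudaMemcpy(bindings[0], input.data(), input.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    // Blocking execute() instead of enqueue() rules out any async issue.
    context->execute(1 /*batchSize*/, bindings);

    std::vector<float> output(outputCount);
    cudaMemcpy(output.data(), bindings[1], outputCount * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaFree(bindings[0]);
    cudaFree(bindings[1]);
    context->destroy();
    return output;
}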

The one simplification I wasn’t able to make is linking statically against TensorRT, as there are still unresolved symbols in the final binary. I’ve opened a separate issue for that on the TRT forum.

sampleTRTLib.cpp (7.4 KB)

I’ve tried to put all the relevant pieces together in a single repo: GitHub - w3sip/sampleTRT

I switch between TRT 6 and TRT 7 by toggling these two pairs of lines:

set (TRTVER 6)
set (CUDAVER 100)
# set (TRTVER 7)
# set (CUDAVER 102)

The output folders are then uploaded to the TX2 in their entirety and run using the following script:

export LD_DEBUG=libs

export CONFIGNAME=bin7-dynamic
pushd $CONFIGNAME
export LD_LIBRARY_PATH=`pwd`
./sampleTRT > ../$CONFIGNAME.log
popd


export CONFIGNAME=bin6-dynamic
pushd $CONFIGNAME
export LD_LIBRARY_PATH=`pwd`
./sampleTRT > ../$CONFIGNAME.log
popd

I’m attaching the resulting log from each run. Again, notice the distinct difference in outputs; the same model and input are used in both cases.

bin6-dynamic.log (158.8 KB)
bin7-dynamic.log (230.4 KB)

Thanks for your data.

We are working to reproduce this issue internally.
We will get back to you when we have any progress.

The problem seems to be around engine serialization/deserialization.
If we build the CUDA engine and use it right away, the problem does not occur.

If we serialize the engine with ICudaEngine::serialize() and then deserialize it with IRuntime::deserializeCudaEngine(ihm->data(), ihm->size(), nullptr), the incorrect output described above appears.

The stock sample_ssd reproduces the issue as well when replacing

auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());

with

// Serialize the freshly built engine to host memory, then immediately
// deserialize it; this is the same path an engine cache would take.
IHostMemory* ihm = mEngine->serialize();
IRuntime* runtime = createInferRuntime(sample::gLogger.getTRTLogger());
ICudaEngine* engine = runtime->deserializeCudaEngine(ihm->data(), ihm->size(), nullptr);

// Create the execution context from the deserialized engine instead of mEngine.
auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(engine->createExecutionContext());

Any updates? A workaround will do, though we do need to be able to cache the CUDA engine.
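
For completeness, the caching we need is just serialize-to-file plus deserialize-on-load, i.e. exactly the sequence that goes wrong. A minimal sketch (file handling and names purely illustrative):

// Sketch of the engine-caching path we rely on: write the serialized engine
// to disk once, then deserialize it on later runs.
#include <NvInfer.h>
#include <fstream>
#include <vector>

void saveEngine(nvinfer1::ICudaEngine& engine, const char* path)
{
    nvinfer1::IHostMemory* blob = engine.serialize();
    std::ofstream out(path, std::ios::binary);
    out.write(static_cast<const char*>(blob->data()), blob->size());
    blob->destroy();
}

nvinfer1::ICudaEngine* loadEngine(nvinfer1::IRuntime& runtime, const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());
    return runtime.deserializeCudaEngine(blob.data(), blob.size(), nullptr);
}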

Could this somehow be a problem? We’re attempting to build TRT from source so we can debug this properly, and we’re seeing these warnings:

/src/.build/TensorRT/parsers/caffe/../common/parserUtils.h:77:13: warning: enumeration value 'kBOOL' not handled in switch [-Wswitch]
    switch (t)
            ^
/src/.build/TensorRT/parsers/caffe/../common/parserUtils.h:99:13: warning: enumeration value 'kBOOL' not handled in switch [-Wswitch]
    switch (dt)
            ^
2 warnings generated.
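
For reference, -Wswitch here only means that the kBOOL enumerator, which was added to nvinfer1::DataType in TRT 7, is not listed in a switch that also has no default label. The pattern is roughly the one below (a sketch, not the actual parserUtils.h code):

// Sketch of the pattern -Wswitch is flagging: a switch over nvinfer1::DataType
// with no case for kBOOL (new in TensorRT 7) and no default label.
#include <NvInfer.h>

static unsigned int elementSize(nvinfer1::DataType t)
{
    switch (t)
    {
    case nvinfer1::DataType::kINT32:
    case nvinfer1::DataType::kFLOAT: return 4;
    case nvinfer1::DataType::kHALF:  return 2;
    case nvinfer1::DataType::kINT8:  return 1;
    // No kBOOL case here, so the compiler warns; adding one (or a default)
    // would silence the warning.
    }
    return 0;
}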