Issue with TensorRT 7.1.3 on Jetson AGX

Crossposting from TRT 7.1.3 - invalid results, but only on Jetpack

We’re noticing that SSD models (such as the model mentioned here: https://docs.nvidia.com/deeplearning/tensorrt/release-notes/tensorrt-5.html#rel_5-0-2) generate incorrect output, but only when using TRT 7.1.3, and only on Jetson. The same model with the same code generates the expected output with TRT 7.1.3 on all x86_64 platforms, as well as with TRT 6.0.1.5 on the same Jetson hardware.

This only seems to affect SSD (we’ve seen the problem with the publicly available model above, as well as with four of our proprietary models); our YOLO model works fine in the same scenario.

Any help? At this point it feels like a JetPack bug. With the model above, ‘fc6’ is the first layer that produces different results in TRT 7 compared to TRT 6. There’s nothing special about that layer, or that model.
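
For reference, here is roughly how the ‘fc6’ blob can be exposed for that comparison – a simplified sketch, not our exact code; the logger is minimal, the file names are placeholders, and error handling is omitted:

// Sketch only: mark the intermediate 'fc6' Caffe blob as an extra engine
// output so its values can be dumped and diffed between TRT 6 and TRT 7.
#include <NvInfer.h>
#include <NvCaffeParser.h>
#include <iostream>

class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
} gLogger;

int main()
{
    auto* builder = nvinfer1::createInferBuilder(gLogger);
    auto* network = builder->createNetworkV2(0); // implicit batch, as in the Caffe path
    auto* parser  = nvcaffeparser1::createCaffeParser();

    const auto* blobs = parser->parse("deploy.prototxt", "model.caffemodel",
                                      *network, nvinfer1::DataType::kFLOAT);

    // 'detection_out' is the regular SSD output; additionally expose 'fc6'
    // so the two TRT versions can be compared layer by layer.
    network->markOutput(*blobs->find("detection_out"));
    if (auto* fc6 = blobs->find("fc6"))
        network->markOutput(*fc6);

    // ... build the engine, run inference, dump the extra binding ...
    return 0;
}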

Hi,

Do you use the GPU or DLA for TensorRT?
To check this issue further, could you share the source code and model with us?

Thanks.

GPU. The model, as I’ve mentioned, is the one from models_VGGNet_VOC0712_SSD_300x300.tar.gz on Google Drive (see the TRT 5.0.2 release notes link in the OP).

As for the code, I’ve tried to isolate and simplify it as much as possible (for example, by removing async CUDA operations); none of this changes the outcome. I’m attaching the latest version.

The only further simplification I wasn’t able to make is statically linking against TensorRT, as there are still unresolved symbols in the final binary. I’ve opened a separate issue for that on the TRT forum.

sampleTRTLib.cpp (7.4 KB)

I’ve tried to put all the relevant pieces together in a single repo: https://github.com/w3sip/sampleTRT
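
For what it’s worth, the inference path in the simplified version is just the standard synchronous one – a rough sketch below; binding names and buffer sizes are illustrative, not the attached file verbatim:

// Sketch of a fully synchronous inference path (no streams, no async copies).
// Binding names ("data", "detection_out") are the usual SSD Caffe blob names.
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <vector>

void runOnce(nvinfer1::ICudaEngine& engine,
             const std::vector<float>& hostInput,
             std::vector<float>& hostOutput)
{
    auto* context = engine.createExecutionContext();

    const int inputIndex  = engine.getBindingIndex("data");
    const int outputIndex = engine.getBindingIndex("detection_out");

    void* buffers[2] = {nullptr, nullptr};
    cudaMalloc(&buffers[inputIndex],  hostInput.size()  * sizeof(float));
    cudaMalloc(&buffers[outputIndex], hostOutput.size() * sizeof(float));

    // Plain blocking copies and execution; switching to this from the async
    // path made no difference to the wrong results on Jetson.
    cudaMemcpy(buffers[inputIndex], hostInput.data(),
               hostInput.size() * sizeof(float), cudaMemcpyHostToDevice);
    context->execute(1, buffers); // batch size 1, implicit-batch engine
    cudaMemcpy(hostOutput.data(), buffers[outputIndex],
               hostOutput.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(buffers[inputIndex]);
    cudaFree(buffers[outputIndex]);
    context->destroy();
}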

I switch between 6 and 7 by toggling these two pairs of lines:

set (TRTVER 6)
set (CUDAVER 100)
# set (TRTVER 7)
# set (CUDAVER 102)

The output folders are then uploaded to the TX2 in their entirety and run using the following script:

export LD_DEBUG=libs

export CONFIGNAME=bin7-dynamic
pushd $CONFIGNAME
export LD_LIBRARY_PATH=`pwd`
./sampleTRT > ../$CONFIGNAME.log
popd


export CONFIGNAME=bin6-dynamic
pushd $CONFIGNAME
export LD_LIBRARY_PATH=`pwd`
./sampleTRT > ../$CONFIGNAME.log
popd

I’m attaching the resulting log from each run. Again, note the distinct difference in outputs – the same model and input are used in both cases.

bin6-dynamic.log (158.8 KB)
bin7-dynamic.log (230.4 KB)

Thanks for your data.

We are working to reproduce this issue internally and will get back to you once we have any progress.

The problem seems to be around engine serialization/deserialization.
If we build the CUDA engine and use it right away, the problem does not occur.

If we serialize it with ICudaEngine::serialize() and then deserialize it with IRuntime::deserializeCudaEngine(ihm->data(), ihm->size(), nullptr), the incorrect output described above appears.

The stock sample_ssd reproduces the issue as well if we replace

auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());

with

IHostMemory* ihm = mEngine->serialize();
IRuntime* runtime = createInferRuntime(sample::gLogger.getTRTLogger());
ICudaEngine* engine = runtime->deserializeCudaEngine(ihm->data(), ihm->size(), nullptr);

auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(engine->createExecutionContext());

Any updates, guys? A workaround will do, though we do need to be able to cache the CUDA engine …
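
For context, the caching we need is just the usual plan-file round trip – a minimal sketch with placeholder file names; the deserialization step is exactly the one that goes wrong on Jetson:

// Sketch of the engine cache we need: serialize the plan once, then reload
// it from disk on later runs. The file path is a placeholder.
#include <NvInfer.h>
#include <fstream>
#include <iterator>
#include <vector>

void saveEngine(nvinfer1::ICudaEngine& engine, const char* path)
{
    nvinfer1::IHostMemory* plan = engine.serialize();
    std::ofstream out(path, std::ios::binary);
    out.write(static_cast<const char*>(plan->data()), plan->size());
    plan->destroy();
}

nvinfer1::ICudaEngine* loadEngine(nvinfer1::IRuntime& runtime, const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> plan((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    // With TRT 7.1.3 on Jetson, the engine that comes out of this call
    // produces the wrong SSD results; the directly built engine does not.
    return runtime.deserializeCudaEngine(plan.data(), plan.size(), nullptr);
}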

Could this somehow be a problem? We’re attempting to build TRT from source so we can debug this properly – and seeing these warnings:

/src/.build/TensorRT/parsers/caffe/../common/parserUtils.h:77:13: warning: enumeration value 'kBOOL' not handled in switch [-Wswitch]
    switch (t)
            ^
/src/.build/TensorRT/parsers/caffe/../common/parserUtils.h:99:13: warning: enumeration value 'kBOOL' not handled in switch [-Wswitch]
    switch (dt)
            ^
2 warnings generated.

Hi,

It seems that you are hitting a similar issue to TRT engine - peculiar behaviour.
The issue has been fixed internally, and the fix will be available in our next major release for Jetson users.

Thanks.

Yes: with some help from the folks in the TensorRT-OSS repository on GitHub, we rebuilt nvinfer_plugin from the TensorRT-OSS 7.1.3 tag, and that solved the issue. However, it is unfortunate that this has been known since at least November and still hasn’t found its way into a patch – let alone into the current version’s release notes. That would have saved us A LOT of time.
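
For anyone else hitting this: the fix for us was simply swapping in the rebuilt libnvinfer_plugin.so. For reference, the usual plugin registration before deserializing looks like this (a minimal sketch; the helper name is illustrative):

// Sketch: register the TensorRT plugin creators from libnvinfer_plugin.so
// before deserializing an engine that uses the SSD plugins. After the OSS
// 7.1.3 rebuild, this picks up the rebuilt library, assuming it is the one
// found first on LD_LIBRARY_PATH.
#include <NvInfer.h>
#include <NvInferPlugin.h>

bool registerPlugins(nvinfer1::ILogger& logger)
{
    // Registers all plugin creators shipped in libnvinfer_plugin.so with the
    // global plugin registry; call before IRuntime::deserializeCudaEngine.
    return initLibNvInferPlugins(&logger, "");
}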

Hi,

Sorry for the inconvenience.
The fix is integrated into TensorRT v7.2, which is not yet available for Jetson users.

We tried to verify your model with our internal build,
but failed to read the file with the error below:

[03/10/2021-16:34:05] [I] === Reporting Options ===
[03/10/2021-16:34:05] [I] Verbose: Disabled
[03/10/2021-16:34:05] [I] Averages: 10 inferences
[03/10/2021-16:34:05] [I] Percentile: 99
[03/10/2021-16:34:05] [I] Dump output: Disabled
[03/10/2021-16:34:05] [I] Profile: Disabled
[03/10/2021-16:34:05] [I] Export timing to JSON file:
[03/10/2021-16:34:05] [I] Export output to JSON file:
[03/10/2021-16:34:05] [I] Export profile to JSON file:
[03/10/2021-16:34:05] [I]
[03/10/2021-16:34:06] [E] [TRT] CaffeParser: Could not parse binary model file
[03/10/2021-16:34:06] [E] [TRT] CaffeParser: Could not parse model file
[03/10/2021-16:34:06] [E] Failed to parse caffe model or prototxt, tensors blob not found
[03/10/2021-16:34:06] [E] Parsing model failed
[03/10/2021-16:34:06] [E] Engine creation failed
[03/10/2021-16:34:06] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --deploy=deploy.prototxt --model=model.caffemodel --output=detection_out

Do you have any idea about this?
The same error also occurs with TensorRT 7.1.3.

Thanks.