Jetson TX2 NX crashes at "1024 Maximum number of threads per block" during trtexec inference

Description

I tried to convert a Faster R-CNN model into a TensorRT 8.2 engine for deployment on a Jetson TX2 NX (torch → onnx → trtexec trt). I recompiled trtexec, the plugin library and the ONNX parser, since TensorRT 8.2 does not natively support RoiAlign. The resulting trtexec could build an engine from my ONNX model, but crashed during inference with Internal Error (Assertion status == kSTATUS_SUCCESS failed. ). I later traced the crash to the TX2 NX apparently not accepting a "Maximum number of threads per block" of 1024. However, the TX2 NX specs clearly state that it supports up to 1024 threads per block (as reported by deviceQuery). What could be the cause of this inconsistency?
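One possible explanation, which nothing in this post confirms, is that the 1024 figure from deviceQuery is a device-wide upper bound, while the block size a given kernel can actually launch with also depends on its register use per thread (the CUDA runtime reports this per kernel through cudaFuncGetAttributes as maxThreadsPerBlock). The sketch below only illustrates that arithmetic. The helper name, the per-block register budgets and the 40-registers-per-thread figure are all assumptions for illustration; the real budget is listed by deviceQuery as "Total number of registers available per block", and the real register count of the plugin kernel has not been measured here.

```python
# Hypothetical helper: estimate the largest block size a kernel can launch with,
# given its per-thread register use and the device's per-block register budget.
WARP_SIZE = 32


def max_launchable_threads(regs_per_thread, regs_per_block, device_max_threads=1024):
    """Simplified estimate; ignores register allocation granularity and shared memory."""
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    by_registers = regs_per_block // regs_per_thread
    # Threads are scheduled in whole warps, so round down to a multiple of 32.
    by_registers -= by_registers % WARP_SIZE
    return min(by_registers, device_max_threads)


# Assumed budgets; verify both against deviceQuery on the actual devices.
ASSUMED_REGS_PER_BLOCK = {
    "GTX 1080 (sm_61)": 65536,
    "Tegra X2 (sm_62)": 32768,
}

if __name__ == "__main__":
    regs_per_thread = 40  # illustrative value, not measured from any real kernel
    for name, budget in ASSUMED_REGS_PER_BLOCK.items():
        print(name, max_launchable_threads(regs_per_thread, budget))
```

Under these assumed numbers, a kernel hard-coded to launch 1024 threads per block would fit on the desktop GPU but exceed the register budget on the Tegra X2, which would match a plugin that works on the GTX 1080 and fails on the TX2 NX.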

Environment

TensorRT Version: 8.2.1.8 (JetPack 4.6.3)
GPU Type: 1 Jetson TX2 NX
CUDA Version: 10.2
CUDNN Version: 8.0.0
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.10.8
PyTorch Version (if applicable): 1.13.1
Torchvision Version (if applicable): 0.14.1

Relevant Files

I am happy to provide relevant files (e.g., onnx model and TensorRT OSS package, etc.) via DM.

Steps To Reproduce

Context:
My model is a torchvision Faster R-CNN model in which I replaced the backbone with ResNet10 and configured the detection head to predict boxes of a single category (plus background). The trained model was first converted to ONNX format via torch.onnx.export(). I mainly tested with opset_version==11 (I also experimented with other versions, but all led to the same result).
Jetson TX2 is officially compatible up to TensorRT 8.2 (JetPack 4.6.3), which does not natively support the RoiAlign op used in Faster R-CNN. Therefore, I manually added roiAlignPlugin from the official TensorRT OSS release/8.5 and then recompiled the relevant .so libraries and trtexec.

Step 1 – Test on GTX 1080.
I adapted the code of roiAlignPlugin and the ONNX parser from TensorRT OSS 8.5 into my TensorRT OSS 8.2 project (i.e., TRT_OSSPATH). I ran the following to build the new libnvinfer_plugin.so.8, libnvonnxparser.so.8 and trtexec.

cd $TRT_OSSPATH

mkdir -p build && cd build

cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out -DCUDA_VERSION=11.8 -DGPU_ARCHS="61"

make -j$(nproc)

where TRT_LIBPATH is the path of the downloaded TensorRT-8.2.1.8.Linux.x86_64-gnu package. I confirmed that the trtexec recompiled in Step 1 produced correct detections on my GTX 1080.

Step 2 – Cross-compilation targeting Jetson TX2 NX

To deploy my model on the Jetson TX2 NX, I cross-compiled on my GTX 1080 machine, following "Example: Ubuntu 18.04 Cross-Compile for Jetson (aarch64) with cuda-10.2 (JetPack)" from the NVIDIA/TensorRT GitHub repository.

cd $TRT_OSSPATH

mkdir -p build && cd build

cmake .. -DCMAKE_TOOLCHAIN_FILE=$TRT_OSSPATH/cmake/toolchains/cmake_aarch64_jetson.toolchain -DTRT_LIB_DIR=$TRT_LIBPATH/lib -DTRT_OUT_DIR=`pwd`/out -DCUDA_VERSION=10.2 -DCUDNN_LIB=$TX2_CUDA_PATH/lib/libcudnn.so -DCUBLAS_LIB=$TX2_CUDA_PATH/lib/libcublas.so.10 -DCUBLASLT_LIB=$TX2_CUDA_PATH/lib/libcublasLt.so.10 -DCUDA_TOOLKIT_ROOT_DIR=$TX2_CUDA_PATH -DCUDNN_ROOT_DIR=$TX2_CUDNN_PATH -DCUDART_LIB=$TX2_CUDA_PATH/lib/libcudart.so -DCMAKE_CUDA_COMPILER=$TX2_CUDA_PATH/bin/nvcc -DCUDA_INCLUDE_DIRS=$TX2_CUDA_PATH/include -DGPU_ARCHS="62" -DTRT_PLATFORM_ID=aarch64

make -j$(nproc)

I had to pass many more CMake options than in Step 1 to get a successful build. Eventually I was able to build the new libnvinfer_plugin.so.8, libnvonnxparser.so.8 and trtexec targeting the Jetson TX2 NX.
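Before copying the cross-compiled files to the device, it can be worth confirming that they really are aarch64 binaries and not host x86_64 ones picked up by accident. The following is a minimal sketch of such a check, reading the e_machine field of the ELF header with the Python standard library only. The function names are hypothetical, and the check says nothing about whether the embedded CUDA code was built for the right GPU architecture.

```python
import struct

# e_machine values from the ELF specification.
EM_X86_64 = 62
EM_AARCH64 = 183


def elf_machine(header: bytes) -> int:
    """Return the e_machine field from the first bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[5] (EI_DATA): 1 = little-endian, 2 = big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_type.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return machine


def is_aarch64(path: str) -> bool:
    """Check whether the file at `path` is an ELF binary for aarch64."""
    with open(path, "rb") as f:
        return elf_machine(f.read(20)) == EM_AARCH64


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        print(path, "aarch64" if is_aarch64(path) else "not aarch64")
```

Saved under a name of your choice (for example check_arch.py, a hypothetical filename), it can be run against the files in the TRT_OUT_DIR directory, such as libnvinfer_plugin.so.8, libnvonnxparser.so.8 and trtexec.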

Step 3 – Test on Jetson TX2 NX.
I copied the above components onto the Jetson device and ran trtexec --onnx=model.onnx --saveEngine=model.trt. I obtained the following output. The engine was built successfully, but trtexec crashed in the inference stage.

[09/06/2022-13:00:11] [I] === Model Options ===

[09/06/2022-13:00:11] [I] Format: ONNX

[09/06/2022-13:00:11] [I] Model: model.onnx

[09/06/2022-13:00:11] [I] Output:

[09/06/2022-13:00:11] [I] === Build Options ===

[09/06/2022-13:00:11] [I] Max batch: explicit batch

[09/06/2022-13:00:11] [I] Workspace: 16 MiB

[09/06/2022-13:00:11] [I] minTiming: 1

[09/06/2022-13:00:11] [I] avgTiming: 8

[09/06/2022-13:00:11] [I] Precision: FP32

[09/06/2022-13:00:11] [I] Calibration:

[09/06/2022-13:00:11] [I] Refit: Disabled

[09/06/2022-13:00:11] [I] Sparsity: Disabled

[09/06/2022-13:00:11] [I] Safe mode: Disabled

[09/06/2022-13:00:11] [I] DirectIO mode: Disabled

[09/06/2022-13:00:11] [I] Restricted mode: Disabled

[09/06/2022-13:00:11] [I] Save engine:

[09/06/2022-13:00:11] [I] Load engine:

[09/06/2022-13:00:11] [I] Profiling verbosity: 0

[09/06/2022-13:00:11] [I] Tactic sources: Using default tactic sources

[09/06/2022-13:00:11] [I] timingCacheMode: local

[09/06/2022-13:00:11] [I] timingCacheFile:

[09/06/2022-13:00:11] [I] Input(s)s format: fp32:CHW

[09/06/2022-13:00:11] [I] Output(s)s format: fp32:CHW

[09/06/2022-13:00:11] [I] Input build shapes: model

[09/06/2022-13:00:11] [I] Input calibration shapes: model

[09/06/2022-13:00:11] [I] === System Options ===

[09/06/2022-13:00:11] [I] Device: 0

[09/06/2022-13:00:11] [I] DLACore:

[09/06/2022-13:00:11] [I] Plugins:

[09/06/2022-13:00:11] [I] === Inference Options ===

[09/06/2022-13:00:11] [I] Batch: Explicit

[09/06/2022-13:00:11] [I] Input inference shapes: model

[09/06/2022-13:00:11] [I] Iterations: 10

[09/06/2022-13:00:11] [I] Duration: 3s (+ 200ms warm up)

[09/06/2022-13:00:11] [I] Sleep time: 0ms

[09/06/2022-13:00:11] [I] Idle time: 0ms

[09/06/2022-13:00:11] [I] Streams: 1

[09/06/2022-13:00:11] [I] ExposeDMA: Disabled

[09/06/2022-13:00:11] [I] Data transfers: Enabled

[09/06/2022-13:00:11] [I] Spin-wait: Disabled

[09/06/2022-13:00:11] [I] Multithreading: Disabled

[09/06/2022-13:00:11] [I] CUDA Graph: Disabled

[09/06/2022-13:00:11] [I] Separate profiling: Disabled

[09/06/2022-13:00:11] [I] Time Deserialize: Disabled

[09/06/2022-13:00:11] [I] Time Refit: Disabled

[09/06/2022-13:00:11] [I] Skip inference: Disabled

[09/06/2022-13:00:11] [I] Inputs:

[09/06/2022-13:00:11] [I] === Reporting Options ===

[09/06/2022-13:00:11] [I] Verbose: Disabled

[09/06/2022-13:00:11] [I] Averages: 10 inferences

[09/06/2022-13:00:11] [I] Percentile: 99

[09/06/2022-13:00:11] [I] Dump refittable layers:Disabled

[09/06/2022-13:00:11] [I] Dump output: Disabled

[09/06/2022-13:00:11] [I] Profile: Disabled

[09/06/2022-13:00:11] [I] Export timing to JSON file:

[09/06/2022-13:00:11] [I] Export output to JSON file:

[09/06/2022-13:00:11] [I] Export profile to JSON file:

[09/06/2022-13:00:11] [I]

[09/06/2022-13:00:11] [I] === Device Information ===

[09/06/2022-13:00:11] [I] Selected Device: NVIDIA Tegra X2

[09/06/2022-13:00:11] [I] Compute Capability: 6.2

[09/06/2022-13:00:11] [I] SMs: 2

[09/06/2022-13:00:11] [I] Compute Clock Rate: 1.3 GHz

[09/06/2022-13:00:11] [I] Device Global Memory: 3825 MiB

[09/06/2022-13:00:11] [I] Shared Memory per SM: 64 KiB

[09/06/2022-13:00:11] [I] Memory Bus Width: 128 bits (ECC disabled)

[09/06/2022-13:00:11] [I] Memory Clock Rate: 1.3 GHz

[09/06/2022-13:00:11] [I]

[09/06/2022-13:00:11] [I] TensorRT version: 8.2.5

[09/06/2022-13:00:13] [I] [TRT] [MemUsageChange] Init CUDA: CPU +267, GPU +0, now: CPU 285, GPU 1692 (MiB)

[09/06/2022-13:00:13] [I] [TRT] [MemUsageSnapshot] Begin constructing builder kernel library: CPU 285 MiB, GPU 1720 MiB

[09/06/2022-13:00:13] [I] [TRT] [MemUsageSnapshot] End constructing builder kernel library: CPU 314 MiB, GPU 1749 MiB

[09/06/2022-13:00:13] [I] Start parsing network model

[09/06/2022-13:00:14] [I] [TRT] ----------------------------------------------------------------

[09/06/2022-13:00:14] [I] [TRT] Input filename: model.onnx

[09/06/2022-13:00:14] [I] [TRT] ONNX IR version: 0.0.8

[09/06/2022-13:00:14] [I] [TRT] Opset version: 11

[09/06/2022-13:00:14] [I] [TRT] Producer name: pytorch

[09/06/2022-13:00:14] [I] [TRT] Producer version: 1.13.1

[09/06/2022-13:00:14] [I] [TRT] Domain:

[09/06/2022-13:00:14] [I] [TRT] Model version: 0

[09/06/2022-13:00:14] [I] [TRT] Doc string:

[09/06/2022-13:00:14] [I] [TRT] ----------------------------------------------------------------

[09/06/2022-13:00:14] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:370: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.

[09/06/2022-13:00:14] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:396: One or more weights outside the range of INT32 was clamped

[09/06/2022-13:00:14] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:396: One or more weights outside the range of INT32 was clamped

[09/06/2022-13:00:14] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:396: One or more weights outside the range of INT32 was clamped

[09/06/2022-13:00:14] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:396: One or more weights outside the range of INT32 was clamped

[09/06/2022-13:00:14] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:396: One or more weights outside the range of INT32 was clamped

[09/06/2022-13:00:14] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:396: One or more weights outside the range of INT32 was clamped

[09/06/2022-13:00:14] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:396: One or more weights outside the range of INT32 was clamped

[09/06/2022-13:00:14] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:396: One or more weights outside the range of INT32 was clamped

[09/06/2022-13:00:14] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:396: One or more weights outside the range of INT32 was clamped

[09/06/2022-13:00:14] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:396: One or more weights outside the range of INT32 was clamped

[09/06/2022-13:00:14] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:396: One or more weights outside the range of INT32 was clamped

[09/06/2022-13:00:14] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:396: One or more weights outside the range of INT32 was clamped

[09/06/2022-13:00:14] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:396: One or more weights outside the range of INT32 was clamped

[09/06/2022-13:00:14] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:396: One or more weights outside the range of INT32 was clamped

[09/06/2022-13:00:14] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:396: One or more weights outside the range of INT32 was clamped

[09/06/2022-13:00:14] [I] Finish parsing network model

[09/06/2022-13:00:14] [I] [TRT] ---------- Layers Running on DLA ----------

[09/06/2022-13:00:14] [I] [TRT] ---------- Layers Running on GPU ----------

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /transform/Constant_output_0

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /transform/Constant_1_output_0

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Constant_output_0

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Gather_17_output_0

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Constant_24_output_0 + (Unnamed Layer* 68) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Constant_29_output_0 + (Unnamed Layer* 71) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Constant_34_output_0 + (Unnamed Layer* 74) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Constant_39_output_0 + (Unnamed Layer* 77) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 80) [Constant] + (Unnamed Layer* 82) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Constant_40_output_0 + (Unnamed Layer* 83) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 86) [Constant] + (Unnamed Layer* 88) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Constant_41_output_0 + (Unnamed Layer* 89) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_11_output_0

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_13_output_0

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_12_output_0

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_14_output_0

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Constant_output_0_0

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_11_output_0_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_13_output_0_2

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Gather_21_output_0

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Constant_42_output_0 + (Unnamed Layer* 113) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Constant_43_output_0 + (Unnamed Layer* 116) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Gather_17_output_0_3

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] onnx::Max_553 + (Unnamed Layer* 149) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] onnx::Max_553_4 + (Unnamed Layer* 152) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Cast_6_output_0 + (Unnamed Layer* 155) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Cast_7_output_0 + (Unnamed Layer* 158) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] onnx::Add_575

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] onnx::Gather_586

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 186) [Constant] + (Unnamed Layer* 187) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Constant_2_output_0

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Constant_output_0_6

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /transform/Constant_6_output_0

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Constant_1_output_0

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Constant_3_output_0 + (Unnamed Layer* 216) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Constant_4_output_0 + (Unnamed Layer* 219) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/box_roi_pool/Constant_4_output_0

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/box_roi_pool/Constant_5_output_0

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] roi_heads.box_head.fc6.weight

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] roi_heads.box_head.fc6.bias + (Unnamed Layer* 238) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] roi_heads.box_head.fc7.weight

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] roi_heads.box_head.fc7.bias + (Unnamed Layer* 244) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] roi_heads.box_predictor.cls_score.weight

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] roi_heads.box_predictor.cls_score.bias + (Unnamed Layer* 250) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] roi_heads.box_predictor.bbox_pred.weight

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] roi_heads.box_predictor.bbox_pred.bias + (Unnamed Layer* 255) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Constant_9_output_0 + (Unnamed Layer* 280) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Constant_14_output_0 + (Unnamed Layer* 283) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Constant_19_output_0 + (Unnamed Layer* 286) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Constant_24_output_0 + (Unnamed Layer* 289) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 293) [Constant] + (Unnamed Layer* 295) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Constant_25_output_0 + (Unnamed Layer* 296) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 299) [Constant] + (Unnamed Layer* 301) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Constant_26_output_0 + (Unnamed Layer* 302) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Reshape_3_output_0

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Constant_27_output_0 + (Unnamed Layer* 327) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Constant_28_output_0 + (Unnamed Layer* 330) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] onnx::Max_553_7 + (Unnamed Layer* 362) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] onnx::Max_553_8 + (Unnamed Layer* 365) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Cast_6_output_0_9 + (Unnamed Layer* 370) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Cast_7_output_0_10 + (Unnamed Layer* 373) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Squeeze

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /transform/Sub

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /transform/Div

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /transform/Unsqueeze

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /transform/Resize

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /transform/Gather

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /transform/Pad

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /transform/Unsqueeze_12

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /transform/Unsqueeze_12_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /backbone/backbone.0/conv/conv/Conv + /backbone/backbone.0/conv/activ/Relu

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /backbone/backbone.0/pool/MaxPool

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /backbone/backbone.1/unit1/body/conv1/conv/Conv + /backbone/backbone.1/unit1/body/conv1/activ/Relu

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /backbone/backbone.1/unit1/body/conv2/conv/Conv + /backbone/backbone.1/unit1/Add + /backbone/backbone.1/unit1/activ/Relu

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /backbone/backbone.2/unit1/body/conv1/conv/Conv + /backbone/backbone.2/unit1/body/conv1/activ/Relu

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /backbone/backbone.2/unit1/body/conv2/conv/Conv

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /backbone/backbone.2/unit1/identity_conv/conv/Conv + /backbone/backbone.2/unit1/Add + /backbone/backbone.2/unit1/activ/Relu

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /backbone/backbone.3/unit1/body/conv1/conv/Conv + /backbone/backbone.3/unit1/body/conv1/activ/Relu

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /backbone/backbone.3/unit1/body/conv2/conv/Conv

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /backbone/backbone.3/unit1/identity_conv/conv/Conv + /backbone/backbone.3/unit1/Add + /backbone/backbone.3/unit1/activ/Relu

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /backbone/backbone.4/unit1/body/conv1/conv/Conv + /backbone/backbone.4/unit1/body/conv1/activ/Relu

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /backbone/backbone.4/unit1/body/conv2/conv/Conv

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /backbone/backbone.4/unit1/identity_conv/conv/Conv + /backbone/backbone.4/unit1/Add + /backbone/backbone.4/unit1/activ/Relu

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/head/conv/conv.0/conv.0.0/Conv + /rpn/head/conv/conv.0/conv.0.1/Relu

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/head/bbox_pred/Conv || /rpn/head/cls_logits/Conv

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Reshape + /rpn/Transpose

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Reshape_2 + /rpn/Transpose_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Reshape_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Reshape_3

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Reshape_1_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Reshape_3_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Reshape_4 + /rpn/Reshape_5

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Flatten + /rpn/Reshape_8

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Gather_19

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Slice

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Slice_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Slice_2

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Slice_3

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/TopK

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Div_2

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Div_3

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Div_4

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Div_5

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Mul_4

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Mul_5

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 84) [ElementWise]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 90) [ElementWise]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Gather_18

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Add_2

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Add_3

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 85) [ElementWise]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 91) [ElementWise]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Gather_20

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Gather_22

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Exp

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Exp_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_25

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Mul_6

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Mul_7

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] PWN(/rpn/Sigmoid)

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Mul_9

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Mul_8

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Sub_2

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Add_4

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Sub_3

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Add_5

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Cast_464

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_15

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_17

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_16

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_18

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_15_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_16_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_17_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_18_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Squeeze_1 + Unsqueeze_471 + Unsqueeze_472 + NonMaxSuppression_475

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Flatten_1 + /rpn/Reshape_6 + /rpn/Reshape_7

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Gather_23

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Gather_24

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Slice_5

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Slice_6

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Max

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Max_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Min

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Min_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_28

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_29

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_28_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Unsqueeze_29_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Reshape_11

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] ReduceMax_463

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Add_466

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 167) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Mul_467

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Unsqueeze_468

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Add_469

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Unsqueeze_470

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] NonMaxSuppression_475_5

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Gather_477

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Squeeze_478

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /rpn/Gather_27

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Cast

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Gather_2

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Gather_3

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Gather_4

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Gather_5

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/box_roi_pool/ConstantOfShape

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Sub

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Sub_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Mul

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Unsqueeze_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Mul_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Unsqueeze_3

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/box_roi_pool/Concat_1_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/box_roi_pool/Concat_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Add

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Add_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/box_roi_pool/Gather

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/box_roi_pool/Gather_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Unsqueeze_2

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Unsqueeze_4

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/box_roi_pool/Squeeze

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/box_roi_pool/Cast

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/box_roi_pool/RoiAlign

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/box_head/Flatten

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/box_head/fc6/Gemm

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 239) [ElementWise] + /roi_heads/box_head/Relu

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/box_head/fc7/Gemm

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 245) [ElementWise] + /roi_heads/box_head/Relu_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/box_predictor/cls_score/Gemm

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/box_predictor/bbox_pred/Gemm

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 251) [ElementWise]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 256) [ElementWise]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Softmax

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Slice

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Slice_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Slice_2

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Slice_3

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Div

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Div_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Div_2

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Div_3

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Mul_2

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Mul_3

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 297) [ElementWise]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 303) [ElementWise]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Add_2

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Add_3

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 298) [ElementWise]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 304) [ElementWise]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Reshape_5

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Exp

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Exp_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Mul_4

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Mul_5

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Mul_7

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Mul_6

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Expand

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Sub_2

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Add_4

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Sub_3

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Add_5

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Unsqueeze_5

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Unsqueeze_7

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Unsqueeze_6

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Unsqueeze_8

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Reshape_6

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Unsqueeze_5_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Unsqueeze_6_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Unsqueeze_7_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Unsqueeze_8_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Cast_670

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Slice_6

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Slice_7

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Max

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Max_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Min

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Min_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Unsqueeze_12

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Unsqueeze_13

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Unsqueeze_12_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Unsqueeze_13_output_0 copy

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Reshape_2

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Slice_8

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Reshape_4

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] ReduceMax_669

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] onnx::Add_785

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Add_672

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] (Unnamed Layer* 388) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Mul_673

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Unsqueeze_674

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Add_675

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Unsqueeze_676

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Unsqueeze_677 + Unsqueeze_678 + NonMaxSuppression_681

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] NonMaxSuppression_681_11

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] onnx::Gather_796

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Gather_683

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] Squeeze_684

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Gather_9

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Gather_10

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /roi_heads/Gather_11

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Squeeze_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Squeeze_2

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Squeeze_3

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Squeeze_4

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Div_1_output_0 + (Unnamed Layer* 411) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Mul

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Div_1_output_0_17 + (Unnamed Layer* 414) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Mul_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Div_output_0 + (Unnamed Layer* 417) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Mul_2

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Div_output_0_18 + (Unnamed Layer* 420) [Shuffle]

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Mul_3

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Unsqueeze

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Unsqueeze_1

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Unsqueeze_2

[09/06/2022-13:00:14] [I] [TRT] [GpuLayer] /Unsqueeze_3

[09/06/2022-13:00:15] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +167, GPU +162, now: CPU 613, GPU 2188 (MiB)

[09/06/2022-13:00:17] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +250, GPU +286, now: CPU 863, GPU 2474 (MiB)

[09/06/2022-13:00:17] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.

[09/06/2022-13:00:57] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.

[09/06/2022-13:01:52] [I] [TRT] Detected 1 inputs and 7 output network tensors.

[09/06/2022-13:01:52] [I] [TRT] Total Host Persistent Memory: 26752

[09/06/2022-13:01:52] [I] [TRT] Total Device Persistent Memory: 36447744

[09/06/2022-13:01:52] [I] [TRT] Total Scratch Memory: 512000

[09/06/2022-13:01:52] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 9 MiB, GPU 384 MiB

[09/06/2022-13:01:52] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 205.304ms to assign 15 blocks to 199 nodes requiring 83747843 bytes.

[09/06/2022-13:01:52] [I] [TRT] Total Activation Memory: 83747843

[09/06/2022-13:01:52] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1115, GPU 2955 (MiB)

[09/06/2022-13:01:52] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +0, now: CPU 1116, GPU 2955 (MiB)

[09/06/2022-13:01:52] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +256, now: CPU 0, GPU 256 (MiB)

[09/06/2022-13:01:52] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1251, GPU 3092 (MiB)

[09/06/2022-13:01:52] [I] [TRT] Loaded engine size: 137 MiB

[09/06/2022-13:01:52] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1252, GPU 3094 (MiB)

[09/06/2022-13:01:52] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 1252, GPU 3094 (MiB)

[09/06/2022-13:01:52] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +136, now: CPU 0, GPU 136 (MiB)

[09/06/2022-13:01:52] [I] Engine built in 101.427 sec.

[09/06/2022-13:01:52] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 954, GPU 2841 (MiB)

[09/06/2022-13:01:52] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 954, GPU 2841 (MiB)

[09/06/2022-13:01:52] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +115, now: CPU 0, GPU 251 (MiB)

[09/06/2022-13:01:52] [I] Using random values for input input0

[09/06/2022-13:01:52] [I] Created input binding for input0 with dimensions 1x3x480x640

[09/06/2022-13:01:52] [I] Using random values for output scores

[09/06/2022-13:01:52] [I] Created output binding for scores with dimensions 100

[09/06/2022-13:01:52] [I] Using random values for output labels

[09/06/2022-13:01:52] [I] Created output binding for labels with dimensions 100

[09/06/2022-13:01:52] [I] Using random values for output boxes

[09/06/2022-13:01:52] [I] Created output binding for boxes with dimensions 100x4

[09/06/2022-13:01:52] [I] Starting inference

[09/06/2022-13:01:52] [E] **Error[2]: [pluginV2DynamicExtRunner.cpp::execute::115] Error Code 2: Internal Error (Assertion status == kSTATUS_SUCCESS failed. )**

[09/06/2022-13:01:52] [E] Error occurred during inference

Step 4 – Workaround
In TensorRT OSS 8.2 (plugin → roiAlignPlugin → roiAlignPlugin.cpp), changing the line below inside ROIAlign::initialize() resolved the error:

//// Orig: mMaxThreadsPerBlock==1024 for Jetson TX2 NX; leads to trtexec inference error
// mMaxThreadsPerBlock = props.maxThreadsPerBlock;

//// Modified: mMaxThreadsPerBlock==512
mMaxThreadsPerBlock = props.maxThreadsPerBlock / 2;
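
Halving the value is a blunt fix. As a rough sketch only, and assuming (not confirmed in this thread) that the launch failure is related to per-block resources, the block size could instead be clamped against the register budget that deviceQuery reports (32768 registers per block). The struct `DeviceLimits`, the helper `clampThreadsPerBlock`, and the register counts used below are hypothetical names and values for illustration; for a real kernel the register count and the per-kernel thread limit can be read with `cudaFuncGetAttributes` (fields `numRegs` and `maxThreadsPerBlock`).

```cpp
#include <algorithm>
#include <cassert>

// Device limits, filled in from the deviceQuery output in this post.
struct DeviceLimits {
    int maxThreadsPerBlock;  // 1024 on this Tegra X2
    int regsPerBlock;        // 32768 on this Tegra X2
    int warpSize;            // 32
};

// Hypothetical helper: largest block size (a multiple of the warp size)
// that fits both the device thread limit and the per-block register budget.
// regsPerThread is an assumed input; this sketch ignores the hardware's
// register allocation granularity, so the real limit may be lower.
inline int clampThreadsPerBlock(const DeviceLimits& dev, int regsPerThread) {
    assert(dev.warpSize > 0 && regsPerThread > 0);
    const int byRegisters = dev.regsPerBlock / regsPerThread;
    const int limit = std::min(dev.maxThreadsPerBlock, byRegisters);
    return (limit / dev.warpSize) * dev.warpSize;  // round down to whole warps
}
```

With the limits above, a kernel that needs 32 registers per thread could still use 1024 threads per block, while one that needs 64 registers per thread would be limited to 512. That matches the halved value used in the workaround, but it does not by itself show that registers are the cause here.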

Does the Jetson TX2 NX support only 512 as the “Maximum number of threads per block” in practice, even though deviceQuery reports 1024 (output captured below for reference)? Or did some other factor raise the error in Step 3?

Device 0: "NVIDIA Tegra X2"
  CUDA Driver Version / Runtime Version          10.2 / 10.2
  CUDA Capability Major/Minor version number:    6.2
  Total amount of global memory:                 3825 MBytes (4011147264 bytes)
  ( 2) Multiprocessors, (128) CUDA Cores/MP:     256 CUDA Cores
  GPU Max Clock rate:                            1300 MHz (1.30 GHz)
  Memory Clock rate:                             1300 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  **Maximum number of threads per block:           1024**
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS

Could you help look into this issue and let us know? Many thanks for your help!!

Hi,

We are going to give it a try.
Could you share the ONNX model with us as well?

Thanks.

Hi,

Thank you very much for your prompt reply!

I have shared with you my ONNX model via DM.

Hi,

We have downloaded the model and are now working on reproducing the issue.
Will keep you updated.

Thanks.

Hi,

Would you mind sharing the custom roiAlignPlugin for TensorRT 8.2 with us as well?
Thanks.

Hi,

Thank you for your reply. I have shared with you relevant files via DM.

Hi,

Thanks for sharing.

It looks like the limit comes from the “Maximum number of threads per multiprocessor”.

We isolated RoiAlignImpl and ran it on its own.
It runs normally with 1024 threads but fails when run as part of the full TensorRT inference code.

So it is possible that, when all the inference kernels run together, the number of threads on a multiprocessor exceeds the maximum of 2048.

Thanks.
