TRT8 breaks DLA

Dear NVIDIA team,

I can’t run even a simple, purpose-built .onnx model on DLA with TensorRT 8. The same ONNX model works on TensorRT 7.
ONNX model:
reid-model.onnx (2.6 MB)

The command:

trtexec --onnx=reid-model.onnx --verbose --useDLACore=0

The output (failed run):

&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # trtexec --onnx=reid-model.onnx --verbose --useDLACore=0
[12/22/2021-18:42:35] [I] === Model Options ===
[12/22/2021-18:42:35] [I] Format: ONNX
[12/22/2021-18:42:35] [I] Model: reid-model.onnx
[12/22/2021-18:42:35] [I] Output:
[12/22/2021-18:42:35] [I] === Build Options ===
[12/22/2021-18:42:35] [I] Max batch: explicit
[12/22/2021-18:42:35] [I] Workspace: 16 MiB
[12/22/2021-18:42:35] [I] minTiming: 1
[12/22/2021-18:42:35] [I] avgTiming: 8
[12/22/2021-18:42:35] [I] Precision: FP32
[12/22/2021-18:42:35] [I] Calibration:
[12/22/2021-18:42:35] [I] Refit: Disabled
[12/22/2021-18:42:35] [I] Sparsity: Disabled
[12/22/2021-18:42:35] [I] Safe mode: Disabled
[12/22/2021-18:42:35] [I] Restricted mode: Disabled
[12/22/2021-18:42:35] [I] Save engine:
[12/22/2021-18:42:35] [I] Load engine:
[12/22/2021-18:42:35] [I] NVTX verbosity: 0
[12/22/2021-18:42:35] [I] Tactic sources: Using default tactic sources
[12/22/2021-18:42:35] [I] timingCacheMode: local
[12/22/2021-18:42:35] [I] timingCacheFile:
[12/22/2021-18:42:35] [I] Input(s)s format: fp32:CHW
[12/22/2021-18:42:35] [I] Output(s)s format: fp32:CHW
[12/22/2021-18:42:35] [I] Input build shapes: model
[12/22/2021-18:42:35] [I] Input calibration shapes: model
[12/22/2021-18:42:35] [I] === System Options ===
[12/22/2021-18:42:35] [I] Device: 0
[12/22/2021-18:42:35] [I] DLACore: 0
[12/22/2021-18:42:35] [I] Plugins:
[12/22/2021-18:42:35] [I] === Inference Options ===
[12/22/2021-18:42:35] [I] Batch: Explicit
[12/22/2021-18:42:35] [I] Input inference shapes: model
[12/22/2021-18:42:35] [I] Iterations: 10
[12/22/2021-18:42:35] [I] Duration: 3s (+ 200ms warm up)
[12/22/2021-18:42:35] [I] Sleep time: 0ms
[12/22/2021-18:42:35] [I] Streams: 1
[12/22/2021-18:42:35] [I] ExposeDMA: Disabled
[12/22/2021-18:42:35] [I] Data transfers: Enabled
[12/22/2021-18:42:35] [I] Spin-wait: Disabled
[12/22/2021-18:42:35] [I] Multithreading: Disabled
[12/22/2021-18:42:35] [I] CUDA Graph: Disabled
[12/22/2021-18:42:35] [I] Separate profiling: Disabled
[12/22/2021-18:42:35] [I] Time Deserialize: Disabled
[12/22/2021-18:42:35] [I] Time Refit: Disabled
[12/22/2021-18:42:35] [I] Skip inference: Disabled
[12/22/2021-18:42:35] [I] Inputs:
[12/22/2021-18:42:35] [I] === Reporting Options ===
[12/22/2021-18:42:35] [I] Verbose: Enabled
[12/22/2021-18:42:35] [I] Averages: 10 inferences
[12/22/2021-18:42:35] [I] Percentile: 99
[12/22/2021-18:42:35] [I] Dump refittable layers:Disabled
[12/22/2021-18:42:35] [I] Dump output: Disabled
[12/22/2021-18:42:35] [I] Profile: Disabled
[12/22/2021-18:42:35] [I] Export timing to JSON file:
[12/22/2021-18:42:35] [I] Export output to JSON file:
[12/22/2021-18:42:35] [I] Export profile to JSON file:
[12/22/2021-18:42:35] [I]
[12/22/2021-18:42:35] [I] === Device Information ===
[12/22/2021-18:42:35] [I] Selected Device: Xavier
[12/22/2021-18:42:35] [I] Compute Capability: 7.2
[12/22/2021-18:42:35] [I] SMs: 8
[12/22/2021-18:42:35] [I] Compute Clock Rate: 1.377 GHz
[12/22/2021-18:42:35] [I] Device Global Memory: 31928 MiB
[12/22/2021-18:42:35] [I] Shared Memory per SM: 96 KiB
[12/22/2021-18:42:35] [I] Memory Bus Width: 256 bits (ECC disabled)
[12/22/2021-18:42:35] [I] Memory Clock Rate: 1.377 GHz
[12/22/2021-18:42:35] [I]
[12/22/2021-18:42:35] [I] TensorRT version: 8001
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::Proposal version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::Split version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[12/22/2021-18:42:35] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[12/22/2021-18:42:36] [I] [TRT] [MemUsageChange] Init CUDA: CPU +354, GPU +0, now: CPU 372, GPU 19256 (MiB)
[12/22/2021-18:42:36] [I] Start parsing network model
[12/22/2021-18:42:36] [I] [TRT] ----------------------------------------------------------------
[12/22/2021-18:42:36] [I] [TRT] Input filename:   reid-model.onnx
[12/22/2021-18:42:36] [I] [TRT] ONNX IR version:  0.0.6
[12/22/2021-18:42:36] [I] [TRT] Opset version:    11
[12/22/2021-18:42:36] [I] [TRT] Producer name:    tf2onnx
[12/22/2021-18:42:36] [I] [TRT] Producer version: 1.9.2
[12/22/2021-18:42:36] [I] [TRT] Domain:
[12/22/2021-18:42:36] [I] [TRT] Model version:    0
[12/22/2021-18:42:36] [I] [TRT] Doc string:
[12/22/2021-18:42:36] [I] [TRT] ----------------------------------------------------------------
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::GridAnchor_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::GridAnchorRect_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::NMS_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::Reorg_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::Region_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::Clip_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::LReLU_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::PriorBox_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::Normalize_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::ScatterND version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::RPROI_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::BatchedNMS_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::BatchedNMSDynamic_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::FlattenConcat_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::CropAndResize version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::DetectionLayer_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::EfficientNMS_ONNX_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::EfficientNMS_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::Proposal version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::ProposalLayer_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::PyramidROIAlign_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::ResizeNearest_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::Split version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::SpecialSlice_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Plugin creator already registered - ::InstanceNormalization_TRT version 1
[12/22/2021-18:42:36] [V] [TRT] Adding network input: serving_default_input_1:0 with dtype: float32, dimensions: (-1, 1024)
[12/22/2021-18:42:36] [V] [TRT] Registering tensor: serving_default_input_1:0 for ONNX tensor: serving_default_input_1:0
[12/22/2021-18:42:36] [V] [TRT] Importing initializer: const_fold_opt__18
[12/22/2021-18:42:36] [V] [TRT] Importing initializer: const_fold_opt__17
[12/22/2021-18:42:36] [V] [TRT] Importing initializer: const_fold_opt__16
[12/22/2021-18:42:36] [V] [TRT] Importing initializer: const_fold_opt__15
[12/22/2021-18:42:36] [V] [TRT] Parsing node: model_1/dense/MatMul;model_1/tf.nn.relu/Relu;model_1/dense/BiasAdd [MatMul]
[12/22/2021-18:42:36] [V] [TRT] Searching for input: serving_default_input_1:0
[12/22/2021-18:42:36] [V] [TRT] Searching for input: const_fold_opt__18
[12/22/2021-18:42:36] [V] [TRT] model_1/dense/MatMul;model_1/tf.nn.relu/Relu;model_1/dense/BiasAdd [MatMul] inputs: [serving_default_input_1:0 -> (-1, 1024)[FLOAT]], [const_fold_opt__18 -> (1024, 512)[FLOAT]],
[12/22/2021-18:42:36] [V] [TRT] Registering layer: const_fold_opt__18 for ONNX node: const_fold_opt__18
[12/22/2021-18:42:36] [V] [TRT] GEMM: using FC layer instead of MM because all criteria were met.
[12/22/2021-18:42:36] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[12/22/2021-18:42:36] [V] [TRT] Original shape: (_, 1024), unsqueezing to: (_, _, _, _)
[12/22/2021-18:42:36] [W] [TRT] ShapedWeights.cpp:173: Weights const_fold_opt__18 has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
[12/22/2021-18:42:36] [V] [TRT] Registering layer: model_1/dense/MatMul;model_1/tf.nn.relu/Relu;model_1/dense/BiasAdd for ONNX node: model_1/dense/MatMul;model_1/tf.nn.relu/Relu;model_1/dense/BiasAdd
[12/22/2021-18:42:36] [V] [TRT] Original shape: (_, 512, 1, 1), squeezing to: (_, _)
[12/22/2021-18:42:36] [V] [TRT] Registering tensor: model_1/dense/MatMul;model_1/tf.nn.relu/Relu;model_1/dense/BiasAdd for ONNX tensor: model_1/dense/MatMul;model_1/tf.nn.relu/Relu;model_1/dense/BiasAdd
[12/22/2021-18:42:36] [V] [TRT] model_1/dense/MatMul;model_1/tf.nn.relu/Relu;model_1/dense/BiasAdd [MatMul] outputs: [model_1/dense/MatMul;model_1/tf.nn.relu/Relu;model_1/dense/BiasAdd -> (-1, 512)[FLOAT]],
[12/22/2021-18:42:36] [V] [TRT] Parsing node: Relu__5 [Relu]
[12/22/2021-18:42:36] [V] [TRT] Searching for input: model_1/dense/MatMul;model_1/tf.nn.relu/Relu;model_1/dense/BiasAdd
[12/22/2021-18:42:36] [V] [TRT] Relu__5 [Relu] inputs: [model_1/dense/MatMul;model_1/tf.nn.relu/Relu;model_1/dense/BiasAdd -> (-1, 512)[FLOAT]],
[12/22/2021-18:42:36] [V] [TRT] Registering layer: Relu__5 for ONNX node: Relu__5
[12/22/2021-18:42:36] [V] [TRT] Registering tensor: Relu__5:0 for ONNX tensor: Relu__5:0
[12/22/2021-18:42:36] [V] [TRT] Relu__5 [Relu] outputs: [Relu__5:0 -> (-1, 512)[FLOAT]],
[12/22/2021-18:42:36] [V] [TRT] Parsing node: model_1/batch_normalization/batchnorm/mul_1;model_1/batch_normalization/batchnorm/add_1;model_1/dense_1/MatMul;model_1/tf.nn.relu_1/Relu;model_1/dense_1/BiasAdd1 [MatMul]
[12/22/2021-18:42:36] [V] [TRT] Searching for input: Relu__5:0
[12/22/2021-18:42:36] [V] [TRT] Searching for input: const_fold_opt__16
[12/22/2021-18:42:36] [V] [TRT] model_1/batch_normalization/batchnorm/mul_1;model_1/batch_normalization/batchnorm/add_1;model_1/dense_1/MatMul;model_1/tf.nn.relu_1/Relu;model_1/dense_1/BiasAdd1 [MatMul] inputs: [Relu__5:0 -> (-1, 512)[FLOAT]], [const_fold_opt__16 -> (512, 256)[FLOAT]],
[12/22/2021-18:42:36] [V] [TRT] Registering layer: const_fold_opt__16 for ONNX node: const_fold_opt__16
[12/22/2021-18:42:36] [V] [TRT] GEMM: using FC layer instead of MM because all criteria were met.
[12/22/2021-18:42:36] [V] [TRT] Original shape: (_, 512), unsqueezing to: (_, _, _, _)
[12/22/2021-18:42:36] [W] [TRT] ShapedWeights.cpp:173: Weights const_fold_opt__16 has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
[12/22/2021-18:42:36] [V] [TRT] Registering layer: model_1/batch_normalization/batchnorm/mul_1;model_1/batch_normalization/batchnorm/add_1;model_1/dense_1/MatMul;model_1/tf.nn.relu_1/Relu;model_1/dense_1/BiasAdd1 for ONNX node: model_1/batch_normalization/batchnorm/mul_1;model_1/batch_normalization/batchnorm/add_1;model_1/dense_1/MatMul;model_1/tf.nn.relu_1/Relu;model_1/dense_1/BiasAdd1
[12/22/2021-18:42:36] [V] [TRT] Original shape: (_, 256, 1, 1), squeezing to: (_, _)
[12/22/2021-18:42:36] [V] [TRT] Registering tensor: model_1/batch_normalization/batchnorm/mul_1;model_1/batch_normalization/batchnorm/add_1;model_1/dense_1/MatMul;model_1/tf.nn.relu_1/Relu;model_1/dense_1/BiasAdd1 for ONNX tensor: model_1/batch_normalization/batchnorm/mul_1;model_1/batch_normalization/batchnorm/add_1;model_1/dense_1/MatMul;model_1/tf.nn.relu_1/Relu;model_1/dense_1/BiasAdd1
[12/22/2021-18:42:36] [V] [TRT] model_1/batch_normalization/batchnorm/mul_1;model_1/batch_normalization/batchnorm/add_1;model_1/dense_1/MatMul;model_1/tf.nn.relu_1/Relu;model_1/dense_1/BiasAdd1 [MatMul] outputs: [model_1/batch_normalization/batchnorm/mul_1;model_1/batch_normalization/batchnorm/add_1;model_1/dense_1/MatMul;model_1/tf.nn.relu_1/Relu;model_1/dense_1/BiasAdd1 -> (-1, 256)[FLOAT]],
[12/22/2021-18:42:36] [V] [TRT] Parsing node: Relu__8 [Relu]
[12/22/2021-18:42:36] [V] [TRT] Searching for input: model_1/batch_normalization/batchnorm/mul_1;model_1/batch_normalization/batchnorm/add_1;model_1/dense_1/MatMul;model_1/tf.nn.relu_1/Relu;model_1/dense_1/BiasAdd1
[12/22/2021-18:42:36] [V] [TRT] Relu__8 [Relu] inputs: [model_1/batch_normalization/batchnorm/mul_1;model_1/batch_normalization/batchnorm/add_1;model_1/dense_1/MatMul;model_1/tf.nn.relu_1/Relu;model_1/dense_1/BiasAdd1 -> (-1, 256)[FLOAT]],
[12/22/2021-18:42:36] [V] [TRT] Registering layer: Relu__8 for ONNX node: Relu__8
[12/22/2021-18:42:36] [V] [TRT] Registering tensor: Relu__8:0 for ONNX tensor: Relu__8:0
[12/22/2021-18:42:36] [V] [TRT] Relu__8 [Relu] outputs: [Relu__8:0 -> (-1, 256)[FLOAT]],
[12/22/2021-18:42:36] [V] [TRT] Parsing node: model_1/batch_normalization_1/batchnorm/mul_1;model_1/batch_normalization_1/batchnorm/add_1;model_1/dense_2/MatMul;model_1/tf.nn.relu_2/Relu;model_1/dense_2/BiasAdd1 [MatMul]
[12/22/2021-18:42:36] [V] [TRT] Searching for input: Relu__8:0
[12/22/2021-18:42:36] [V] [TRT] Searching for input: const_fold_opt__15
[12/22/2021-18:42:36] [V] [TRT] model_1/batch_normalization_1/batchnorm/mul_1;model_1/batch_normalization_1/batchnorm/add_1;model_1/dense_2/MatMul;model_1/tf.nn.relu_2/Relu;model_1/dense_2/BiasAdd1 [MatMul] inputs: [Relu__8:0 -> (-1, 256)[FLOAT]], [const_fold_opt__15 -> (256, 128)[FLOAT]],
[12/22/2021-18:42:36] [V] [TRT] Registering layer: const_fold_opt__15 for ONNX node: const_fold_opt__15
[12/22/2021-18:42:36] [V] [TRT] GEMM: using FC layer instead of MM because all criteria were met.
[12/22/2021-18:42:36] [V] [TRT] Original shape: (_, 256), unsqueezing to: (_, _, _, _)
[12/22/2021-18:42:36] [W] [TRT] ShapedWeights.cpp:173: Weights const_fold_opt__15 has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
[12/22/2021-18:42:36] [V] [TRT] Registering layer: model_1/batch_normalization_1/batchnorm/mul_1;model_1/batch_normalization_1/batchnorm/add_1;model_1/dense_2/MatMul;model_1/tf.nn.relu_2/Relu;model_1/dense_2/BiasAdd1 for ONNX node: model_1/batch_normalization_1/batchnorm/mul_1;model_1/batch_normalization_1/batchnorm/add_1;model_1/dense_2/MatMul;model_1/tf.nn.relu_2/Relu;model_1/dense_2/BiasAdd1
[12/22/2021-18:42:36] [V] [TRT] Original shape: (_, 128, 1, 1), squeezing to: (_, _)
[12/22/2021-18:42:36] [V] [TRT] Registering tensor: model_1/batch_normalization_1/batchnorm/mul_1;model_1/batch_normalization_1/batchnorm/add_1;model_1/dense_2/MatMul;model_1/tf.nn.relu_2/Relu;model_1/dense_2/BiasAdd1 for ONNX tensor: model_1/batch_normalization_1/batchnorm/mul_1;model_1/batch_normalization_1/batchnorm/add_1;model_1/dense_2/MatMul;model_1/tf.nn.relu_2/Relu;model_1/dense_2/BiasAdd1
[12/22/2021-18:42:36] [V] [TRT] model_1/batch_normalization_1/batchnorm/mul_1;model_1/batch_normalization_1/batchnorm/add_1;model_1/dense_2/MatMul;model_1/tf.nn.relu_2/Relu;model_1/dense_2/BiasAdd1 [MatMul] outputs: [model_1/batch_normalization_1/batchnorm/mul_1;model_1/batch_normalization_1/batchnorm/add_1;model_1/dense_2/MatMul;model_1/tf.nn.relu_2/Relu;model_1/dense_2/BiasAdd1 -> (-1, 128)[FLOAT]],
[12/22/2021-18:42:36] [V] [TRT] Parsing node: Relu__11 [Relu]
[12/22/2021-18:42:36] [V] [TRT] Searching for input: model_1/batch_normalization_1/batchnorm/mul_1;model_1/batch_normalization_1/batchnorm/add_1;model_1/dense_2/MatMul;model_1/tf.nn.relu_2/Relu;model_1/dense_2/BiasAdd1
[12/22/2021-18:42:36] [V] [TRT] Relu__11 [Relu] inputs: [model_1/batch_normalization_1/batchnorm/mul_1;model_1/batch_normalization_1/batchnorm/add_1;model_1/dense_2/MatMul;model_1/tf.nn.relu_2/Relu;model_1/dense_2/BiasAdd1 -> (-1, 128)[FLOAT]],
[12/22/2021-18:42:36] [V] [TRT] Registering layer: Relu__11 for ONNX node: Relu__11
[12/22/2021-18:42:36] [V] [TRT] Registering tensor: Relu__11:0 for ONNX tensor: Relu__11:0
[12/22/2021-18:42:36] [V] [TRT] Relu__11 [Relu] outputs: [Relu__11:0 -> (-1, 128)[FLOAT]],
[12/22/2021-18:42:36] [V] [TRT] Parsing node: StatefulPartitionedCall:0 [MatMul]
[12/22/2021-18:42:36] [V] [TRT] Searching for input: Relu__11:0
[12/22/2021-18:42:36] [V] [TRT] Searching for input: const_fold_opt__17
[12/22/2021-18:42:36] [V] [TRT] StatefulPartitionedCall:0 [MatMul] inputs: [Relu__11:0 -> (-1, 128)[FLOAT]], [const_fold_opt__17 -> (128, 32)[FLOAT]],
[12/22/2021-18:42:36] [V] [TRT] Registering layer: const_fold_opt__17 for ONNX node: const_fold_opt__17
[12/22/2021-18:42:36] [V] [TRT] GEMM: using FC layer instead of MM because all criteria were met.
[12/22/2021-18:42:36] [V] [TRT] Original shape: (_, 128), unsqueezing to: (_, _, _, _)
[12/22/2021-18:42:36] [W] [TRT] ShapedWeights.cpp:173: Weights const_fold_opt__17 has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
[12/22/2021-18:42:36] [V] [TRT] Registering layer: StatefulPartitionedCall:0 for ONNX node: StatefulPartitionedCall:0
[12/22/2021-18:42:36] [V] [TRT] Original shape: (_, 32, 1, 1), squeezing to: (_, _)
[12/22/2021-18:42:36] [V] [TRT] Registering tensor: StatefulPartitionedCall:0_0 for ONNX tensor: StatefulPartitionedCall:0
[12/22/2021-18:42:36] [V] [TRT] StatefulPartitionedCall:0 [MatMul] outputs: [StatefulPartitionedCall:0 -> (-1, 32)[FLOAT]],
[12/22/2021-18:42:36] [V] [TRT] Marking StatefulPartitionedCall:0_0 as output: StatefulPartitionedCall:0
[12/22/2021-18:42:36] [I] Finish parsing network model
[12/22/2021-18:42:36] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 377, GPU 19266 (MiB)
[12/22/2021-18:42:36] [W] Dynamic dimensions required for input: serving_default_input_1:0, but no shapes were provided. Automatically overriding shape to: 1x1024
[12/22/2021-18:42:36] [W] [TRT] (Unnamed Layer* 3) [Concatenation]: DLA only supports concatenation on the C dimension.
[12/22/2021-18:42:36] [W] [TRT] (Unnamed Layer* 16) [Concatenation]: DLA only supports concatenation on the C dimension.
[12/22/2021-18:42:36] [W] [TRT] (Unnamed Layer* 29) [Concatenation]: DLA only supports concatenation on the C dimension.
[12/22/2021-18:42:36] [W] [TRT] (Unnamed Layer* 42) [Concatenation]: DLA only supports concatenation on the C dimension.
[12/22/2021-18:42:36] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 377 MiB, GPU 19266 MiB
[12/22/2021-18:42:36] [V] [TRT] Applying generic optimizations to the graph for inference.
[12/22/2021-18:42:36] [V] [TRT] Original: 15 layers
[12/22/2021-18:42:36] [V] [TRT] After dead-layer removal: 15 layers
[12/22/2021-18:42:36] [V] [TRT] After Myelin optimization: 15 layers
[12/22/2021-18:42:36] [W] [TRT] Input tensor has less than 4 dimensions for Relu__5. At least one shuffle layer will be inserted which cannot run on DLA.
[12/22/2021-18:42:36] [W] [TRT] Input tensor has less than 4 dimensions for Relu__8. At least one shuffle layer will be inserted which cannot run on DLA.
[12/22/2021-18:42:36] [W] [TRT] Input tensor has less than 4 dimensions for Relu__11. At least one shuffle layer will be inserted which cannot run on DLA.
[12/22/2021-18:42:36] [V] [TRT] After DLA optimization: 21 layers
[12/22/2021-18:42:36] [V] [TRT] After scale fusion: 21 layers
[12/22/2021-18:42:36] [V] [TRT] ShuffleShuffleFusion: Fusing (Unnamed Layer* 11) [Shuffle] with shuffle_model_1/dense/MatMul;model_1/tf.nn.relu/Relu;model_1/dense/BiasAdd
[12/22/2021-18:42:36] [V] [TRT] ShuffleShuffleFusion: Fusing shuffle_Relu__5:0 with (Unnamed Layer* 19) [Shuffle]
[12/22/2021-18:42:36] [V] [TRT] ShuffleShuffleFusion: Fusing (Unnamed Layer* 24) [Shuffle] with shuffle_model_1/batch_normalization/batchnorm/mul_1;model_1/batch_normalization/batchnorm/add_1;model_1/dense_1/MatMul;model_1/tf.nn.relu_1/Relu;model_1/dense_1/BiasAdd1
[12/22/2021-18:42:36] [V] [TRT] ShuffleShuffleFusion: Fusing shuffle_Relu__8:0 with (Unnamed Layer* 32) [Shuffle]
[12/22/2021-18:42:36] [V] [TRT] ShuffleShuffleFusion: Fusing (Unnamed Layer* 37) [Shuffle] with shuffle_model_1/batch_normalization_1/batchnorm/mul_1;model_1/batch_normalization_1/batchnorm/add_1;model_1/dense_2/MatMul;model_1/tf.nn.relu_2/Relu;model_1/dense_2/BiasAdd1
[12/22/2021-18:42:36] [V] [TRT] ShuffleShuffleFusion: Fusing shuffle_Relu__11:0 with (Unnamed Layer* 45) [Shuffle]
[12/22/2021-18:42:36] [V] [TRT] After vertical fusions: 15 layers
[12/22/2021-18:42:36] [V] [TRT] After dupe layer removal: 15 layers
[12/22/2021-18:42:36] [V] [TRT] After final dead-layer removal: 15 layers
[12/22/2021-18:42:36] [V] [TRT] After tensor merging: 15 layers
[12/22/2021-18:42:36] [V] [TRT] After concat removal: 15 layers
[12/22/2021-18:42:36] [V] [TRT] Graph construction and optimization completed in 0.0194514 seconds.
[12/22/2021-18:42:36] [E] Error[9]: [standardEngineBuilder.cpp::isValidDLAConfig::2189] Error Code 9: Internal Error (Default DLA is enabled but layer (Unnamed Layer* 6) [Shuffle] is not supported on DLA and falling back to GPU is not enabled.)
[12/22/2021-18:42:36] [E] Error[2]: [builder.cpp::buildSerializedNetwork::417] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed.)
Segmentation fault (core dumped)

Thanks for your help.

Hi,

[Shuffle] is not supported on DLA and falling back to GPU is not enabled.)

This error indicates that the Shuffle layer cannot run on DLA.
Please enable GPU fallback so that unsupported layers can run on the GPU instead.

For example, we can run your model with TensorRT 8.0 after adding --allowGPUFallback:

$ /usr/src/tensorrt/bin/trtexec --onnx=reid-model.onnx --useDLACore=0 --allowGPUFallback
&&&& RUNNING TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=reid-model.onnx --useDLACore=0 --allowGPUFallback
[12/22/2021-22:09:42] [I] === Model Options ===
[12/22/2021-22:09:42] [I] Format: ONNX
[12/22/2021-22:09:42] [I] Model: reid-model.onnx
[12/22/2021-22:09:42] [I] Output:
[12/22/2021-22:09:42] [I] === Build Options ===
[12/22/2021-22:09:42] [I] Max batch: explicit
[12/22/2021-22:09:42] [I] Workspace: 16 MiB
[12/22/2021-22:09:42] [I] minTiming: 1
[12/22/2021-22:09:42] [I] avgTiming: 8
[12/22/2021-22:09:42] [I] Precision: FP32
[12/22/2021-22:09:42] [I] Calibration:
[12/22/2021-22:09:42] [I] Refit: Disabled
[12/22/2021-22:09:42] [I] Sparsity: Disabled
[12/22/2021-22:09:42] [I] Safe mode: Disabled
[12/22/2021-22:09:42] [I] Restricted mode: Disabled
[12/22/2021-22:09:42] [I] Save engine:
[12/22/2021-22:09:42] [I] Load engine:
[12/22/2021-22:09:42] [I] NVTX verbosity: 0
[12/22/2021-22:09:42] [I] Tactic sources: Using default tactic sources
[12/22/2021-22:09:42] [I] timingCacheMode: local
[12/22/2021-22:09:42] [I] timingCacheFile:
[12/22/2021-22:09:42] [I] Input(s)s format: fp32:CHW
[12/22/2021-22:09:42] [I] Output(s)s format: fp32:CHW
[12/22/2021-22:09:42] [I] Input build shapes: model
[12/22/2021-22:09:42] [I] Input calibration shapes: model
[12/22/2021-22:09:42] [I] === System Options ===
[12/22/2021-22:09:42] [I] Device: 0
[12/22/2021-22:09:42] [I] DLACore: 0(With GPU fallback)
[12/22/2021-22:09:42] [I] Plugins:
[12/22/2021-22:09:42] [I] === Inference Options ===
[12/22/2021-22:09:42] [I] Batch: Explicit
[12/22/2021-22:09:42] [I] Input inference shapes: model
[12/22/2021-22:09:42] [I] Iterations: 10
[12/22/2021-22:09:42] [I] Duration: 3s (+ 200ms warm up)
[12/22/2021-22:09:42] [I] Sleep time: 0ms
[12/22/2021-22:09:42] [I] Streams: 1
[12/22/2021-22:09:42] [I] ExposeDMA: Disabled
[12/22/2021-22:09:42] [I] Data transfers: Enabled
[12/22/2021-22:09:42] [I] Spin-wait: Disabled
[12/22/2021-22:09:42] [I] Multithreading: Disabled
[12/22/2021-22:09:42] [I] CUDA Graph: Disabled
[12/22/2021-22:09:42] [I] Separate profiling: Disabled
[12/22/2021-22:09:42] [I] Time Deserialize: Disabled
[12/22/2021-22:09:42] [I] Time Refit: Disabled
[12/22/2021-22:09:42] [I] Skip inference: Disabled
[12/22/2021-22:09:42] [I] Inputs:
[12/22/2021-22:09:42] [I] === Reporting Options ===
[12/22/2021-22:09:42] [I] Verbose: Disabled
[12/22/2021-22:09:42] [I] Averages: 10 inferences
[12/22/2021-22:09:42] [I] Percentile: 99
[12/22/2021-22:09:42] [I] Dump refittable layers:Disabled
[12/22/2021-22:09:42] [I] Dump output: Disabled
[12/22/2021-22:09:42] [I] Profile: Disabled
[12/22/2021-22:09:42] [I] Export timing to JSON file:
[12/22/2021-22:09:42] [I] Export output to JSON file:
[12/22/2021-22:09:42] [I] Export profile to JSON file:
[12/22/2021-22:09:42] [I]
[12/22/2021-22:09:42] [I] === Device Information ===
[12/22/2021-22:09:42] [I] Selected Device: Xavier
[12/22/2021-22:09:42] [I] Compute Capability: 7.2
[12/22/2021-22:09:42] [I] SMs: 8
[12/22/2021-22:09:42] [I] Compute Clock Rate: 1.377 GHz
[12/22/2021-22:09:42] [I] Device Global Memory: 31920 MiB
[12/22/2021-22:09:42] [I] Shared Memory per SM: 96 KiB
[12/22/2021-22:09:42] [I] Memory Bus Width: 256 bits (ECC disabled)
[12/22/2021-22:09:42] [I] Memory Clock Rate: 1.377 GHz
[12/22/2021-22:09:42] [I]
[12/22/2021-22:09:42] [I] TensorRT version: 8001
[12/22/2021-22:09:43] [I] [TRT] [MemUsageChange] Init CUDA: CPU +353, GPU +0, now: CPU 371, GPU 17815 (MiB)
[12/22/2021-22:09:43] [I] Start parsing network model
[12/22/2021-22:09:43] [I] [TRT] ----------------------------------------------------------------
[12/22/2021-22:09:43] [I] [TRT] Input filename:   reid-model.onnx
[12/22/2021-22:09:43] [I] [TRT] ONNX IR version:  0.0.6
[12/22/2021-22:09:43] [I] [TRT] Opset version:    11
[12/22/2021-22:09:43] [I] [TRT] Producer name:    tf2onnx
[12/22/2021-22:09:43] [I] [TRT] Producer version: 1.9.2
[12/22/2021-22:09:43] [I] [TRT] Domain:
[12/22/2021-22:09:43] [I] [TRT] Model version:    0
[12/22/2021-22:09:43] [I] [TRT] Doc string:
[12/22/2021-22:09:43] [I] [TRT] ----------------------------------------------------------------
[12/22/2021-22:09:43] [W] [TRT] onnx2trt_utils.cpp:364: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[12/22/2021-22:09:43] [W] [TRT] ShapedWeights.cpp:173: Weights const_fold_opt__18 has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
[12/22/2021-22:09:43] [W] [TRT] ShapedWeights.cpp:173: Weights const_fold_opt__16 has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
[12/22/2021-22:09:43] [W] [TRT] ShapedWeights.cpp:173: Weights const_fold_opt__15 has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
[12/22/2021-22:09:43] [W] [TRT] ShapedWeights.cpp:173: Weights const_fold_opt__17 has been transposed with permutation of (1, 0)! If you plan on overwriting the weights with the Refitter API, the new weights must be pre-transposed.
[12/22/2021-22:09:43] [I] Finish parsing network model
[12/22/2021-22:09:43] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 377, GPU 17824 (MiB)
[12/22/2021-22:09:43] [W] Dynamic dimensions required for input: serving_default_input_1:0, but no shapes were provided. Automatically overriding shape to: 1x1024
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer const_fold_opt__18 is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] DLA only supports FP16 and Int8 precision type. Switching (Unnamed Layer* 1) [Shape] device type to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 2) [Constant] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] (Unnamed Layer* 3) [Concatenation]: DLA only supports concatenation on the C dimension.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 3) [Concatenation] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 4) [Constant] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 5) [Gather] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 6) [Shuffle] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] DLA only supports FP16 and Int8 precision type. Switching (Unnamed Layer* 8) [Shape] device type to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 9) [Constant] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 10) [Gather] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 11) [Shuffle] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer const_fold_opt__16 is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] DLA only supports FP16 and Int8 precision type. Switching (Unnamed Layer* 14) [Shape] device type to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 15) [Constant] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] (Unnamed Layer* 16) [Concatenation]: DLA only supports concatenation on the C dimension.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 16) [Concatenation] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 17) [Constant] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 18) [Gather] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 19) [Shuffle] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] DLA only supports FP16 and Int8 precision type. Switching (Unnamed Layer* 21) [Shape] device type to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 22) [Constant] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 23) [Gather] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 24) [Shuffle] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer const_fold_opt__15 is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] DLA only supports FP16 and Int8 precision type. Switching (Unnamed Layer* 27) [Shape] device type to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 28) [Constant] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] (Unnamed Layer* 29) [Concatenation]: DLA only supports concatenation on the C dimension.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 29) [Concatenation] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 30) [Constant] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 31) [Gather] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 32) [Shuffle] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] DLA only supports FP16 and Int8 precision type. Switching (Unnamed Layer* 34) [Shape] device type to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 35) [Constant] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 36) [Gather] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 37) [Shuffle] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer const_fold_opt__17 is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] DLA only supports FP16 and Int8 precision type. Switching (Unnamed Layer* 40) [Shape] device type to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 41) [Constant] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] (Unnamed Layer* 42) [Concatenation]: DLA only supports concatenation on the C dimension.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 42) [Concatenation] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 43) [Constant] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 44) [Gather] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 45) [Shuffle] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] DLA only supports FP16 and Int8 precision type. Switching (Unnamed Layer* 47) [Shape] device type to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 48) [Constant] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 49) [Gather] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [W] [TRT] Default DLA is enabled but layer (Unnamed Layer* 50) [Shuffle] is not supported on DLA, falling back to GPU.
[12/22/2021-22:09:43] [I] [TRT] [MemUsageSnapshot] Builder begin: CPU 377 MiB, GPU 17824 MiB
[12/22/2021-22:09:43] [W] [TRT] Input tensor has less than 4 dimensions for Relu__5. At least one shuffle layer will be inserted which cannot run on DLA.
[12/22/2021-22:09:43] [W] [TRT] Input tensor has less than 4 dimensions for Relu__8. At least one shuffle layer will be inserted which cannot run on DLA.
[12/22/2021-22:09:43] [W] [TRT] Input tensor has less than 4 dimensions for Relu__11. At least one shuffle layer will be inserted which cannot run on DLA.
[12/22/2021-22:09:43] [I] [TRT] ---------- Layers Running on DLA ----------
[12/22/2021-22:09:43] [I] [TRT] [DlaLayer] {ForeignNode[model_1/dense/MatMul;model_1/tf.nn.relu/Relu;model_1/dense/BiasAdd]}
[12/22/2021-22:09:43] [I] [TRT] [DlaLayer] {ForeignNode[Relu__5]}
[12/22/2021-22:09:43] [I] [TRT] [DlaLayer] {ForeignNode[model_1/batch_normalization/batchnorm/mul_1;model_1/batch_normalization/batchnorm/add_1;model_1/dense_1/MatMul;model_1/tf.nn.relu_1/Relu;model_1/dense_1/BiasAdd1]}
[12/22/2021-22:09:43] [I] [TRT] [DlaLayer] {ForeignNode[Relu__8]}
[12/22/2021-22:09:43] [I] [TRT] [DlaLayer] {ForeignNode[model_1/batch_normalization_1/batchnorm/mul_1;model_1/batch_normalization_1/batchnorm/add_1;model_1/dense_2/MatMul;model_1/tf.nn.relu_2/Relu;model_1/dense_2/BiasAdd1]}
[12/22/2021-22:09:43] [I] [TRT] [DlaLayer] {ForeignNode[Relu__11]}
[12/22/2021-22:09:43] [I] [TRT] [DlaLayer] {ForeignNode[StatefulPartitionedCall:0]}
[12/22/2021-22:09:43] [I] [TRT] ---------- Layers Running on GPU ----------
[12/22/2021-22:09:43] [I] [TRT] [GpuLayer] (Unnamed Layer* 6) [Shuffle]
[12/22/2021-22:09:43] [I] [TRT] [GpuLayer] (Unnamed Layer* 11) [Shuffle] + shuffle_model_1/dense/MatMul;model_1/tf.nn.relu/Relu;model_1/dense/BiasAdd
[12/22/2021-22:09:43] [I] [TRT] [GpuLayer] shuffle_Relu__5:0 + (Unnamed Layer* 19) [Shuffle]
[12/22/2021-22:09:43] [I] [TRT] [GpuLayer] (Unnamed Layer* 24) [Shuffle] + shuffle_model_1/batch_normalization/batchnorm/mul_1;model_1/batch_normalization/batchnorm/add_1;model_1/dense_1/MatMul;model_1/tf.nn.relu_1/Relu;model_1/dense_1/BiasAdd1
[12/22/2021-22:09:43] [I] [TRT] [GpuLayer] shuffle_Relu__8:0 + (Unnamed Layer* 32) [Shuffle]
[12/22/2021-22:09:43] [I] [TRT] [GpuLayer] (Unnamed Layer* 37) [Shuffle] + shuffle_model_1/batch_normalization_1/batchnorm/mul_1;model_1/batch_normalization_1/batchnorm/add_1;model_1/dense_2/MatMul;model_1/tf.nn.relu_2/Relu;model_1/dense_2/BiasAdd1
[12/22/2021-22:09:43] [I] [TRT] [GpuLayer] shuffle_Relu__11:0 + (Unnamed Layer* 45) [Shuffle]
[12/22/2021-22:09:43] [I] [TRT] [GpuLayer] (Unnamed Layer* 50) [Shuffle]
[12/22/2021-22:09:44] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +222, GPU +221, now: CPU 605, GPU 18051 (MiB)
[12/22/2021-22:09:45] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +307, GPU +307, now: CPU 912, GPU 18358 (MiB)
[12/22/2021-22:09:45] [W] [TRT] Detected invalid timing cache, setup a local cache instead
[12/22/2021-22:10:02] [W] [TRT] No implementation obeys reformatting-free rules, at least 2 reformatting nodes are needed, now picking the fastest path instead.
[12/22/2021-22:10:02] [I] [TRT] Detected 1 inputs and 1 output network tensors.
[12/22/2021-22:10:02] [I] [TRT] Total Host Persistent Memory: 5936
[12/22/2021-22:10:02] [I] [TRT] Total Device Persistent Memory: 0
[12/22/2021-22:10:02] [I] [TRT] Total Scratch Memory: 0
[12/22/2021-22:10:02] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1 MiB, GPU 0 MiB
[12/22/2021-22:10:02] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 918, GPU 18371 (MiB)
[12/22/2021-22:10:02] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 918, GPU 18371 (MiB)
[12/22/2021-22:10:02] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 918, GPU 18370 (MiB)
[12/22/2021-22:10:02] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 917, GPU 18370 (MiB)
[12/22/2021-22:10:02] [I] [TRT] [MemUsageSnapshot] Builder end: CPU 917 MiB, GPU 18370 MiB
[12/22/2021-22:10:02] [I] [TRT] Loaded engine size: 1 MB
[12/22/2021-22:10:02] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 917 MiB, GPU 18370 MiB
[12/22/2021-22:10:02] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 919, GPU 18370 (MiB)
[12/22/2021-22:10:02] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 919, GPU 18370 (MiB)
[12/22/2021-22:10:02] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 919, GPU 18370 (MiB)
[12/22/2021-22:10:02] [I] [TRT] [MemUsageSnapshot] deserializeCudaEngine end: CPU 919 MiB, GPU 18370 MiB
[12/22/2021-22:10:02] [I] Engine built in 20.3703 sec.
[12/22/2021-22:10:02] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation begin: CPU 912 MiB, GPU 18370 MiB
[12/22/2021-22:10:02] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 912, GPU 18370 (MiB)
[12/22/2021-22:10:02] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +0, now: CPU 912, GPU 18370 (MiB)
[12/22/2021-22:10:02] [I] [TRT] [MemUsageSnapshot] ExecutionContext creation end: CPU 914 MiB, GPU 18403 MiB
[12/22/2021-22:10:02] [I] Created input binding for serving_default_input_1:0 with dimensions 1x1024
[12/22/2021-22:10:02] [I] Created output binding for StatefulPartitionedCall:0 with dimensions 1x32
[12/22/2021-22:10:02] [I] Starting inference
[12/22/2021-22:10:05] [I] Warmup completed 70 queries over 200 ms
[12/22/2021-22:10:05] [I] Timing trace has 1273 queries over 3.00589 s
[12/22/2021-22:10:05] [I]
[12/22/2021-22:10:05] [I] === Trace details ===
[12/22/2021-22:10:05] [I] Trace averages of 10 runs:
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.27348 ms - Host latency: 2.28257 ms (end to end 2.29189 ms, enqueue 2.1757 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.26571 ms - Host latency: 2.27496 ms (end to end 2.28433 ms, enqueue 2.17351 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.26489 ms - Host latency: 2.27401 ms (end to end 2.28399 ms, enqueue 2.1328 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.28605 ms - Host latency: 2.29526 ms (end to end 2.30714 ms, enqueue 2.19991 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31249 ms - Host latency: 2.32162 ms (end to end 2.33328 ms, enqueue 2.20256 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.30769 ms - Host latency: 2.31691 ms (end to end 2.32847 ms, enqueue 2.16727 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31793 ms - Host latency: 2.32724 ms (end to end 2.33726 ms, enqueue 2.13403 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.34088 ms - Host latency: 2.35027 ms (end to end 2.36576 ms, enqueue 2.17764 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.36003 ms - Host latency: 2.36925 ms (end to end 2.38106 ms, enqueue 2.29192 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.35675 ms - Host latency: 2.36602 ms (end to end 2.378 ms, enqueue 2.23816 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.36338 ms - Host latency: 2.37245 ms (end to end 2.38325 ms, enqueue 2.25236 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31824 ms - Host latency: 2.32723 ms (end to end 2.33754 ms, enqueue 2.21618 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.39688 ms - Host latency: 2.41913 ms (end to end 2.42961 ms, enqueue 2.44243 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.32871 ms - Host latency: 2.33891 ms (end to end 2.34882 ms, enqueue 2.15481 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.30082 ms - Host latency: 2.31 ms (end to end 2.32145 ms, enqueue 2.1931 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.35027 ms - Host latency: 2.35955 ms (end to end 2.3707 ms, enqueue 2.16702 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.38275 ms - Host latency: 2.39311 ms (end to end 2.40414 ms, enqueue 2.41931 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.3368 ms - Host latency: 2.34605 ms (end to end 2.3588 ms, enqueue 2.10796 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.34354 ms - Host latency: 2.35276 ms (end to end 2.36295 ms, enqueue 2.37892 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31734 ms - Host latency: 2.348 ms (end to end 2.36008 ms, enqueue 2.15452 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.38828 ms - Host latency: 2.39753 ms (end to end 2.40912 ms, enqueue 2.34579 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.32244 ms - Host latency: 2.33174 ms (end to end 2.34432 ms, enqueue 2.05479 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.37075 ms - Host latency: 2.37989 ms (end to end 2.39123 ms, enqueue 2.33702 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.37292 ms - Host latency: 2.38209 ms (end to end 2.39303 ms, enqueue 2.18181 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.32653 ms - Host latency: 2.33561 ms (end to end 2.34722 ms, enqueue 2.31613 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.35685 ms - Host latency: 2.36608 ms (end to end 2.37784 ms, enqueue 2.26835 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.30207 ms - Host latency: 2.31113 ms (end to end 2.32198 ms, enqueue 2.13197 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31064 ms - Host latency: 2.31984 ms (end to end 2.33041 ms, enqueue 2.22122 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.33369 ms - Host latency: 2.34301 ms (end to end 2.35592 ms, enqueue 2.1929 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.33137 ms - Host latency: 2.34058 ms (end to end 2.35375 ms, enqueue 2.29441 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.37321 ms - Host latency: 2.3823 ms (end to end 2.39378 ms, enqueue 2.17401 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.2907 ms - Host latency: 2.29973 ms (end to end 2.3103 ms, enqueue 2.3211 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.32325 ms - Host latency: 2.33265 ms (end to end 2.34507 ms, enqueue 2.22189 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.24885 ms - Host latency: 2.25814 ms (end to end 2.27086 ms, enqueue 2.15769 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.4538 ms - Host latency: 2.4915 ms (end to end 2.50706 ms, enqueue 2.40946 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31415 ms - Host latency: 2.32332 ms (end to end 2.334 ms, enqueue 2.05829 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.40486 ms - Host latency: 2.41399 ms (end to end 2.42638 ms, enqueue 2.47235 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.33209 ms - Host latency: 2.34597 ms (end to end 2.35818 ms, enqueue 2.12206 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.34137 ms - Host latency: 2.35059 ms (end to end 2.36686 ms, enqueue 2.16597 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.30842 ms - Host latency: 2.3184 ms (end to end 2.33058 ms, enqueue 2.31343 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.33767 ms - Host latency: 2.3473 ms (end to end 2.35673 ms, enqueue 2.25785 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31724 ms - Host latency: 2.32646 ms (end to end 2.33721 ms, enqueue 2.20883 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.36208 ms - Host latency: 2.3712 ms (end to end 2.38467 ms, enqueue 2.10894 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.34012 ms - Host latency: 2.34934 ms (end to end 2.36072 ms, enqueue 2.25509 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.35192 ms - Host latency: 2.36111 ms (end to end 2.37212 ms, enqueue 2.27799 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.3073 ms - Host latency: 2.31655 ms (end to end 2.32825 ms, enqueue 2.25317 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.29608 ms - Host latency: 2.3054 ms (end to end 2.31742 ms, enqueue 2.18336 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.38372 ms - Host latency: 2.39287 ms (end to end 2.40497 ms, enqueue 2.15863 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.3423 ms - Host latency: 2.35139 ms (end to end 2.36399 ms, enqueue 2.34648 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.30349 ms - Host latency: 2.31271 ms (end to end 2.3222 ms, enqueue 2.28307 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31006 ms - Host latency: 2.31931 ms (end to end 2.33163 ms, enqueue 2.18951 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.3251 ms - Host latency: 2.33414 ms (end to end 2.34727 ms, enqueue 2.18853 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.32848 ms - Host latency: 2.33788 ms (end to end 2.34805 ms, enqueue 2.12893 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.33677 ms - Host latency: 2.34586 ms (end to end 2.35789 ms, enqueue 2.21041 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.3723 ms - Host latency: 2.38143 ms (end to end 2.3942 ms, enqueue 2.25264 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.34486 ms - Host latency: 2.35403 ms (end to end 2.36505 ms, enqueue 2.26473 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.32848 ms - Host latency: 2.33762 ms (end to end 2.34904 ms, enqueue 2.27322 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.33718 ms - Host latency: 2.34646 ms (end to end 2.35769 ms, enqueue 2.36376 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.33306 ms - Host latency: 2.3688 ms (end to end 2.38444 ms, enqueue 2.12144 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.35535 ms - Host latency: 2.36472 ms (end to end 2.37747 ms, enqueue 2.21721 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.34539 ms - Host latency: 2.35458 ms (end to end 2.36595 ms, enqueue 2.24717 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.32881 ms - Host latency: 2.33807 ms (end to end 2.35076 ms, enqueue 2.17253 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.32838 ms - Host latency: 2.33741 ms (end to end 2.34893 ms, enqueue 2.35585 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.35153 ms - Host latency: 2.36075 ms (end to end 2.37303 ms, enqueue 2.10829 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.3429 ms - Host latency: 2.35201 ms (end to end 2.365 ms, enqueue 2.30242 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.34271 ms - Host latency: 2.35189 ms (end to end 2.36594 ms, enqueue 2.20608 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.37054 ms - Host latency: 2.37948 ms (end to end 2.38887 ms, enqueue 2.2334 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.38756 ms - Host latency: 2.39648 ms (end to end 2.40627 ms, enqueue 2.28197 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.3634 ms - Host latency: 2.37246 ms (end to end 2.38303 ms, enqueue 2.27189 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.2917 ms - Host latency: 2.30084 ms (end to end 2.31158 ms, enqueue 2.3127 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31693 ms - Host latency: 2.32615 ms (end to end 2.33973 ms, enqueue 2.27629 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.33231 ms - Host latency: 2.34229 ms (end to end 2.35544 ms, enqueue 2.05042 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.35739 ms - Host latency: 2.36669 ms (end to end 2.38138 ms, enqueue 2.24302 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.30945 ms - Host latency: 2.31858 ms (end to end 2.33003 ms, enqueue 2.26772 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.35151 ms - Host latency: 2.36064 ms (end to end 2.37092 ms, enqueue 2.24977 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.39144 ms - Host latency: 2.4005 ms (end to end 2.41108 ms, enqueue 2.19995 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.33127 ms - Host latency: 2.34037 ms (end to end 2.35197 ms, enqueue 2.38295 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.34987 ms - Host latency: 2.35908 ms (end to end 2.37126 ms, enqueue 2.19183 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.3472 ms - Host latency: 2.35641 ms (end to end 2.37003 ms, enqueue 2.14624 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.28455 ms - Host latency: 2.2939 ms (end to end 2.30666 ms, enqueue 2.22813 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.30774 ms - Host latency: 2.31689 ms (end to end 2.32925 ms, enqueue 2.21499 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.36025 ms - Host latency: 2.36938 ms (end to end 2.3801 ms, enqueue 2.23103 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.33491 ms - Host latency: 2.34407 ms (end to end 2.35593 ms, enqueue 2.21255 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.37795 ms - Host latency: 2.38706 ms (end to end 2.39646 ms, enqueue 2.46699 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.36316 ms - Host latency: 2.37231 ms (end to end 2.38142 ms, enqueue 2.08936 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.38875 ms - Host latency: 2.39792 ms (end to end 2.40977 ms, enqueue 2.25276 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31003 ms - Host latency: 2.31926 ms (end to end 2.33042 ms, enqueue 2.23843 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.345 ms - Host latency: 2.35408 ms (end to end 2.36738 ms, enqueue 2.26812 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.35054 ms - Host latency: 2.35969 ms (end to end 2.37214 ms, enqueue 2.21011 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.37061 ms - Host latency: 2.37966 ms (end to end 2.39163 ms, enqueue 2.27795 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.30439 ms - Host latency: 2.31357 ms (end to end 2.32639 ms, enqueue 2.39697 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.48245 ms - Host latency: 2.50679 ms (end to end 2.52119 ms, enqueue 2.28511 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.29165 ms - Host latency: 2.3009 ms (end to end 2.3126 ms, enqueue 2.1375 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.28682 ms - Host latency: 2.2989 ms (end to end 2.30923 ms, enqueue 2.33728 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.32219 ms - Host latency: 2.37141 ms (end to end 2.38193 ms, enqueue 2.13503 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31218 ms - Host latency: 2.33203 ms (end to end 2.34836 ms, enqueue 2.2697 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.33315 ms - Host latency: 2.34236 ms (end to end 2.35369 ms, enqueue 2.15786 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.32822 ms - Host latency: 2.3374 ms (end to end 2.34919 ms, enqueue 2.14966 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.34231 ms - Host latency: 2.35134 ms (end to end 2.36245 ms, enqueue 2.36167 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.3446 ms - Host latency: 2.35393 ms (end to end 2.36838 ms, enqueue 2.1283 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.35466 ms - Host latency: 2.36382 ms (end to end 2.37588 ms, enqueue 2.2125 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.35486 ms - Host latency: 2.36387 ms (end to end 2.37405 ms, enqueue 2.41348 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.45825 ms - Host latency: 2.49775 ms (end to end 2.51001 ms, enqueue 2.34666 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.37017 ms - Host latency: 2.37947 ms (end to end 2.39299 ms, enqueue 2.31099 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.3563 ms - Host latency: 2.3655 ms (end to end 2.37668 ms, enqueue 2.054 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.29341 ms - Host latency: 2.30242 ms (end to end 2.31272 ms, enqueue 2.28892 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.37485 ms - Host latency: 2.38398 ms (end to end 2.39727 ms, enqueue 2.18545 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.3373 ms - Host latency: 2.34646 ms (end to end 2.35798 ms, enqueue 2.28132 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.34519 ms - Host latency: 2.3543 ms (end to end 2.36338 ms, enqueue 2.15588 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31042 ms - Host latency: 2.33672 ms (end to end 2.34783 ms, enqueue 2.34993 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.32483 ms - Host latency: 2.33396 ms (end to end 2.34661 ms, enqueue 2.24468 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31289 ms - Host latency: 2.32197 ms (end to end 2.33301 ms, enqueue 2.04485 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.27849 ms - Host latency: 2.28774 ms (end to end 2.299 ms, enqueue 2.25837 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31997 ms - Host latency: 2.32917 ms (end to end 2.3446 ms, enqueue 2.24377 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.3406 ms - Host latency: 2.34971 ms (end to end 2.36245 ms, enqueue 2.11245 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.34558 ms - Host latency: 2.35476 ms (end to end 2.36916 ms, enqueue 2.18943 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.32749 ms - Host latency: 2.33682 ms (end to end 2.34922 ms, enqueue 2.24592 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.36995 ms - Host latency: 2.37908 ms (end to end 2.38982 ms, enqueue 2.24321 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.3613 ms - Host latency: 2.37041 ms (end to end 2.38015 ms, enqueue 2.41277 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31982 ms - Host latency: 2.32898 ms (end to end 2.34031 ms, enqueue 2.18667 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31936 ms - Host latency: 2.32854 ms (end to end 2.34097 ms, enqueue 2.11768 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.352 ms - Host latency: 2.36118 ms (end to end 2.37268 ms, enqueue 2.29321 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.35925 ms - Host latency: 2.36834 ms (end to end 2.38032 ms, enqueue 2.2137 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.34602 ms - Host latency: 2.35522 ms (end to end 2.36707 ms, enqueue 2.22451 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.31621 ms - Host latency: 2.32534 ms (end to end 2.33892 ms, enqueue 2.33877 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.36208 ms - Host latency: 2.37134 ms (end to end 2.38416 ms, enqueue 2.20347 ms)
[12/22/2021-22:10:05] [I] Average on 10 runs - GPU latency: 2.36577 ms - Host latency: 2.37495 ms (end to end 2.38621 ms, enqueue 2.24563 ms)
[12/22/2021-22:10:05] [I]
[12/22/2021-22:10:05] [I] === Performance summary ===
[12/22/2021-22:10:05] [I] Throughput: 423.502 qps
[12/22/2021-22:10:05] [I] Latency: min = 2.14111 ms, max = 3.47217 ms, mean = 2.34943 ms, median = 2.34692 ms, percentile(99%) = 2.67383 ms
[12/22/2021-22:10:05] [I] End-to-End Host Latency: min = 2.14868 ms, max = 3.48657 ms, mean = 2.36126 ms, median = 2.35895 ms, percentile(99%) = 2.68726 ms
[12/22/2021-22:10:05] [I] Enqueue Time: min = 1.73706 ms, max = 3.93011 ms, mean = 2.23319 ms, median = 2.16205 ms, percentile(99%) = 3.42651 ms
[12/22/2021-22:10:05] [I] H2D Latency: min = 0.00610352 ms, max = 0.189331 ms, mean = 0.00797552 ms, median = 0.00634766 ms, percentile(99%) = 0.0721436 ms
[12/22/2021-22:10:05] [I] GPU Compute Time: min = 2.13208 ms, max = 3.3606 ms, mean = 2.33857 ms, median = 2.33679 ms, percentile(99%) = 2.61816 ms
[12/22/2021-22:10:05] [I] D2H Latency: min = 0.00244141 ms, max = 0.0639038 ms, mean = 0.00288704 ms, median = 0.00271606 ms, percentile(99%) = 0.003479 ms
[12/22/2021-22:10:05] [I] Total Host Walltime: 3.00589 s
[12/22/2021-22:10:05] [I] Total GPU Compute Time: 2.977 s
[12/22/2021-22:10:05] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[12/22/2021-22:10:05] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[12/22/2021-22:10:05] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/22/2021-22:10:05] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8001] # /usr/src/tensorrt/bin/trtexec --onnx=reid-model.onnx --useDLACore=0 --allowGPUFallback
[12/22/2021-22:10:06] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 914, GPU 18373 (MiB)

Thanks.

Hi,

In the original model there is no shuffle layer; they were inserted by TRT itself:

[12/22/2021-18:42:36] [W] [TRT] Input tensor has less than 4 dimensions for Relu__5. At least one shuffle layer will be inserted which cannot run on DLA.

And the warning looks odd: how can it expect 4 dimensions if the network is 1D plus a batch dimension? The basic model is very simple:

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Dropout, BatchNormalization
from tensorflow.keras.models import Model

def dense(units: int, x):
    # Dense -> ReLU -> BatchNorm block
    x = Dense(units)(x)
    x = tf.keras.activations.relu(x)
    return BatchNormalization()(x)

def create_reid_model():
    x0 = Input(1024)   # 1D feature vector plus the batch dimension

    x1 = dense(512, x0)
    x2 = dense(256, Dropout(0.50)(x1))
    x3 = dense(128, Dropout(0.50)(x2))

    xf = Dense(32, name="sub")(Dropout(0.50)(x3))

    return Model(inputs=x0, outputs=xf)

Moreover, simplification even further is not solving problem. On TRT 7 it’s working.
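
For reference, one way to see up front which layers DLA would accept is to query IBuilderConfig.can_run_on_DLA from the TensorRT Python API. Below is a minimal diagnostic sketch, assuming the TensorRT 8 Python bindings are available and reid-model.onnx is in the working directory:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def report_dla_support(onnx_path: str, dla_core: int = 0):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the ONNX file into a TensorRT network definition
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            return

    config = builder.create_builder_config()
    config.default_device_type = trt.DeviceType.DLA
    config.DLA_core = dla_core
    config.set_flag(trt.BuilderFlag.FP16)  # DLA executes in FP16/INT8, not FP32

    # Ask TensorRT, layer by layer, whether DLA can take the layer
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        target = "DLA" if config.can_run_on_DLA(layer) else "GPU only"
        print(f"{layer.name}: {target}")

report_dla_support("reid-model.onnx")

Note that the layer list here is the network as parsed from ONNX, so it typically already includes the shuffle layers the parser inserts around the 1D tensors.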

Hi,

Please note that DLA is a hardware-based inference engine.
It was designed around CNNs, so it may not be as flexible as the GPU.

We will check this with our internal team and share more information with you later.
Thanks.

Any updates? We are trying to squeeze every bit of computational capacity out of the Xavier, and a non-working DLA is quite a problem for us.

Hi,

Thanks for your patience.

Sorry, our team is still checking on this.
We will update here once we get a response.

Thanks.

Hi,

Do you have the TensorRT v7.1 output to share with us?

When testing the model with v7.1, we found it actually runs entirely on the GPU, even though allowGPUFallback is not specified.

Log of /usr/src/tensorrt/bin/trtexec --onnx=/home/nvidia/reid-model.onnx --useDLACore=0 --explicitBatch --verbose 
Wed Dec 29 00:28:59 2021

&&&& RUNNING TensorRT.trtexec # /usr/src/tensorrt/bin/trtexec --onnx=/home/nvidia/reid-model.onnx --useDLACore=0 --explicitBatch --verbose
[12/29/2021-00:28:59] [I] === Model Options ===
[12/29/2021-00:28:59] [I] Format: ONNX
[12/29/2021-00:28:59] [I] Model: /home/nvidia/reid-model.onnx
[12/29/2021-00:28:59] [I] Output:
...
[12/29/2021-00:28:59] [I] === System Options ===
[12/29/2021-00:28:59] [I] Device: 0
[12/29/2021-00:28:59] [I] DLACore: 0
[12/29/2021-00:28:59] [I] Plugins:
...
[12/29/2021-00:29:00] [V] [TRT] Graph construction and optimization completed in 0.00583441 seconds.
[12/29/2021-00:29:00] [I] [TRT] 
[12/29/2021-00:29:00] [I] [TRT] --------------- Layers running on DLA: 
[12/29/2021-00:29:00] [I] [TRT] 
[12/29/2021-00:29:00] [I] [TRT] --------------- Layers running on GPU: 
[12/29/2021-00:29:00] [I] [TRT] (Unnamed Layer* 6) [Shuffle], model_1/dense/MatMul;model_1/tf.nn.relu/Relu;model_1/dense/BiasAdd + Relu__5, squeeze_after_Relu__5 + (Unnamed Layer* 19) [Shuffle], model_1/batch_normalization/batchnorm/mul_1;model_1/batch_normalization/batchnorm/add_1;model_1/dense_1/MatMul;model_1/tf.nn.relu_1/Relu;model_1/dense_1/BiasAdd1 + Relu__8, squeeze_after_Relu__8 + (Unnamed Layer* 32) [Shuffle], model_1/batch_normalization_1/batchnorm/mul_1;model_1/batch_normalization_1/batchnorm/add_1;model_1/dense_2/MatMul;model_1/tf.nn.relu_2/Relu;model_1/dense_2/BiasAdd1 + Relu__11, squeeze_after_Relu__11 + (Unnamed Layer* 45) [Shuffle], StatefulPartitionedCall:0, (Unnamed Layer* 50) [Shuffle], 
[12/29/2021-00:29:02] [V] [TRT] Constructing optimization profile number 0 [1/1].
...

So this is not a regression issue.
This model needs GPU support to run inference on Jetson.

You will get better support for this on DLA in a future release.
For now, please enable the allowGPUFallback option to deploy the model.

Thanks.
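
If the engine is built from Python rather than with trtexec, a rough equivalent of --useDLACore=0 --allowGPUFallback is to set the DLA device type plus the GPU_FALLBACK builder flag. This is a sketch only, assuming the TensorRT 8 Python bindings and the same reid-model.onnx:

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
with open("reid-model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.default_device_type = trt.DeviceType.DLA   # prefer DLA for every layer
config.DLA_core = 0
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)     # unsupported layers fall back to GPU
config.set_flag(trt.BuilderFlag.FP16)             # DLA executes in FP16/INT8

serialized = builder.build_serialized_network(network, config)  # TensorRT 8 API
with open("reid-model.engine", "wb") as f:
    f.write(serialized)

For this particular network one would expect the same outcome as the trtexec run above: the layers fall back to the GPU, but the build succeeds.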

The output is the same as yours, so the problem is the unsupported layers; I didn't notice that the model was executed entirely on the GPU.

Thanks for the help. However, I'm really surprised that such a trivial model (sequential, dense layers only, ReLU activation and batch norm) fails on DLA.

Hi,

Thanks for confirming this.

DLA was originally designed for CNNs.
That's why it expects the input to be a 4-dimensional buffer.

Thanks.
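
For what it is worth, one possible way to make such a head DLA-friendly is to keep the tensors 4-dimensional and express the Dense layers as 1x1 convolutions, which DLA does support. This is a sketch of the idea only, not validated on DLA; the input shape and layer names are illustrative:

import tensorflow as tf
from tensorflow.keras.layers import Input, Conv2D, Dropout, BatchNormalization
from tensorflow.keras.models import Model

def dense_as_conv(filters: int, x):
    # A 1x1 convolution over a (1, 1, C) tensor is mathematically a Dense layer
    x = Conv2D(filters, kernel_size=1)(x)
    x = tf.keras.activations.relu(x)
    return BatchNormalization()(x)

def create_reid_model_4d():
    x0 = Input((1, 1, 1024))            # keep the data 4D: (batch, 1, 1, 1024)
    x1 = dense_as_conv(512, x0)
    x2 = dense_as_conv(256, Dropout(0.50)(x1))
    x3 = dense_as_conv(128, Dropout(0.50)(x2))
    xf = Conv2D(32, kernel_size=1, name="sub")(Dropout(0.50)(x3))
    return Model(inputs=x0, outputs=xf)

Whether the exported ONNX then maps cleanly onto DLA still depends on how the converter lays out the transposes, so this is a starting point rather than a guaranteed fix.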
