arkos
February 15, 2024, 7:34pm
1
Description
A clear and concise description of the bug or issue.
Environment
TensorRT Version :
GPU Type - Drive Orin AGX :
Nvidia Driver Version :
CUDA Version :
CUDNN Version :
Operating System + Version :
Python Version (if applicable) :
TensorFlow Version (if applicable) :
PyTorch Version (if applicable) :
Baremetal or Container (if container which image + tag) :
Relevant Files
I have generated the ONNX model using aten_fallback with opset 16. I see the following error; what could be the issue?
[02/15/2024-19:28:55] [E] Error[7]: [shapeMachine.cpp::executeContinuation::864] Error Code 7: Internal Error (IShuffleLayer /Reshape_12: reshaping failed for tensor: /GatherND_3_output_0 reshape would change volume 1516 to 24256 Instruction: RESHAPE{379 4} {379 64}.)
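The numbers in the message are self-consistent: the GatherND output really holds 379 × 4 = 1516 elements, while the requested target shape {379, 64} needs 379 × 64 = 24256, so no reshape can succeed. A minimal NumPy sketch of the same mismatch (shapes taken from the log above; NumPy stands in for TensorRT's shape machine here):

```python
import numpy as np

# Shapes from the error message: the tensor has 379*4 = 1516 elements,
# but the Reshape target {379, 64} would need 379*64 = 24256.
t = np.zeros((379, 4))
print(t.size)  # 1516

try:
    t.reshape(379, 64)  # the same volume mismatch TensorRT reports
except ValueError as err:
    print(err)  # NumPy also refuses to change the element count
```

This usually points at an upstream shape computation (often a data-dependent Gather/GatherND) feeding the Reshape, rather than at the Reshape itself.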
Steps To Reproduce
Please include:
Exact steps/commands to build your repro
Exact steps/commands to run your repro
Full traceback of errors encountered
Hi @arkos ,
Could you please share the ONNX model and detailed logs with us?
Thanks
arkos
February 16, 2024, 7:02am
3
ONNX 1 - the original ONNX model.
ONNX 2 - the ONNX model repaired using the following code:
import onnx
import onnx_graphsurgeon as gs

def onnx_allowzero(modelf="model_transposed_constant_folding_bsz1_a0_b0.onnx"):
    graph = gs.import_onnx(onnx.load(modelf))
    # Set allowzero=1 on every Reshape so a 0 in the target shape is
    # treated as a literal zero-sized dimension instead of copying the
    # corresponding input dimension.
    for node in graph.nodes:
        if node.op == "Reshape":
            node.attrs["allowzero"] = 1
    onnx.save(gs.export_onnx(graph), "repaired.onnx")
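For context, here is a rough sketch of what the allowzero attribute changes in ONNX Reshape semantics (opset ≥ 14), written in plain Python. This is an illustration, not the actual TensorRT importer logic; the function name and shapes are made up for the example:

```python
def resolve_reshape(input_shape, target, allowzero=0):
    """Resolve an ONNX Reshape target shape against an input shape."""
    out = []
    for i, d in enumerate(target):
        if d == 0 and not allowzero:
            out.append(input_shape[i])  # 0 copies the input dimension
        else:
            out.append(d)  # with allowzero=1, 0 stays a zero-sized dim
    if -1 in out:
        # -1 infers the remaining dimension from the total element count
        known = 1
        for d in out:
            if d != -1:
                known *= d
        total = 1
        for d in input_shape:
            total *= d
        out[out.index(-1)] = total // known
    return out

print(resolve_reshape([2, 3, 4], [0, 12]))               # [2, 12]
print(resolve_reshape([2, 3, 4], [0, 12], allowzero=1))  # [0, 12]
print(resolve_reshape([2, 3, 4], [2, -1]))               # [2, 12]
```

So flipping allowzero only changes how zeros in the shape tensor are interpreted; it cannot fix a target shape whose total element count simply disagrees with the input, which is consistent with the error persisting below.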
Error log from TensorRT:
nvidia@tegra-ubuntu:/usr/src/tensorrt/bin$ ./trtexec --onnx=/home/nvidia/models/repaired.onnx --saveEngine=model_transposed_constant_folding_bsz1_a0_b0_repaired.trt
&&&& RUNNING TensorRT.trtexec [TensorRT v8611] # ./trtexec --onnx=/home/nvidia/models/repaired.onnx --saveEngine=model_transposed_constant_folding_bsz1_a0_b0_repaired.trt
[02/16/2024-06:53:42] [I] === Model Options ===
[02/16/2024-06:53:42] [I] Format: ONNX
[02/16/2024-06:53:42] [I] Model: /home/nvidia/models/repaired.onnx
[02/16/2024-06:53:42] [I] Output:
[02/16/2024-06:53:42] [I] === Build Options ===
[02/16/2024-06:53:42] [I] Max batch: explicit batch
[02/16/2024-06:53:42] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[02/16/2024-06:53:42] [I] minTiming: 1
[02/16/2024-06:53:42] [I] avgTiming: 8
[02/16/2024-06:53:42] [I] Precision: FP32
[02/16/2024-06:53:42] [I] LayerPrecisions:
[02/16/2024-06:53:42] [I] Layer Device Types:
[02/16/2024-06:53:42] [I] Calibration:
[02/16/2024-06:53:42] [I] Refit: Disabled
[02/16/2024-06:53:42] [I] Version Compatible: Disabled
[02/16/2024-06:53:42] [I] TensorRT runtime: full
[02/16/2024-06:53:42] [I] Lean DLL Path:
[02/16/2024-06:53:42] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[02/16/2024-06:53:42] [I] Exclude Lean Runtime: Disabled
[02/16/2024-06:53:42] [I] Sparsity: Disabled
[02/16/2024-06:53:42] [I] Safe mode: Disabled
[02/16/2024-06:53:42] [I] Build DLA standalone loadable: Disabled
[02/16/2024-06:53:42] [I] Allow GPU fallback for DLA: Disabled
[02/16/2024-06:53:42] [I] DirectIO mode: Disabled
[02/16/2024-06:53:42] [I] Restricted mode: Disabled
[02/16/2024-06:53:42] [I] Skip inference: Disabled
[02/16/2024-06:53:42] [I] Save engine: model_transposed_constant_folding_bsz1_a0_b0_repaired.trt
[02/16/2024-06:53:42] [I] Load engine:
[02/16/2024-06:53:42] [I] Profiling verbosity: 0
[02/16/2024-06:53:42] [I] Tactic sources: Using default tactic sources
[02/16/2024-06:53:42] [I] timingCacheMode: local
[02/16/2024-06:53:42] [I] timingCacheFile:
[02/16/2024-06:53:42] [I] Heuristic: Disabled
[02/16/2024-06:53:42] [I] Preview Features: Use default preview flags.
[02/16/2024-06:53:42] [I] MaxAuxStreams: -1
[02/16/2024-06:53:42] [I] BuilderOptimizationLevel: -1
[02/16/2024-06:53:42] [I] Calibration Profile Index: 0
[02/16/2024-06:53:42] [I] Input(s)s format: fp32:CHW
[02/16/2024-06:53:42] [I] Output(s)s format: fp32:CHW
[02/16/2024-06:53:42] [I] Input build shapes: model
[02/16/2024-06:53:42] [I] Input calibration shapes: model
[02/16/2024-06:53:42] [I] === System Options ===
[02/16/2024-06:53:42] [I] Device: 0
[02/16/2024-06:53:42] [I] DLACore:
[02/16/2024-06:53:42] [I] Plugins:
[02/16/2024-06:53:42] [I] setPluginsToSerialize:
[02/16/2024-06:53:42] [I] dynamicPlugins:
[02/16/2024-06:53:42] [I] ignoreParsedPluginLibs: 0
[02/16/2024-06:53:42] [I]
[02/16/2024-06:53:42] [I] === Inference Options ===
[02/16/2024-06:53:42] [I] Batch: Explicit
[02/16/2024-06:53:42] [I] Input inference shapes: model
[02/16/2024-06:53:42] [I] Iterations: 10
[02/16/2024-06:53:42] [I] Duration: 3s (+ 200ms warm up)
[02/16/2024-06:53:42] [I] Sleep time: 0ms
[02/16/2024-06:53:42] [I] Idle time: 0ms
[02/16/2024-06:53:42] [I] Inference Streams: 1
[02/16/2024-06:53:42] [I] ExposeDMA: Disabled
[02/16/2024-06:53:42] [I] Data transfers: Enabled
[02/16/2024-06:53:42] [I] Spin-wait: Disabled
[02/16/2024-06:53:42] [I] Multithreading: Disabled
[02/16/2024-06:53:42] [I] CUDA Graph: Disabled
[02/16/2024-06:53:42] [I] Separate profiling: Disabled
[02/16/2024-06:53:42] [I] Time Deserialize: Disabled
[02/16/2024-06:53:42] [I] Time Refit: Disabled
[02/16/2024-06:53:42] [I] NVTX verbosity: 0
[02/16/2024-06:53:42] [I] Persistent Cache Ratio: 0
[02/16/2024-06:53:42] [I] Optimization Profile Index: 0
[02/16/2024-06:53:42] [I] Inputs:
[02/16/2024-06:53:42] [I] === Reporting Options ===
[02/16/2024-06:53:42] [I] Verbose: Disabled
[02/16/2024-06:53:42] [I] Averages: 10 inferences
[02/16/2024-06:53:42] [I] Percentiles: 90,95,99
[02/16/2024-06:53:42] [I] Dump refittable layers:Disabled
[02/16/2024-06:53:42] [I] Dump output: Disabled
[02/16/2024-06:53:42] [I] Profile: Disabled
[02/16/2024-06:53:42] [I] Export timing to JSON file:
[02/16/2024-06:53:42] [I] Export output to JSON file:
[02/16/2024-06:53:42] [I] Export profile to JSON file:
[02/16/2024-06:53:42] [I]
[02/16/2024-06:53:42] [I] === Device Information ===
[02/16/2024-06:53:42] [I] Selected Device: Orin
[02/16/2024-06:53:42] [I] Compute Capability: 8.7
[02/16/2024-06:53:42] [I] SMs: 16
[02/16/2024-06:53:42] [I] Device Global Memory: 28902 MiB
[02/16/2024-06:53:42] [I] Shared Memory per SM: 164 KiB
[02/16/2024-06:53:42] [I] Memory Bus Width: 128 bits (ECC disabled)
[02/16/2024-06:53:42] [I] Application Compute Clock Rate: 1.275 GHz
[02/16/2024-06:53:42] [I] Application Memory Clock Rate: 1.275 GHz
[02/16/2024-06:53:42] [I]
[02/16/2024-06:53:42] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[02/16/2024-06:53:42] [I]
[02/16/2024-06:53:42] [I] TensorRT version: 8.6.11
[02/16/2024-06:53:42] [I] Loading standard plugins
[02/16/2024-06:53:43] [I] [TRT] [MemUsageChange] Init CUDA: CPU +443, GPU +0, now: CPU 463, GPU 9089 (MiB)
[02/16/2024-06:53:45] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +834, GPU +991, now: CPU 1329, GPU 10106 (MiB)
[02/16/2024-06:53:45] [I] Start parsing network model.
[02/16/2024-06:53:45] [I] [TRT] ----------------------------------------------------------------
[02/16/2024-06:53:45] [I] [TRT] Input filename: /home/nvidia/models/repaired.onnx
[02/16/2024-06:53:45] [I] [TRT] ONNX IR version: 0.0.9
[02/16/2024-06:53:45] [I] [TRT] Opset version: 16
[02/16/2024-06:53:45] [I] [TRT] Producer name: pytorch
[02/16/2024-06:53:45] [I] [TRT] Producer version: 2.1.1
[02/16/2024-06:53:45] [I] [TRT] Domain:
[02/16/2024-06:53:45] [I] [TRT] Model version: 0
[02/16/2024-06:53:45] [I] [TRT] Doc string:
[02/16/2024-06:53:45] [I] [TRT] ----------------------------------------------------------------
[02/16/2024-06:53:45] [W] [TRT] onnx2trt_utils.cpp:372: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[02/16/2024-06:53:45] [W] [TRT] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[02/16/2024-06:53:45] [W] [TRT] builtin_op_importers.cpp:4875: TensorRT is using FLOAT32 precision to run an INT32 TopK. Rounding errors may occur for large integer values
[02/16/2024-06:53:45] [I] Finished parsing network model. Parse time: 0.145678
[02/16/2024-06:53:45] [I] [TRT] Graph optimization time: 0.122808 seconds.
[02/16/2024-06:53:45] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[02/16/2024-06:57:11] [I] [TRT] Detected 6 inputs and 1 output network tensors.
[02/16/2024-06:57:18] [I] [TRT] Total Host Persistent Memory: 693776
[02/16/2024-06:57:18] [I] [TRT] Total Device Persistent Memory: 0
[02/16/2024-06:57:18] [I] [TRT] Total Scratch Memory: 17487360
[02/16/2024-06:57:18] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 9 MiB, GPU 96 MiB
[02/16/2024-06:57:18] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 297 steps to complete.
[02/16/2024-06:57:18] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 44.0758ms to assign 8 blocks to 297 nodes requiring 56313344 bytes.
[02/16/2024-06:57:18] [I] [TRT] Total Activation Memory: 56312320
[02/16/2024-06:57:18] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +64, now: CPU 0, GPU 64 (MiB)
[02/16/2024-06:57:18] [I] Engine built in 216.186 sec.
[02/16/2024-06:57:19] [I] [TRT] Loaded engine size: 50 MiB
[02/16/2024-06:57:19] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +48, now: CPU 0, GPU 48 (MiB)
[02/16/2024-06:57:19] [I] Engine deserialized in 0.0500841 sec.
[02/16/2024-06:57:19] [I] [TRT] [MS] Running engine with multi stream info
[02/16/2024-06:57:19] [I] [TRT] [MS] Number of aux streams is 2
[02/16/2024-06:57:19] [I] [TRT] [MS] Number of total worker streams is 3
[02/16/2024-06:57:19] [I] [TRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[02/16/2024-06:57:19] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +54, now: CPU 1, GPU 102 (MiB)
[02/16/2024-06:57:19] [I] Setting persistentCacheLimit to 0 bytes.
[02/16/2024-06:57:19] [I] Using random values for input input_0
[02/16/2024-06:57:19] [I] Input binding for input_0 with dimensions 1x6x3x128x352 is created.
[02/16/2024-06:57:19] [I] Using random values for input input_1
[02/16/2024-06:57:19] [I] Input binding for input_1 with dimensions 1x6x3x3 is created.
[02/16/2024-06:57:19] [I] Using random values for input input_2
[02/16/2024-06:57:19] [I] Input binding for input_2 with dimensions 1x6x3 is created.
[02/16/2024-06:57:19] [I] Using random values for input input_5
[02/16/2024-06:57:19] [I] Input binding for input_5 with dimensions 1x6x3 is created.
[02/16/2024-06:57:19] [I] Using random values for input input_6
[02/16/2024-06:57:19] [I] Input binding for input_6 with dimensions 1x6x3x3 is created.
[02/16/2024-06:57:19] [I] Using random values for input input_7
[02/16/2024-06:57:19] [I] Input binding for input_7 with dimensions 1x6x3x3 is created.
[02/16/2024-06:57:19] [I] Output binding for output with dimensions 1x2x104x104 is created.
[02/16/2024-06:57:19] [I] Starting inference
[02/16/2024-06:57:19] [E] Error[7]: [shapeMachine.cpp::executeContinuation::864] Error Code 7: Internal Error (IShuffleLayer /Reshape_12: reshaping failed for tensor: /GatherND_3_output_0 reshape would change volume 1516 to 24256 Instruction: RESHAPE{379 4} {379 64}.)
[02/16/2024-06:57:19] [E] Error occurred during inference
&&&& FAILED TensorRT.trtexec [TensorRT v8611] # ./trtexec --onnx=/home/nvidia/models/repaired.onnx --saveEngine=model_transposed_constant_folding_bsz1_a0_b0_repaired.trt
ONNX 1 -
model_transposed_constant_folding_bsz1_a0_b0 (copy).txt (52.0 MB)
ONNX 2 -
repaired.txt (48.5 MB)