ERROR: [TRT]: 10: Could not find any implementation for node /0/model.24/Expand

Description

I can’t generate an engine using TensorRT 8.6.2 in the Docker image nvcr.io/nvidia/deepstream:6.4-triton-multiarch.

The issue may be similar to: Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[668...Mul_497]}) or Error Code 10: Internal Error (Could not find any implementation for node PWN(/model.0/act/Sigmoid)).
This bug occurs only on the Jetson platform; exactly the same engine builds without any problem on an x86 machine.

The same engine build fails on an older image, nvcr.io/nvidia/deepstream-l4t:6.2-samples.

Environment

Device: NVIDIA Jetson AGX Orin Developer kit
Host system: Jetpack 6.0 DP [L4T 36.2.0]
Baremetal or Container (if container which image + tag): Container nvcr.io/nvidia/deepstream:6.4-triton-multiarch
TensorRT Version: 8.6.2

Steps To Reproduce

  1. Start the container: docker run --gpus=all -it --rm -v ./:/workspace nvcr.io/nvidia/deepstream:6.4-triton-multiarch bash
  2. Install dependencies such as onnx (e.g. pip install onnx)
  3. Export the model to an ONNX file (see the sketch after this list)
  4. Run trtexec --onnx=best.onnx --verbose
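
For steps 2–3, a minimal export sketch in Python, assuming the model is a PyTorch checkpoint; best.pt, the checkpoint dict layout, and the 1x3x640x640 input shape are placeholders for whatever the real model uses:

import torch
import onnx

# Placeholder checkpoint layout -- adapt to the actual model class/format.
model = torch.load("best.pt", map_location="cpu")["model"].float().eval()
dummy = torch.zeros(1, 3, 640, 640)  # placeholder input shape

torch.onnx.export(model, dummy, "best.onnx", opset_version=12,
                  input_names=["images"], output_names=["output"])

# Sanity-check the exported graph before handing it to trtexec.
onnx.checker.check_model(onnx.load("best.onnx"))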

Results

Config Info from TensorRT

$ trtexec --onnx=best.onnx --verbose
&&&& RUNNING TensorRT.trtexec [TensorRT v8602] # trtexec --onnx=best.onnx --verbose
[02/09/2024-10:22:42] [I] === Model Options ===
[02/09/2024-10:22:42] [I] Format: ONNX
[02/09/2024-10:22:42] [I] Model: best.onnx
[02/09/2024-10:22:42] [I] Output:
[02/09/2024-10:22:42] [I] === Build Options ===
[02/09/2024-10:22:42] [I] Max batch: explicit batch
[02/09/2024-10:22:42] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[02/09/2024-10:22:42] [I] minTiming: 1
[02/09/2024-10:22:42] [I] avgTiming: 8
[02/09/2024-10:22:42] [I] Precision: FP32
[02/09/2024-10:22:42] [I] LayerPrecisions: 
[02/09/2024-10:22:42] [I] Layer Device Types: 
[02/09/2024-10:22:42] [I] Calibration: 
[02/09/2024-10:22:42] [I] Refit: Disabled
[02/09/2024-10:22:42] [I] Version Compatible: Disabled
[02/09/2024-10:22:42] [I] ONNX Native InstanceNorm: Disabled
[02/09/2024-10:22:42] [I] TensorRT runtime: full
[02/09/2024-10:22:42] [I] Lean DLL Path: 
[02/09/2024-10:22:42] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[02/09/2024-10:22:42] [I] Exclude Lean Runtime: Disabled
[02/09/2024-10:22:42] [I] Sparsity: Disabled
[02/09/2024-10:22:42] [I] Safe mode: Disabled
[02/09/2024-10:22:42] [I] Build DLA standalone loadable: Disabled
[02/09/2024-10:22:42] [I] Allow GPU fallback for DLA: Disabled
[02/09/2024-10:22:42] [I] DirectIO mode: Disabled
[02/09/2024-10:22:42] [I] Restricted mode: Disabled
[02/09/2024-10:22:42] [I] Skip inference: Disabled
[02/09/2024-10:22:42] [I] Save engine: 
[02/09/2024-10:22:42] [I] Load engine: 
[02/09/2024-10:22:42] [I] Profiling verbosity: 0
[02/09/2024-10:22:42] [I] Tactic sources: Using default tactic sources
[02/09/2024-10:22:42] [I] timingCacheMode: local
[02/09/2024-10:22:42] [I] timingCacheFile: 
[02/09/2024-10:22:42] [I] Heuristic: Disabled
[02/09/2024-10:22:42] [I] Preview Features: Use default preview flags.
[02/09/2024-10:22:42] [I] MaxAuxStreams: -1
[02/09/2024-10:22:42] [I] BuilderOptimizationLevel: -1
[02/09/2024-10:22:42] [I] Input(s)s format: fp32:CHW
[02/09/2024-10:22:42] [I] Output(s)s format: fp32:CHW
[02/09/2024-10:22:42] [I] Input build shapes: model
[02/09/2024-10:22:42] [I] Input calibration shapes: model
[02/09/2024-10:22:42] [I] === System Options ===
[02/09/2024-10:22:42] [I] Device: 0
[02/09/2024-10:22:42] [I] DLACore: 
[02/09/2024-10:22:42] [I] Plugins:
[02/09/2024-10:22:42] [I] setPluginsToSerialize:
[02/09/2024-10:22:42] [I] dynamicPlugins:
[02/09/2024-10:22:42] [I] ignoreParsedPluginLibs: 0
[02/09/2024-10:22:42] [I] 
[02/09/2024-10:22:42] [I] === Inference Options ===
[02/09/2024-10:22:42] [I] Batch: Explicit
[02/09/2024-10:22:42] [I] Input inference shapes: model
[02/09/2024-10:22:42] [I] Iterations: 10
[02/09/2024-10:22:42] [I] Duration: 3s (+ 200ms warm up)
[02/09/2024-10:22:42] [I] Sleep time: 0ms
[02/09/2024-10:22:42] [I] Idle time: 0ms
[02/09/2024-10:22:42] [I] Inference Streams: 1
[02/09/2024-10:22:42] [I] ExposeDMA: Disabled
[02/09/2024-10:22:42] [I] Data transfers: Enabled
[02/09/2024-10:22:42] [I] Spin-wait: Disabled
[02/09/2024-10:22:42] [I] Multithreading: Disabled
[02/09/2024-10:22:42] [I] CUDA Graph: Disabled
[02/09/2024-10:22:42] [I] Separate profiling: Disabled
[02/09/2024-10:22:42] [I] Time Deserialize: Disabled
[02/09/2024-10:22:42] [I] Time Refit: Disabled
[02/09/2024-10:22:42] [I] NVTX verbosity: 0
[02/09/2024-10:22:42] [I] Persistent Cache Ratio: 0
[02/09/2024-10:22:42] [I] Inputs:
[02/09/2024-10:22:42] [I] === Reporting Options ===
[02/09/2024-10:22:42] [I] Verbose: Enabled
[02/09/2024-10:22:42] [I] Averages: 10 inferences
[02/09/2024-10:22:42] [I] Percentiles: 90,95,99
[02/09/2024-10:22:42] [I] Dump refittable layers:Disabled
[02/09/2024-10:22:42] [I] Dump output: Disabled
[02/09/2024-10:22:42] [I] Profile: Disabled
[02/09/2024-10:22:42] [I] Export timing to JSON file: 
[02/09/2024-10:22:42] [I] Export output to JSON file: 
[02/09/2024-10:22:42] [I] Export profile to JSON file: 
[02/09/2024-10:22:42] [I] 
[02/09/2024-10:22:42] [I] === Device Information ===
[02/09/2024-10:22:42] [I] Selected Device: Orin
[02/09/2024-10:22:42] [I] Compute Capability: 8.7
[02/09/2024-10:22:42] [I] SMs: 16
[02/09/2024-10:22:42] [I] Device Global Memory: 30697 MiB
[02/09/2024-10:22:42] [I] Shared Memory per SM: 164 KiB
[02/09/2024-10:22:42] [I] Memory Bus Width: 256 bits (ECC disabled)
[02/09/2024-10:22:42] [I] Application Compute Clock Rate: 1.3 GHz
[02/09/2024-10:22:42] [I] Application Memory Clock Rate: 0.816 GHz
[02/09/2024-10:22:42] [I] 
[02/09/2024-10:22:42] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[02/09/2024-10:22:42] [I] 
[02/09/2024-10:22:42] [I] TensorRT version: 8.6.2

The error I get:

[02/09/2024-10:39:13] [V] [TRT] =============== Computing costs for /0/model.26/m/m.1/cv2/conv/Conv
[02/09/2024-10:39:13] [V] [TRT] *************** Autotuning format combination: Float(512000,1600,40,1) -> Float(512000,1600,40,1) ***************
[02/09/2024-10:39:13] [V] [TRT] *************** Autotuning format combination: Float(512000,1,12800,320) -> Float(512000,1,12800,320) ***************
[02/09/2024-10:39:13] [V] [TRT] *************** Autotuning format combination: Float(128000,1:4,3200,80) -> Float(512000,1600,40,1) ***************
[02/09/2024-10:39:13] [V] [TRT] *************** Autotuning format combination: Float(128000,1:4,3200,80) -> Float(128000,1:4,3200,80) ***************
[02/09/2024-10:39:13] [V] [TRT] =============== Computing costs for /0/model.33/Expand
[02/09/2024-10:39:13] [V] [TRT] *************** Autotuning format combination: Float(1,1) -> Float(80,1) ***************
[02/09/2024-10:39:13] [V] [TRT] --------------- Timing Runner: /0/model.33/Expand (Padding[0x8000000c])
[02/09/2024-10:39:13] [V] [TRT] Padding has no valid tactics for this config, skipping
[02/09/2024-10:39:13] [V] [TRT] --------------- Timing Runner: /0/model.33/Expand (Slice[0x8000001b])
[02/09/2024-10:39:13] [V] [TRT] Skipping tactic 0x0000000000000000 due to exception cudaEventElapsedTime
[02/09/2024-10:39:13] [V] [TRT] /0/model.33/Expand (Slice[0x8000001b]) profiling completed in 0.0023289 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[02/09/2024-10:39:13] [V] [TRT] Deleting timing cache: 355 entries, served 1129 hits since creation.
[02/09/2024-10:39:13] [E] Error[10]: Could not find any implementation for node /0/model.33/Expand.
[02/09/2024-10:39:13] [E] Error[10]: [optimizer.cpp::computeCosts::3869] Error Code 10: Internal Error (Could not find any implementation for node /0/model.33/Expand.)
[02/09/2024-10:39:13] [E] Engine could not be created from network
[02/09/2024-10:39:13] [E] Building engine failed
[02/09/2024-10:39:13] [E] Failed to create engine from model or file.
[02/09/2024-10:39:13] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8602] # trtexec --onnx=best.onnx --verbose

Hi @pkot,
This might be a DeepStream issue; would you mind checking there?

Thanks

@AakankshaS

What do you mean by checking there? Should I open another topic in the DeepStream subforum? DeepStream SDK - NVIDIA Developer Forums