ERROR: [TRT]: 10: Could not find any implementation for node /0/model.24/Expand

Description

I can’t generate an engine using TensorRT 8.6.2 in the Docker image nvcr.io/nvidia/deepstream:6.4-triton-multiarch.

The issue may be similar to: Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[668...Mul_497]) or Error Code 10: Internal Error (Could not find any implementation for node PWN(/model.0/act/Sigmoid).)
This bug exists only on the Jetson platform; exactly the same engine can be built without any problem on an x86 machine.

The same engine build fails on an older image, nvcr.io/nvidia/deepstream-l4t:6.2-samples.

Environment

Device: NVIDIA Jetson AGX Orin Developer kit
Host system: Jetpack 6.0 DP [L4T 36.2.0]
Baremetal or Container (if container, which image + tag): Container nvcr.io/nvidia/deepstream:6.4-triton-multiarch
TensorRT Version: 8.6.2

Steps To Reproduce

  1. Start the container: docker run --gpus=all -it --rm -v ./:/workspace nvcr.io/nvidia/deepstream:6.4-triton-multiarch bash
  2. Install dependencies such as onnx
  3. Export the model to an ONNX file (a consolidated sketch follows this list)
  4. Run trtexec --onnx=best.onnx --verbose
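For reference, here is the whole reproduction as one shell session. The export step is a placeholder: best.onnx in my case comes from a YOLOv5 export (see the commands later in this thread), but presumably any ONNX containing the affected Expand node behaves the same.

# on the Jetson host: start the DeepStream 6.4 container
docker run --gpus=all -it --rm -v ./:/workspace nvcr.io/nvidia/deepstream:6.4-triton-multiarch bash
# inside the container: install the ONNX tooling
pip3 install onnx onnxsim onnxruntime
# export the model to best.onnx here (placeholder step; the YOLOv5 variant is shown below)
# then try to build the engine
trtexec --onnx=best.onnx --verbose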

Results

Config Info from TensorRT

$ trtexec --onnx=best.onnx --verbose
&&&& RUNNING TensorRT.trtexec [TensorRT v8602] # trtexec --onnx=best.onnx --verbose
[02/09/2024-10:22:42] [I] === Model Options ===
[02/09/2024-10:22:42] [I] Format: ONNX
[02/09/2024-10:22:42] [I] Model: best.onnx
[02/09/2024-10:22:42] [I] Output:
[02/09/2024-10:22:42] [I] === Build Options ===
[02/09/2024-10:22:42] [I] Max batch: explicit batch
[02/09/2024-10:22:42] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[02/09/2024-10:22:42] [I] minTiming: 1
[02/09/2024-10:22:42] [I] avgTiming: 8
[02/09/2024-10:22:42] [I] Precision: FP32
[02/09/2024-10:22:42] [I] LayerPrecisions: 
[02/09/2024-10:22:42] [I] Layer Device Types: 
[02/09/2024-10:22:42] [I] Calibration: 
[02/09/2024-10:22:42] [I] Refit: Disabled
[02/09/2024-10:22:42] [I] Version Compatible: Disabled
[02/09/2024-10:22:42] [I] ONNX Native InstanceNorm: Disabled
[02/09/2024-10:22:42] [I] TensorRT runtime: full
[02/09/2024-10:22:42] [I] Lean DLL Path: 
[02/09/2024-10:22:42] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[02/09/2024-10:22:42] [I] Exclude Lean Runtime: Disabled
[02/09/2024-10:22:42] [I] Sparsity: Disabled
[02/09/2024-10:22:42] [I] Safe mode: Disabled
[02/09/2024-10:22:42] [I] Build DLA standalone loadable: Disabled
[02/09/2024-10:22:42] [I] Allow GPU fallback for DLA: Disabled
[02/09/2024-10:22:42] [I] DirectIO mode: Disabled
[02/09/2024-10:22:42] [I] Restricted mode: Disabled
[02/09/2024-10:22:42] [I] Skip inference: Disabled
[02/09/2024-10:22:42] [I] Save engine: 
[02/09/2024-10:22:42] [I] Load engine: 
[02/09/2024-10:22:42] [I] Profiling verbosity: 0
[02/09/2024-10:22:42] [I] Tactic sources: Using default tactic sources
[02/09/2024-10:22:42] [I] timingCacheMode: local
[02/09/2024-10:22:42] [I] timingCacheFile: 
[02/09/2024-10:22:42] [I] Heuristic: Disabled
[02/09/2024-10:22:42] [I] Preview Features: Use default preview flags.
[02/09/2024-10:22:42] [I] MaxAuxStreams: -1
[02/09/2024-10:22:42] [I] BuilderOptimizationLevel: -1
[02/09/2024-10:22:42] [I] Input(s)s format: fp32:CHW
[02/09/2024-10:22:42] [I] Output(s)s format: fp32:CHW
[02/09/2024-10:22:42] [I] Input build shapes: model
[02/09/2024-10:22:42] [I] Input calibration shapes: model
[02/09/2024-10:22:42] [I] === System Options ===
[02/09/2024-10:22:42] [I] Device: 0
[02/09/2024-10:22:42] [I] DLACore: 
[02/09/2024-10:22:42] [I] Plugins:
[02/09/2024-10:22:42] [I] setPluginsToSerialize:
[02/09/2024-10:22:42] [I] dynamicPlugins:
[02/09/2024-10:22:42] [I] ignoreParsedPluginLibs: 0
[02/09/2024-10:22:42] [I] 
[02/09/2024-10:22:42] [I] === Inference Options ===
[02/09/2024-10:22:42] [I] Batch: Explicit
[02/09/2024-10:22:42] [I] Input inference shapes: model
[02/09/2024-10:22:42] [I] Iterations: 10
[02/09/2024-10:22:42] [I] Duration: 3s (+ 200ms warm up)
[02/09/2024-10:22:42] [I] Sleep time: 0ms
[02/09/2024-10:22:42] [I] Idle time: 0ms
[02/09/2024-10:22:42] [I] Inference Streams: 1
[02/09/2024-10:22:42] [I] ExposeDMA: Disabled
[02/09/2024-10:22:42] [I] Data transfers: Enabled
[02/09/2024-10:22:42] [I] Spin-wait: Disabled
[02/09/2024-10:22:42] [I] Multithreading: Disabled
[02/09/2024-10:22:42] [I] CUDA Graph: Disabled
[02/09/2024-10:22:42] [I] Separate profiling: Disabled
[02/09/2024-10:22:42] [I] Time Deserialize: Disabled
[02/09/2024-10:22:42] [I] Time Refit: Disabled
[02/09/2024-10:22:42] [I] NVTX verbosity: 0
[02/09/2024-10:22:42] [I] Persistent Cache Ratio: 0
[02/09/2024-10:22:42] [I] Inputs:
[02/09/2024-10:22:42] [I] === Reporting Options ===
[02/09/2024-10:22:42] [I] Verbose: Enabled
[02/09/2024-10:22:42] [I] Averages: 10 inferences
[02/09/2024-10:22:42] [I] Percentiles: 90,95,99
[02/09/2024-10:22:42] [I] Dump refittable layers:Disabled
[02/09/2024-10:22:42] [I] Dump output: Disabled
[02/09/2024-10:22:42] [I] Profile: Disabled
[02/09/2024-10:22:42] [I] Export timing to JSON file: 
[02/09/2024-10:22:42] [I] Export output to JSON file: 
[02/09/2024-10:22:42] [I] Export profile to JSON file: 
[02/09/2024-10:22:42] [I] 
[02/09/2024-10:22:42] [I] === Device Information ===
[02/09/2024-10:22:42] [I] Selected Device: Orin
[02/09/2024-10:22:42] [I] Compute Capability: 8.7
[02/09/2024-10:22:42] [I] SMs: 16
[02/09/2024-10:22:42] [I] Device Global Memory: 30697 MiB
[02/09/2024-10:22:42] [I] Shared Memory per SM: 164 KiB
[02/09/2024-10:22:42] [I] Memory Bus Width: 256 bits (ECC disabled)
[02/09/2024-10:22:42] [I] Application Compute Clock Rate: 1.3 GHz
[02/09/2024-10:22:42] [I] Application Memory Clock Rate: 0.816 GHz
[02/09/2024-10:22:42] [I] 
[02/09/2024-10:22:42] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[02/09/2024-10:22:42] [I] 
[02/09/2024-10:22:42] [I] TensorRT version: 8.6.2

The error I get:

[02/09/2024-10:39:13] [V] [TRT] =============== Computing costs for /0/model.26/m/m.1/cv2/conv/Conv
[02/09/2024-10:39:13] [V] [TRT] *************** Autotuning format combination: Float(512000,1600,40,1) -> Float(512000,1600,40,1) ***************
[02/09/2024-10:39:13] [V] [TRT] *************** Autotuning format combination: Float(512000,1,12800,320) -> Float(512000,1,12800,320) ***************
[02/09/2024-10:39:13] [V] [TRT] *************** Autotuning format combination: Float(128000,1:4,3200,80) -> Float(512000,1600,40,1) ***************
[02/09/2024-10:39:13] [V] [TRT] *************** Autotuning format combination: Float(128000,1:4,3200,80) -> Float(128000,1:4,3200,80) ***************
[02/09/2024-10:39:13] [V] [TRT] =============== Computing costs for /0/model.33/Expand
[02/09/2024-10:39:13] [V] [TRT] *************** Autotuning format combination: Float(1,1) -> Float(80,1) ***************
[02/09/2024-10:39:13] [V] [TRT] --------------- Timing Runner: /0/model.33/Expand (Padding[0x8000000c])
[02/09/2024-10:39:13] [V] [TRT] Padding has no valid tactics for this config, skipping
[02/09/2024-10:39:13] [V] [TRT] --------------- Timing Runner: /0/model.33/Expand (Slice[0x8000001b])
[02/09/2024-10:39:13] [V] [TRT] Skipping tactic 0x0000000000000000 due to exception cudaEventElapsedTime
[02/09/2024-10:39:13] [V] [TRT] /0/model.33/Expand (Slice[0x8000001b]) profiling completed in 0.0023289 seconds. Fastest Tactic: 0xd15ea5edd15ea5ed Time: inf
[02/09/2024-10:39:13] [V] [TRT] Deleting timing cache: 355 entries, served 1129 hits since creation.
[02/09/2024-10:39:13] [E] Error[10]: Could not find any implementation for node /0/model.33/Expand.
[02/09/2024-10:39:13] [E] Error[10]: [optimizer.cpp::computeCosts::3869] Error Code 10: Internal Error (Could not find any implementation for node /0/model.33/Expand.)
[02/09/2024-10:39:13] [E] Engine could not be created from network
[02/09/2024-10:39:13] [E] Building engine failed
[02/09/2024-10:39:13] [E] Failed to create engine from model or file.
[02/09/2024-10:39:13] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8602] # trtexec --onnx=best.onnx --verbose
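One detail from the verbose log: the Slice tactic for the Expand node is skipped "due to exception cudaEventElapsedTime", i.e. the kernel timing itself throws rather than the tactic being unsupported. As a sanity check (an assumption on my part, not a known fix), locking the clocks on the host rules out power-state effects on that timing path:

# on the Jetson host, before building the engine
sudo nvpmodel -m 0    # MAXN power mode on the AGX Orin devkit
sudo jetson_clocks    # pin clocks to their maximum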

Hi @pkot,
This might be a DeepStream issue; would you mind checking there?

Thanks

@AakankshaS

What do you mean by checking there? Should I open another topic in the DeepStream subforum (DeepStream SDK - NVIDIA Developer Forums)?

The parameters for starting Docker seem to have issues; try the following command line:

docker run -it --rm --net=host --runtime nvidia  -e DISPLAY=$DISPLAY -w /opt/nvidia/deepstream/deepstream-6.4 -v /tmp/.X11-unix/:/tmp/.X11-unix nvcr.io/nvidia/deepstream:6.4-triton-multiarch
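On Jetson the container should be started with --runtime nvidia rather than --gpus=all, which is presumably what this suggestion fixes. A quick way to confirm the container sees the right stack (a generic check, assuming the image ships TensorRT as Debian packages):

# inside the container
dpkg -l | grep -i tensorrt            # should list the TensorRT 8.6.x packages
trtexec --onnx=best.onnx --verbose    # retry the build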

It did not work.

I launched Docker:

export DISPLAY=:1
xhost +local:
docker run -it --rm --net=host --runtime nvidia  -e DISPLAY=$DISPLAY -w /opt/nvidia/deepstream/deepstream-6.4 -v /tmp/.X11-unix/:/tmp/.X11-unix nvcr.io/nvidia/deepstream:6.4-triton-multiarch

And went through these commands:

# install kmod (provides modprobe)
apt install -y kmod
# fetch the DeepStream-Yolo integration and YOLOv5 itself
git clone https://github.com/marcoslucianops/DeepStream-Yolo.git
cd DeepStream-Yolo/
git clone https://github.com/ultralytics/yolov5.git
cd yolov5
pip3 install cmake
pip3 install -r requirements.txt
pip3 install onnx onnxsim onnxruntime
# export yolov5s.pt to ONNX with dynamic batch
cp ./../utils/export_yoloV5.py ./
wget https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5s.pt
python3 export_yoloV5.py -w yolov5s.pt --dynamic
# build the custom parser, wire up the config, and run
cd ..
CUDA_VER=12.2 make -C nvdsinfer_custom_impl_Yolo
cp ./yolov5/yolov5s.onnx ./
cp ./yolov5/labels.txt ./
sed -i.bak 's/config_infer_primary.txt/config_infer_primary_yoloV5.txt/g' ./deepstream_app_config.txt
deepstream-app -c deepstream_app_config.txt

And I still got the error:

ERROR: [TRT]: 10: Could not find any implementation for node /0/model.24/Range.
ERROR: [TRT]: 10: [optimizer.cpp::computeCosts::3869] Error Code 10: Internal Error (Could not find any implementation for node /0/model.24/Range.)
Building engine failed
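As a further data point, a static-batch export might take the dynamic-shape path out of the picture. This is a hedged diagnostic, not a known fix, assuming export_yoloV5.py produces a static batch-1 model when --dynamic is omitted:

# diagnostic only: re-export without dynamic batch, then build with trtexec
python3 export_yoloV5.py -w yolov5s.pt
trtexec --onnx=yolov5s.onnx --saveEngine=yolov5s.engine --verbose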

This is a CUDA driver bug on Jetson. We have fixed it in our latest internal code; the fix should come with the next release (probably the GA).

The issue will be solved with the next release.
