Process Killed when Generating a TensorRT Engine for the ViT models

Hello. Can I get an explanation of how to properly generate a TensorRT engine from the ONNX files provided for the OCDR sample?

• Hardware Platform (Jetson / GPU)
NVIDIA Jetson Orin NX (16GB ram)
• DeepStream Version
Deepstream 7.0 (in a docker)
• JetPack Version (valid for Jetson only)
Jetpack 6.0 (L4T 36.3.0)
• TensorRT Version
8.6.2.3
• NVIDIA GPU Driver Version (valid for GPU only)
How do I find this?
• Issue Type( questions, new requirements, bugs)
Question/bug?
• How to reproduce the issue ? (This is for bugs. Including which sample app is using, the configuration files content, the command line used and other details for reproducing)

I started a DeepStream 7.0 Triton multiarch Docker container.
I installed libopencv-dev.
I downloaded different models from the following link.
I git-cloned the NVIDIA OCDR GitHub repo for the sample and ran make to build the libraries.
Everything works perfectly with the v1.0 versions of the models and with the v2.0 OCR models. But when trying to generate the engine for the OCD ViT models, the process is “Killed” after around 15 minutes of building in TensorRT. Am I doing something wrong? Here is my command:

/usr/src/tensorrt/bin/trtexec --onnx=/opt/nvidia/deepstream/deepstream/models/ocdnet_fan_tiny_2x_icdar_pruned.onnx --minShapes=input:1x3x736x1280 --optShapes=input:1x3x736x1280 --maxShapes=input:1x3x736x1280 --fp16 --saveEngine=ocdnetvit.fp16.engine

The output I get is the following:

root@**********:/~/volume# /usr/src/tensorrt/bin/trtexec --onnx=/opt/nvidia/deepstream/deepstream/models/ocdnet_fan_tiny_2x_icdar_pruned.onnx --minShapes=input:1x3x736x1280 --optShapes=input:1x3x736x1280 --maxShapes=input:1x3x736x1280 --fp16 --saveEngine=ocdnetvit.fp16.engine
&&&& RUNNING TensorRT.trtexec [TensorRT v8602] # /usr/src/tensorrt/bin/trtexec --onnx=/opt/nvidia/deepstream/deepstream/models/ocdnet_fan_tiny_2x_icdar_pruned.onnx --minShapes=input:1x3x736x1280 --optShapes=input:1x3x736x1280 --maxShapes=input:1x3x736x1280 --fp16 --saveEngine=ocdnetvit.fp16.engine
[10/03/2024-12:26:24] [I] === Model Options ===
[10/03/2024-12:26:24] [I] Format: ONNX
[10/03/2024-12:26:24] [I] Model: /opt/nvidia/deepstream/deepstream/models/ocdnet_fan_tiny_2x_icdar_pruned.onnx
[10/03/2024-12:26:24] [I] Output:
[10/03/2024-12:26:24] [I] === Build Options ===
[10/03/2024-12:26:24] [I] Max batch: explicit batch
[10/03/2024-12:26:24] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/03/2024-12:26:24] [I] minTiming: 1
[10/03/2024-12:26:24] [I] avgTiming: 8
[10/03/2024-12:26:24] [I] Precision: FP32+FP16
[10/03/2024-12:26:24] [I] LayerPrecisions: 
[10/03/2024-12:26:24] [I] Layer Device Types: 
[10/03/2024-12:26:24] [I] Calibration: 
[10/03/2024-12:26:24] [I] Refit: Disabled

[10/03/2024-12:26:24] [I] Restricted mode: Disabled
[10/03/2024-12:26:24] [I] Skip inference: Disabled
[10/03/2024-12:26:24] [I] Save engine: ocdnetvit.fp16.engine
[10/03/2024-12:26:24] [I] Load engine: 
[10/03/2024-12:26:24] [I] Profiling verbosity: 0
[10/03/2024-12:26:24] [I] Tactic sources: Using default tactic sources
[10/03/2024-12:26:24] [I] timingCacheMode: local
[10/03/2024-12:26:24] [I] timingCacheFile: 
[10/03/2024-12:26:24] [I] Heuristic: Disabled
[10/03/2024-12:26:24] [I] Preview Features: Use default preview flags.
[10/03/2024-12:26:24] [I] MaxAuxStreams: -1
[10/03/2024-12:26:24] [I] BuilderOptimizationLevel: -1
[10/03/2024-12:26:24] [I] Input(s)s format: fp32:CHW
[10/03/2024-12:26:24] [I] Output(s)s format: fp32:CHW
[10/03/2024-12:26:24] [I] Input build shape: input=1x3x736x1280+1x3x736x1280+1x3x736x1280
[10/03/2024-12:26:24] [I] Input calibration shapes: model
[10/03/2024-12:26:24] [I] === System Options ===
[10/03/2024-12:26:24] [I] Device: 0
[10/03/2024-12:26:24] [I] DLACore: 
[10/03/2024-12:26:24] [I] Plugins:
[10/03/2024-12:26:24] [I] setPluginsToSerialize:
[10/03/2024-12:26:24] [I] dynamicPlugins:
[10/03/2024-12:26:24] [I] ignoreParsedPluginLibs: 0
[10/03/2024-12:26:24] [I] 
[10/03/2024-12:26:24] [I] === Inference Options ===
[10/03/2024-12:26:24] [I] Batch: Explicit
[10/03/2024-12:26:24] [I] Input inference shape: input=1x3x736x1280
[10/03/2024-12:26:24] [I] Iterations: 10
[10/03/2024-12:26:24] [I] Duration: 3s (+ 200ms warm up)
[10/03/2024-12:26:24] [I] Sleep time: 0ms
[10/03/2024-12:26:24] [I] Idle time: 0ms
[10/03/2024-12:26:24] [I] Inference Streams: 1
[10/03/2024-12:26:24] [I] Data transfers: Enabled
[10/03/2024-12:26:24] [I] Spin-wait: Disabled
[10/03/2024-12:26:24] [I] Multithreading: Disabled
[10/03/2024-12:26:24] [I] CUDA Graph: Disabled
[10/03/2024-12:26:24] [I] Separate profiling: Disabled
[10/03/2024-12:26:24] [I] Time Deserialize: Disabled
[10/03/2024-12:26:24] [I] Time Refit: Disabled
[10/03/2024-12:26:24] [I] NVTX verbosity: 0
[10/03/2024-12:26:24] [I] Persistent Cache Ratio: 0
[10/03/2024-12:26:24] [I] Inputs:
[10/03/2024-12:26:24] [I] === Reporting Options ===
[10/03/2024-12:26:24] [I] Verbose: Disabled
[10/03/2024-12:26:24] [I] Averages: 10 inferences
[10/03/2024-12:26:24] [I] Percentiles: 90,95,99
[10/03/2024-12:26:24] [I] Dump refittable layers:Disabled
[10/03/2024-12:26:24] [I] Dump output: Disabled
[10/03/2024-12:26:24] [I] Profile: Disabled
[10/03/2024-12:26:24] [I] Export timing to JSON file: 
[10/03/2024-12:26:24] [I] Export output to JSON file: 
[10/03/2024-12:26:24] [I] Export profile to JSON file: 
[10/03/2024-12:26:24] [I] 
[10/03/2024-12:26:24] [I] === Device Information ===
[10/03/2024-12:26:24] [I] Selected Device: Orin
[10/03/2024-12:26:24] [I] Compute Capability: 8.7
[10/03/2024-12:26:24] [I] SMs: 8
[10/03/2024-12:26:24] [I] Device Global Memory: 15656 MiB
[10/03/2024-12:26:24] [I] Shared Memory per SM: 164 KiB
[10/03/2024-12:26:24] [I] Memory Bus Width: 256 bits (ECC disabled)
[10/03/2024-12:26:24] [I] Application Compute Clock Rate: 0.918 GHz
[10/03/2024-12:26:24] [I] Application Memory Clock Rate: 0.918 GHz
[10/03/2024-12:26:24] [I] 
[10/03/2024-12:26:24] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[10/03/2024-12:26:24] [I] 
[10/03/2024-12:26:24] [I] TensorRT version: 8.6.2
[10/03/2024-12:26:24] [I] Loading standard plugins
[10/03/2024-12:26:24] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 33, GPU 4900 (MiB)
[10/03/2024-12:26:29] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1154, GPU +1112, now: CPU 1223, GPU 6055 (MiB)
[10/03/2024-12:26:29] [I] Start parsing network model.
[10/03/2024-12:26:29] [I] [TRT] ----------------------------------------------------------------
[10/03/2024-12:26:29] [I] [TRT] Input filename:   /opt/nvidia/deepstream/deepstream/models/ocdnet_fan_tiny_2x_icdar_pruned.onnx
[10/03/2024-12:26:29] [I] [TRT] ONNX IR version:  0.0.8
[10/03/2024-12:26:29] [I] [TRT] Opset version:    17
[10/03/2024-12:26:29] [I] [TRT] Producer name:    pytorch
[10/03/2024-12:26:29] [I] [TRT] Producer version: 1.14.0
[10/03/2024-12:26:29] [I] [TRT] Domain:           
[10/03/2024-12:26:29] [I] [TRT] Model version:    0
[10/03/2024-12:26:29] [I] [TRT] Doc string:       
[10/03/2024-12:26:29] [I] [TRT] ----------------------------------------------------------------
[10/03/2024-12:26:29] [W] [TRT] onnx2trt_utils.cpp:372: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[10/03/2024-12:26:29] [W] [TRT] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[10/03/2024-12:26:29] [I] Finished parsing network model. Parse time: 0.154953
[10/03/2024-12:26:29] [I] [TRT] Graph optimization time: 0.124676 seconds.
[10/03/2024-12:26:29] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
Killed

Please refer to this sample.

This may be due to the model version. This sample uses the tested model.


Thank you very much for your answer.

I am running the NVOCDR sample inside a Deepstream 7.0 docker container.
I have TensorRT version 8.6.2.3-1+cuda12.2.

I cannot build the 8.6 version with CMake as described on GitHub because I am inside the DeepStream container.
I was unable to make the NVOCDR sample work in any other environment (DS 6.4, DS 6.3, or without a container).

I still want to use the ViT versions of the models, as I saw considerable improvement (when running inference on the NVIDIA inference API) compared to the v1.0 models.

Is there a way to get more logs from TRT or find a solution to make it work in the DS 7.0 container? Do you have any advice on how I should debug this issue?

Also, is there a limit to the resolution that can be set when generating the OCD and OCR TRT engines?

This is optional; if you use DS 7.0, there is no need to compile TRT 8.6 manually.

Add the verbose parameter to get more logs.

/usr/src/tensorrt/bin/trtexec "xxxxxx"  --verbose --dumpLayerInfo --exportLayerInfo=layer.json > log.log 2>&1

I am not sure if this problem is hardware related; it works fine on my AGX Orin. Can you try switching the Jetson power profile, for example to MAXN?

Or try adding the --workspace parameter.
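Also, a plain “Killed” with no TensorRT error usually means the Linux OOM killer stopped the build. After the crash you can confirm this from the kernel log with `sudo dmesg | grep -i 'killed process'` (a sketch; the exact message wording varies by kernel version, and `oom_victim` below is just a hypothetical helper for parsing such a line):

```shell
# Parse the process name out of an OOM-killer line from dmesg, so you
# can confirm it was trtexec that the kernel killed.
oom_victim() {
  # Kernel OOM lines look like: "Out of memory: Killed process <pid> (<name>) ..."
  echo "$1" | sed -n 's/.*Killed process [0-9]* (\([^)]*\)).*/\1/p'
}

# Example with a sample dmesg line (on a real system, feed it lines from
# `sudo dmesg | grep -i "killed process"`):
sample='[ 1234.5] Out of memory: Killed process 4321 (trtexec) total-vm:15000000kB'
oom_victim "$sample"   # prints: trtexec
```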


The sample application I mentioned above uses the latest OCD/OCR model. So I think the model you mentioned should work.


Thank you once more for your time.

I followed your instructions and tried generating the engine again with the logs.
Here are the log files:
log.log (3.3 MB)

--workspace=8192

workspace8192log.log (3.3 MB)

--workspace=4096
workspace4096log.log (3.3 MB)

I also tried onnxsim to simplify the model and then generate the engine. I still get stuck at the same step.
onnxsimlog.log (2.6 MB)
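For completeness, my onnxsim step was roughly the following (a sketch; it assumes onnxsim is installed via pip, and the `--overwrite-input-shape` flag name may differ between onnxsim versions):

```shell
# Simplify the ONNX graph before building the engine
pip install onnxsim
onnxsim ocdnet.onnx ocdnet_sim.onnx \
  --overwrite-input-shape input:1x3x736x1280
```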

I used the model from the sample: ‘ocdnet_fan_tiny_2x_icdar_pruned.onnx’ with
wget --content-disposition 'https://api.ngc.nvidia.com/v2/models/org/nvidia/team/tao/ocdnet/deployable_v2.3/files?redirect=true&path=ocdnet_fan_tiny_2x_icdar_pruned.onnx' -O ocdnet.onnx

As you can see in the log files, each run stops at the same step when the process is killed.

As mentioned before, I have this problem only with the v2.x (ViT) versions of the OCD models. It seems that particular step kills the process immediately. Is there any hope of making it work?

Thank you in advance for your help!

I am not sure if it is because the memory of the Orin NX is too small.

Have you tried setting the profile to MAXN?

Or you can get more explanation in the TensorRT category on the NVIDIA Developer Forums.


Yes, I forgot to mention, but the Jetson Orin is set to MAXN and jetson_clocks is running.

Okay, thank you for your help. I will post a question on their forum.

A small update:

I managed to generate the engine for the ViT OCDnet successfully on an AGX Orin inside a DS 7.0 container.
On the Optical Character Detection page on NVIDIA NGC there are performance results for the model on the Orin NX:

ocdnet_fan_tiny_2x_icdar_pruned | Orin NX | FP16 | 2 | 1.18

I assume it is possible and has been done before. I retried with a fresh Docker container and still hit the “Killed” process at the exact same step. If I find a solution, I will post it here.


I managed to generate the TRT engine for the ocdnet_fan_tiny_2x_icdar_pruned model on the Orin NX by increasing the swap memory to 16 GB, so I had 32 GB (RAM + swap) in total. Engine generation took some time, but in the end I got the result.
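For anyone else hitting this, the swap file setup was along these lines (a sketch with an example path and size; run as root, and note that Jetson images may already use zram via nvzramconfig):

```shell
# Create and enable a 16 GiB swap file (example path: /swapfile)
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify the new swap is active
swapon --show
free -h

# To make it persistent across reboots, add this line to /etc/fstab:
# /swapfile none swap sw 0 0
```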

Thank you again for your time, @junshengy .


Thanks for sharing. This can help more people.

