Application Migration from Jetson Orin NX(16G) to Jetson Orin NX Super(16G)

Platform:
Machine: aarch64
System: Linux
Python: 3.10.12
JetPack: 6.2
DeepStream: 7.1
Libraries:
CUDA: 12.6.68
cuDNN: 9.3.0.75
TensorRT: 10.3.0.30
OpenCV: 4.10.0 with CUDA: YES
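For reference, platform details like the ones above can usually be queried directly on a Jetson. This is a sketch of commands that require a flashed Jetson device; exact package names can differ by JetPack release:

```shell
# L4T / BSP release (e.g. "# R36 (release), REVISION: 4.3")
cat /etc/nv_tegra_release
# JetPack meta-package version, if the meta-package is installed
apt-cache show nvidia-jetpack | grep -i version
# Installed TensorRT and cuDNN package versions
dpkg -l | grep -Ei "tensorrt|cudnn"
# Python interpreter version
python3 --version
```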

I have implemented a DeepStream-based application on the Jetson Orin NX platform, where it runs normally. Problems appeared when I ported it to the Orin NX Super platform.

  1. I referred to this link to replace the whole “sources/apps”.
  2. I referred to this link to recompile the dependency library “nvdsinfer_custom_impl_Yolo.so”.
    When I converted the .pt model to an engine on the Super platform using
    “/usr/src/tensorrt/bin/trtexec
    --onnx=model_jetson_pgie_n_20250226.onnx
    --saveEngine=./model_jetson_pgie_n_20250411_fp16_batch4.engine
    --memPoolSize=workspace:4096MiB
    --fp16”
    I noticed an issue with the workspace size (workspace_small.png), which should look like this picture (workspace_normal.png), and it led to the warning “Some tactics do not have sufficient workspace memory to run. Increasing workspace size will enable more tactics, please check verbose output for requested sizes.”

So I removed “--memPoolSize=workspace:4096MiB” so that the conversion could complete normally.

  3. After switching to the new model and library, when running the program (similar to DeepStream-test2 in deepstream_python_apps), the initial display of the loaded model is shown in super70.txt: —start—
    After running for a period of time, the error is shown in super70.txt: —error down—
    Here are super70.txt (running on the Super) and normal_out.txt (running on the Orin NX):
    super70.txt (6.0 KB)
    normal_out.txt (5.3 KB)

My questions are:

  • My procedure should be standard, so there should be no problem, right?
  • How can I solve the workspace issue raised in the second point? Could it have caused the CUDA memory error after running for a period of time?
  • If the error is unrelated to the workspace issue during model conversion, how can I solve the CUDA problem encountered on the Super platform?

Hi,
Did you re-flash the whole system to r36.4.3 to enable Super mode? The mode is enabled through updates to some config files and the device tree. It should have no impact on userspace applications.
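One way to check this on the device (a sketch that requires a flashed Jetson; the listed mode names depend on the installed BSP) is to query the power model, since the Super profiles are exposed there:

```shell
# Show the currently selected power mode; on a Super-enabled
# r36.4.3 image the MAXN SUPER profile should be available
sudo nvpmodel -q
# The full set of modes is defined in /etc/nvpmodel.conf
```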

Hi,

Please try adding one more option to allow TensorRT to spend more build time exploring more optimization options.

$ /usr/src/tensorrt/bin/trtexec --builderOptimizationLevel=5 ...

Thanks.

I’m not quite sure what you mean by ‘r36.4.3’, but I followed this link to flash it.


And it does have Super mode.

&&&& RUNNING TensorRT.trtexec [TensorRT v100300] # /usr/src/tensorrt/bin/trtexec --builderOptimizationLevel=5 --onnx=model_jetson_pgie_n_20250226.onnx --saveEngine=./model_jetson_pgie_n_20250411_fp16_batch4.engine --memPoolSize=workspace:4096MiB --fp16
[04/14/2025-10:11:07] [I] === Model Options ===
[04/14/2025-10:11:07] [I] Format: ONNX
[04/14/2025-10:11:07] [I] Model: model_jetson_pgie_n_20250226.onnx
[04/14/2025-10:11:07] [I] Output:
[04/14/2025-10:11:07] [I] === Build Options ===
[04/14/2025-10:11:07] [I] Memory Pools: workspace: 0.00390625 MiB, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[04/14/2025-10:11:07] [I] avgTiming: 8
[04/14/2025-10:11:07] [I] Precision: FP32+FP16
[04/14/2025-10:11:07] [I] LayerPrecisions: 
[04/14/2025-10:11:07] [I] Layer Device Types: 
[04/14/2025-10:11:07] [I] Calibration: 
[04/14/2025-10:11:07] [I] Refit: Disabled
[04/14/2025-10:11:07] [I] Strip weights: Disabled
[04/14/2025-10:11:07] [I] Version Compatible: Disabled
[04/14/2025-10:11:07] [I] ONNX Plugin InstanceNorm: Disabled
[04/14/2025-10:11:07] [I] TensorRT runtime: full
[04/14/2025-10:11:07] [I] Lean DLL Path: 
[04/14/2025-10:11:07] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[04/14/2025-10:11:07] [I] Exclude Lean Runtime: Disabled
[04/14/2025-10:11:07] [I] Sparsity: Disabled
[04/14/2025-10:11:07] [I] Safe mode: Disabled
[04/14/2025-10:11:07] [I] Build DLA standalone loadable: Disabled
[04/14/2025-10:11:07] [I] Allow GPU fallback for DLA: Disabled
[04/14/2025-10:11:07] [I] DirectIO mode: Disabled
[04/14/2025-10:11:07] [I] Restricted mode: Disabled
[04/14/2025-10:11:07] [I] Skip inference: Disabled
[04/14/2025-10:11:07] [I] Save engine: ./model_jetson_pgie_n_20250411_fp16_batch4.engine
[04/14/2025-10:11:07] [I] Load engine: 
[04/14/2025-10:11:07] [I] Profiling verbosity: 0
[04/14/2025-10:11:07] [I] Tactic sources: Using default tactic sources
[04/14/2025-10:11:07] [I] timingCacheMode: local
[04/14/2025-10:11:07] [I] timingCacheFile: 
[04/14/2025-10:11:07] [I] Enable Compilation Cache: Enabled
[04/14/2025-10:11:07] [I] errorOnTimingCacheMiss: Disabled
[04/14/2025-10:11:07] [I] Preview Features: Use default preview flags.
[04/14/2025-10:11:07] [I] MaxAuxStreams: -1
[04/14/2025-10:11:07] [I] BuilderOptimizationLevel: 5
[04/14/2025-10:11:07] [I] Calibration Profile Index: 0
[04/14/2025-10:11:07] [I] Weight Streaming: Disabled
[04/14/2025-10:11:07] [I] Runtime Platform: Same As Build
[04/14/2025-10:11:07] [I] Debug Tensors: 
[04/14/2025-10:11:07] [I] Input(s)s format: fp32:CHW
[04/14/2025-10:11:07] [I] Output(s)s format: fp32:CHW
[04/14/2025-10:11:07] [I] Input build shapes: model
[04/14/2025-10:11:07] [I] Input calibration shapes: model
[04/14/2025-10:11:07] [I] === System Options ===
[04/14/2025-10:11:07] [I] Device: 0
[04/14/2025-10:11:07] [I] DLACore: 
[04/14/2025-10:11:07] [I] Plugins:
[04/14/2025-10:11:07] [I] setPluginsToSerialize:
[04/14/2025-10:11:07] [I] dynamicPlugins:
[04/14/2025-10:11:07] [I] ignoreParsedPluginLibs: 0
[04/14/2025-10:11:07] [I] 
[04/14/2025-10:11:07] [I] === Inference Options ===
[04/14/2025-10:11:07] [I] Batch: Explicit
[04/14/2025-10:11:07] [I] Input inference shapes: model
[04/14/2025-10:11:07] [I] Iterations: 10
[04/14/2025-10:11:07] [I] Duration: 3s (+ 200ms warm up)
[04/14/2025-10:11:07] [I] Sleep time: 0ms
[04/14/2025-10:11:07] [I] Idle time: 0ms
[04/14/2025-10:11:07] [I] Inference Streams: 1
[04/14/2025-10:11:07] [I] ExposeDMA: Disabled
[04/14/2025-10:11:07] [I] Data transfers: Enabled
[04/14/2025-10:11:07] [I] Spin-wait: Disabled
[04/14/2025-10:11:07] [I] Multithreading: Disabled
[04/14/2025-10:11:07] [I] CUDA Graph: Disabled
[04/14/2025-10:11:07] [I] Separate profiling: Disabled
[04/14/2025-10:11:07] [I] Time Deserialize: Disabled
[04/14/2025-10:11:07] [I] Time Refit: Disabled
[04/14/2025-10:11:07] [I] NVTX verbosity: 0
[04/14/2025-10:11:07] [I] Persistent Cache Ratio: 0
[04/14/2025-10:11:07] [I] Optimization Profile Index: 0
[04/14/2025-10:11:07] [I] Weight Streaming Budget: 100.000000%
[04/14/2025-10:11:07] [I] Inputs:
[04/14/2025-10:11:07] [I] Debug Tensor Save Destinations:
[04/14/2025-10:11:07] [I] === Reporting Options ===
[04/14/2025-10:11:07] [I] Verbose: Disabled
[04/14/2025-10:11:07] [I] Averages: 10 inferences
[04/14/2025-10:11:07] [I] Percentiles: 90,95,99
[04/14/2025-10:11:07] [I] Dump refittable layers:Disabled
[04/14/2025-10:11:07] [I] Dump output: Disabled
[04/14/2025-10:11:07] [I] Profile: Disabled
[04/14/2025-10:11:07] [I] Export timing to JSON file: 
[04/14/2025-10:11:07] [I] Export output to JSON file: 
[04/14/2025-10:11:07] [I] Export profile to JSON file: 
[04/14/2025-10:11:07] [I] 
[04/14/2025-10:11:07] [I] === Device Information ===
[04/14/2025-10:11:07] [I] Available Devices: 
[04/14/2025-10:11:07] [I]   Device 0: "Orin" UUID: GPU-10bbbeac-937e-5daa-9911-3c1c1a2fde5f
[04/14/2025-10:11:07] [I] Selected Device: Orin
[04/14/2025-10:11:07] [I] Selected Device ID: 0
[04/14/2025-10:11:07] [I] Selected Device UUID: GPU-10bbbeac-937e-5daa-9911-3c1c1a2fde5f
[04/14/2025-10:11:07] [I] Compute Capability: 8.7
[04/14/2025-10:11:07] [I] SMs: 8
[04/14/2025-10:11:07] [I] Device Global Memory: 15655 MiB
[04/14/2025-10:11:07] [I] Shared Memory per SM: 164 KiB
[04/14/2025-10:11:07] [I] Memory Bus Width: 256 bits (ECC disabled)
[04/14/2025-10:11:07] [I] Application Compute Clock Rate: 1.173 GHz
[04/14/2025-10:11:07] [I] Application Memory Clock Rate: 1.173 GHz
[04/14/2025-10:11:07] [I] 
[04/14/2025-10:11:07] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[04/14/2025-10:11:07] [I] 
[04/14/2025-10:11:07] [I] TensorRT version: 10.3.0
[04/14/2025-10:11:07] [I] Loading standard plugins
[04/14/2025-10:11:07] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 31, GPU 2320 (MiB)
[04/14/2025-10:11:10] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +928, GPU +1092, now: CPU 1002, GPU 3456 (MiB)
[04/14/2025-10:11:10] [I] Start parsing network model.
[04/14/2025-10:11:10] [I] [TRT] ----------------------------------------------------------------
[04/14/2025-10:11:10] [I] [TRT] Input filename:   model_jetson_pgie_n_20250226.onnx
[04/14/2025-10:11:10] [I] [TRT] ONNX IR version:  0.0.7
[04/14/2025-10:11:10] [I] [TRT] Opset version:    12
[04/14/2025-10:11:10] [I] [TRT] Producer name:    pytorch
[04/14/2025-10:11:10] [I] [TRT] Producer version: 2.6.0
[04/14/2025-10:11:10] [I] [TRT] Domain:           
[04/14/2025-10:11:10] [I] [TRT] Model version:    0
[04/14/2025-10:11:10] [I] [TRT] Doc string:       
[04/14/2025-10:11:10] [I] [TRT] ----------------------------------------------------------------
[04/14/2025-10:11:10] [I] Finished parsing network model. Parse time: 0.0496349
[04/14/2025-10:11:10] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[04/14/2025-10:11:50] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size will enable more tactics, please check verbose output for requested sizes.
[04/14/2025-10:17:26] [W] [TRT] Engine generation failed with backend strategy 4.
Error message: [optimizer.cpp::computeCosts::4148] Error Code 4: Internal Error (Could not find any implementation for node /1/ArgMax due to insufficient workspace. See verbose log for requested sizes.).
Skipping this backend strategy.
[04/14/2025-10:17:26] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[04/14/2025-10:17:44] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size will enable more tactics, please check verbose output for requested sizes.
[04/14/2025-10:19:29] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 8601600 detected for tactic 0x0000000000000000.
[04/14/2025-10:19:29] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 4300800 detected for tactic 0x0000000000000000.
[04/14/2025-10:19:29] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 2150400 detected for tactic 0x0000000000000000.
[04/14/2025-10:19:29] [W] [TRT] Engine generation failed with backend strategy 3.
Error message: [optimizer.cpp::computeCosts::4148] Error Code 4: Internal Error (Could not find any implementation for node {ForeignNode[/0/model.22/Concat...ONNXTRT_ShapeShuffle_30 + /0/model.22/dfl/Transpose_1]} due to insufficient workspace. See verbose log for requested sizes.).
Skipping this backend strategy.
[04/14/2025-10:19:29] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[04/14/2025-10:19:51] [I] [TRT] Some tactics do not have sufficient workspace memory to run. Increasing workspace size will enable more tactics, please check verbose output for requested sizes.
[04/14/2025-10:23:22] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 403200 detected for tactic 0x0000000000000000.
[04/14/2025-10:23:22] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 235200 detected for tactic 0x0000000000000000.
[04/14/2025-10:23:22] [W] [TRT] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 235200 detected for tactic 0x0000000000000000.
[04/14/2025-10:23:22] [W] [TRT] Engine generation failed with backend strategy 2.
Error message: [optimizer.cpp::computeCosts::4148] Error Code 4: Internal Error (Could not find any implementation for node {ForeignNode[/0/model.22/Split_1_27.../1/Slice]} due to insufficient workspace. See verbose log for requested sizes.).
Skipping this backend strategy.
[04/14/2025-10:23:22] [E] Error[2]: [engineBuilder.cpp::makeEngineFromSubGraph::1879] Error Code 2: Internal Error (Engine generation failed because all backend strategies failed.)
[04/14/2025-10:23:22] [E] Engine could not be created from network
[04/14/2025-10:23:22] [E] Building engine failed
[04/14/2025-10:23:22] [E] Failed to create engine from model or file.
[04/14/2025-10:23:22] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v100300] # /usr/src/tensorrt/bin/trtexec --builderOptimizationLevel=5 --onnx=model_jetson_pgie_n_20250226.onnx --saveEngine=./model_jetson_pgie_n_20250411_fp16_batch4.engine --memPoolSize=workspace:4096MiB --fp16

Hi,

We just tried it with a built-in model, and the workspace can be set to 4096 without issue
(using 4096 instead of 4096MiB).
Please give it a try.

$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --memPoolSize=workspace:4096
&&&& RUNNING TensorRT.trtexec [TensorRT v100300] # /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --memPoolSize=workspace:4096
[04/14/2025-02:52:50] [I] === Model Options ===
[04/14/2025-02:52:50] [I] Format: ONNX
[04/14/2025-02:52:50] [I] Model: /usr/src/tensorrt/data/mnist/mnist.onnx
[04/14/2025-02:52:50] [I] Output:
[04/14/2025-02:52:50] [I] === Build Options ===
[04/14/2025-02:52:50] [I] Memory Pools: workspace: 4096 MiB, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[04/14/2025-02:52:50] [I] avgTiming: 8
[04/14/2025-02:52:50] [I] Precision: FP32
[04/14/2025-02:52:50] [I] LayerPrecisions: 
[04/14/2025-02:52:50] [I] Layer Device Types: 
...

Thanks.
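Applied to the model in this thread, the corrected command would look like the sketch below (same file names as in the original post, with only the MiB suffix dropped; this still needs to run on the Jetson itself):

```shell
# Build the FP16 engine with a 4096 MiB workspace pool;
# note the value is given as "4096", not "4096MiB"
/usr/src/tensorrt/bin/trtexec \
  --onnx=model_jetson_pgie_n_20250226.onnx \
  --saveEngine=./model_jetson_pgie_n_20250411_fp16_batch4.engine \
  --memPoolSize=workspace:4096 \
  --fp16
```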

Luckily, it works. I will use that model to verify the software runtime problems.
^_^ Thank you.

Is there a method to compare two models? I wonder why my new model cannot be used normally (the bounding boxes are not displayed), while the old model can display them. It is strange that both the new and old engine models come from the same .pt model.

  • My operation should be standardized and there should be no problem, right?

Hi,

Do you use the same ONNX file to convert the TensorRT engine?
If not, could you give it a try, as this will eliminate one of the differences?

Thanks.
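Beyond re-using the same ONNX file, one hedged way to compare the behavior of the two models (assuming the Polygraphy tool that ships with TensorRT is installed; the file names here are placeholders, not the actual files from this thread) is to compare TensorRT output against ONNX Runtime for each export:

```shell
# Run the old export through TensorRT and ONNX Runtime and
# compare the outputs; repeat for the new export
polygraphy run old_model.onnx --trt --onnxrt
polygraphy run new_model.onnx --trt --onnxrt
```

If one export matches ONNX Runtime and the other diverges, the difference likely comes from the export step rather than the engine build.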

Right. When I used the old ONNX file for conversion as you recommended, it could display the bounding boxes. However, after running for around 10 minutes, the window display stops with an exception and the terminal outputs:

GPUassert: an illegal memory access was encountered /dvs/git/dirty/git-master_linux/deepstream/sdk/src/utils/nvmultiobjecttracker/src/modules/cuDCFv2/cuDCFFrameTransformTexture.cu 693
0:10:40.834126732 15627 0xaaaad3a8a240 ERROR nvinfer gstnvinfer.cpp:1267:get_converted_buffer: cudaMemset2DAsync failed with error cudaErrorIllegalAddress while converting buffer
0:10:40.834163060 15627 0xaaaad3a89c60 ERROR nvinfer gstnvinfer.cpp:1225:get_converted_buffer: cudaMemset2DAsync failed with error cudaErrorIllegalAddress while converting buffer
0:10:40.834212318 15627 0xaaaad3a89c60 WARN nvinfer gstnvinfer.cpp:1894:gst_nvinfer_process_objects: error: Buffer conversion failed
ERROR: Failed to add cudaStream callback for returning input buffers, cuda err_no:700, err_str:cudaErrorIllegalAddress
ERROR: Preprocessor transform input data failed., nvinfer error:NVDSINFER_CUDA_ERROR
0:10:40.834189754 15627 0xaaaad3a8a240 WARN nvinfer gstnvinfer.cpp:1576:gst_nvinfer_process_full_frame: error: Buffer conversion failed
0:10:40.834359582 15627 0xaaaad3a89c00 WARN nvinfer gstnvinfer.cpp:1420:gst_nvinfer_input_queue_loop: error: Failed to queue input batch for inferencing
0:10:40.834324726 15627 0xaaaad3a89c60 ERROR nvinfer gstnvinfer.cpp:1225:get_converted_buffer: cudaMemset2DAsync failed with error cudaErrorIllegalAddress while converting buffer
0:10:40.834432046 15627 0xaaaad3a89c60 WARN nvinfer gstnvinfer.cpp:1894:gst_nvinfer_process_objects: error: Buffer conversion failed

!![Exception] GPUassert failed
An exception occurred. GPUassert failed
gstnvtracker: Low-level tracker lib returned error 1
/dvs/git/dirty/git-master_linux/nvutils/nvbufsurftransform/nvbufsurftransform_copy.cpp:552: => Failed in mem copy

Linux-For-Tegra version - revision 36.4.3

Sorry, I kind of forgot. After checking, I confirmed that I did re-flash the system to r36.4.3 using the following:


Hi,

The “Failed in mem copy” error reported by nvbufsurftransform_copy is a known issue and is fixed in the upcoming DeepStream release.
Please see the comment below for more info:

Thanks.

Do you mean this issue is expected in the current version of JetPack 6.2 with DeepStream 7.1?
I am curious why JetPack 6.0 with DeepStream 7.0 does not seem to have this problem. Also, if I decrease the AI video channels from 4 to 2, the issue seems to go away. (I say this because the 2-channel video has been running normally since 6 pm yesterday.)

Hi,

This is a known issue, as it has been reported by other users before.
There is a WAR (workaround) shared in the above link. Could you give it a try to see if it also helps with your issue?

nvvideoconvert compute-hw=1 nvbuf-memory-type=3
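For example, those properties could be applied in a gst-launch-1.0 pipeline like the sketch below (the surrounding elements and the input file are placeholders; the actual DeepStream application would set the same properties on its nvvideoconvert element instead):

```shell
# Apply the suggested WAR properties on nvvideoconvert
# (placeholder pipeline; sample.mp4 is a hypothetical input)
gst-launch-1.0 filesrc location=sample.mp4 ! qtdemux ! h264parse ! \
  nvv4l2decoder ! nvvideoconvert compute-hw=1 nvbuf-memory-type=3 ! \
  'video/x-raw(memory:NVMM)' ! fakesink
```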

Thanks.