trtexec model conversion crashes with insufficient GPU memory

Hi, there:

I’m trying to use the trtexec tool on Orin32 for a custom CNN model. If I let it run on one AI core, it works fine. However, if I try to run it in parallel, it complains that there is not enough GPU memory.

We also have an Orin64 board, and it works just fine there. Is there any way to make it work on Orin32 as well?

It crashes at node 772, which is a convolution with a 4x64x1x1 kernel. I guess I can’t share the model here.

Any suggestions?

Many thanks!

Crash log below.

[10/26/2022-17:55:35] [W] [TRT] -------------- The current device memory allocations dump as below --------------
[10/26/2022-17:55:35] [W] [TRT] [0]:89088 :CASK device reserved buffer in initCaskDeviceBuffer: at /_src/build/aarch64-gnu/release/runtime/gpu/cask/caskUtils.h: 459 idx: 244093 time: 0.00989673
[10/26/2022-17:55:35] [W] [TRT] Requested amount of GPU memory (89088 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[10/26/2022-17:55:35] [W] [TRT] Skipping tactic 21 due to insufficient memory on requested size of 89088 detected for tactic 0xff4d370e229c1e8e.
Try decreasing the workspace size with IBuilderConfig::setMemoryPoolLimit().
[10/26/2022-17:55:35] [E] Error[10]: [optimizer.cpp::computeCosts::3628] Error Code 10: Internal Error (Could not find any implementation for node node_of_772.)
[10/26/2022-17:55:35] [E] Error[2]: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[10/26/2022-17:55:35] [E] Engine could not be created from network
[10/26/2022-17:55:35] [E] Building engine failed
[10/26/2022-17:55:35] [E] Failed to create engine from model or file.
[10/26/2022-17:55:35] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --onnx=c3dv2.3.k4.onnx --saveEngine=c3d_best.engine --best --allowGPUFallback

Hi,

We are moving this post to the Jetson Orin NX forum to get better help.

Please let us know if you’re facing this issue on different hardware.

Thank you.

Hi,

Could you set the --memPoolSize option according to your environment to see if it helps?

Thanks.

As I said in the original post, the Orin64, which has 64GB of CPU memory, works fine. The Orin32 has 32GB.

I’m using the Orin32. How much should I set? How do I find out the limit?

Hi,

You can check the system status with tegrastats.

$ sudo tegrastats

For example, in our environment, there is ~28GiB of free memory.
So we can set the workspace memory pool to 24GiB (depending on your budget).

11-15-2022 03:01:43 RAM 1178/30536MB (lfb 6990x4MB) SWAP 0/15268MB (cached 0MB) CPU [0%@729,0%@729,0%@729,0%@729,off,off,off,off] …
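As a rough sketch of the arithmetic above, the free memory can be read from the `RAM used/totalMB` field of a tegrastats line. The helper names and the 4096 MB headroom are assumptions made for illustration, not NVIDIA recommendations, and the sample line is the one quoted above, cut after the fields that matter here.

```python
import re

# Sample line from the tegrastats output quoted above (only the RAM/SWAP fields are kept).
SAMPLE = "11-15-2022 03:01:43 RAM 1178/30536MB (lfb 6990x4MB) SWAP 0/15268MB (cached 0MB)"


def free_ram_mb(line):
    """Return total minus used RAM, in MB, from one tegrastats line."""
    match = re.search(r"RAM (\d+)/(\d+)MB", line)
    if match is None:
        raise ValueError("no 'RAM used/totalMB' field found")
    used, total = (int(group) for group in match.groups())
    return total - used


def workspace_budget_mb(free_mb, headroom_mb=4096):
    """Hypothetical helper: keep some headroom for the OS and other processes."""
    return max(free_mb - headroom_mb, 0)


free = free_ram_mb(SAMPLE)
print(free)                       # 29358, i.e. roughly 28 GiB
print(workspace_budget_mb(free))  # 25262
```

On a Jetson the CPU and GPU share the same physical memory, so this single figure is the relevant budget for both.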

Thanks.

I used the following memory option, and it complains that the format is wrong:
/usr/src/tensorrt/bin/trtexec --onnx=c3dv2.3.k4.onnx --saveEngine=c3d_best.engine --best --allowGPUFallback --memPoolSize=dlaGobalDram:26000

Do you have sample command options? The command help is not clear. Thanks.
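One possible reason for the format complaint, offered as an assumption rather than a confirmed diagnosis: the trtexec help in TensorRT 8.4 describes --memPoolSize as a comma-separated list of pool:sizeInMiB pairs, with the pool names workspace, dlaSRAM, dlaLocalDRAM and dlaGlobalDRAM. If names are matched exactly, `dlaGobalDram` is not one of them. The checker below is a hypothetical helper, not part of trtexec; verify the pool names against `trtexec --help` on your install.

```python
# Pool names as listed by `trtexec --help` in TensorRT 8.4 (assumption: verify locally).
KNOWN_POOLS = {"workspace", "dlaSRAM", "dlaLocalDRAM", "dlaGlobalDRAM"}


def check_mem_pool_spec(spec):
    """Return a list of problems found in a --memPoolSize value (empty if none)."""
    problems = []
    for item in spec.split(","):
        pool, sep, size = item.partition(":")
        if not sep:
            problems.append(f"'{item}': expected pool:sizeInMiB")
            continue
        if pool not in KNOWN_POOLS:
            problems.append(f"'{pool}': unknown pool name")
        try:
            if float(size) < 0:
                problems.append(f"'{size}': size must not be negative")
        except ValueError:
            problems.append(f"'{size}': size is not a number")
    return problems


print(check_mem_pool_spec("dlaGobalDram:26000"))  # one problem: unknown pool name
print(check_mem_pool_spec("workspace:26000"))     # []
```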

Here is my tegrastats output:
tegrastats
11-16-2022 16:57:52 RAM 2633/30536MB (lfb 6525x4MB) SWAP 0/15268MB (cached 0MB) CPU [0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,5%@729,0%@729,0%@729,0%@729] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@44.718C Tboard@33C SOC2@40.718C Tdiode@35C SOC0@41.625C CV1@-256C GPU@-256C tj@44.625C SOC1@40.593C CV2@-256C
11-16-2022 16:57:53 RAM 2633/30536MB (lfb 6525x4MB) SWAP 0/15268MB (cached 0MB) CPU [1%@729,0%@729,0%@729,2%@729,0%@729,0%@729,0%@729,0%@729,5%@729,0%@729,0%@729,0%@729] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@45.062C Tboard@33C SOC2@40.812C Tdiode@35C SOC0@41.593C CV1@-256C GPU@-256C tj@45.062C SOC1@40.281C CV2@-256C

Hi,

Please set the workspace memory pool instead of the DLA one.
Also, the amount looks incorrect. Do you want to set 26GiB?

For example, we can run ResNet with the command below:

$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/resnet50/ResNet50.onnx --saveEngine=ResNet50.engine --best --allowGPUFallback --memPoolSize=workspace:2600

Thanks.

I tried that approach. It still crashed. At the time of the crash, tegrastats shows very low memory usage, only around 8GB.

I set the workspace to 2600, 2700, 26000, and 27000, all with the same result. I rebooted the Orin box after each try.

BTW, your unit is MB, so 26 GB should be 26000, not 2600. Anyway, I tried them all.

I attached the results for both 2600 and 26000; the error is exactly the same.

======= workspace:2600 ============================================

11-17-2022 14:22:29 RAM 8116/30536MB (lfb 4915x4MB) SWAP 0/15268MB (cached 0MB) CPU [100%@2201,0%@2201,0%@2201,0%@2201,3%@729,24%@729,22%@729,18%@729,60%@2201,2%@2201,0%@2201,0%@2201] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@49.5C Tboard@37C SOC2@45.093C Tdiode@39C SOC0@45.562C CV1@-256C GPU@44.375C tj@49.5C SOC1@44.218C CV2@-256C
11-17-2022 14:22:30 RAM 8116/30536MB (lfb 4915x4MB) SWAP 0/15268MB (cached 0MB) CPU [100%@2201,0%@2201,0%@2201,0%@2201,4%@729,19%@729,25%@729,19%@729,55%@2201,8%@2201,0%@2201,0%@2201] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@50.125C Tboard@37C SOC2@44.968C Tdiode@39.25C SOC0@45.562C CV1@-256C GPU@44.031C tj@50.125C SOC1@44.187C CV2@-256C
11-17-2022 14:22:31 RAM 8116/30536MB (lfb 4915x4MB) SWAP 0/15268MB (cached 0MB) CPU [100%@2201,0%@2201,0%@2201,0%@2201,7%@2201,22%@2201,20%@2201,50%@2201,9%@729,2%@729,0%@729,0%@729] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@50.343C Tboard@37C SOC2@45.093C Tdiode@39C SOC0@45.531C CV1@-256C GPU@44.062C tj@50.343C SOC1@44.218C CV2@-256C
11-17-2022 14:22:32 RAM 6142/30536MB (lfb 5180x4MB) SWAP 0/15268MB (cached 0MB) CPU [75%@2201,0%@2201,0%@2201,18%@2201,7%@729,14%@729,23%@729,11%@729,0%@729,0%@729,0%@729,0%@729] EMC_FREQ 0% GR3D_FREQ 5% CV0@-256C CPU@49.875C Tboard@37C SOC2@44.812C Tdiode@39C SOC0@45.625C CV1@-256C GPU@44.125C tj@49.281C SOC1@44.218C CV2@-256C
11-17-2022 14:22:33 RAM 4898/30536MB (lfb 5408x4MB) SWAP 0/15268MB (cached 0MB) CPU [0%@729,0%@729,0%@729,6%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@48.5C Tboard@37C SOC2@44.437C Tdiode@39C SOC0@45.718C CV1@-256C GPU@-256C tj@48.906C SOC1@44.218C CV2@-256C

=== trtexec log ===
[11/17/2022-14:22:31] [W] [TRT] [0xaaab3ff24180]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 735118 time: 1.28e-07
[11/17/2022-14:22:31] [W] [TRT] [0xaaaaf570aef0]:4 :: weight zero-point in internalAllocate: at runtime/common/weightsPtr.cpp: 102 idx: 155 time: 1.6e-07
[11/17/2022-14:22:31] [W] [TRT] [0xaaab3fe4c9c0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 563346 time: 9.6e-08
[11/17/2022-14:22:31] [W] [TRT] [0xaaab41d09900]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 735130 time: 2.56e-07
[11/17/2022-14:22:31] [W] [TRT] [0xaaab41cf9ff0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 735142 time: 6.4e-08
[11/17/2022-14:22:31] [W] [TRT] -------------- The current device memory allocations dump as below --------------
[11/17/2022-14:22:31] [W] [TRT] [0]:89088 :CASK device reserved buffer in initCaskDeviceBuffer: at /_src/build/aarch64-gnu/release/runtime/gpu/cask/caskUtils.h: 459 idx: 244073 time: 0.0101787
[11/17/2022-14:22:31] [W] [TRT] Requested amount of GPU memory (89088 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[11/17/2022-14:22:31] [W] [TRT] Skipping tactic 21 due to insufficient memory on requested size of 89088 detected for tactic 0xff4d370e229c1e8e.
[11/17/2022-14:22:31] [E] Error[4]: [optimizer.cpp::computeCosts::3625] Error Code 4: Internal Error (Could not find any implementation for node node_of_772 due to insufficient workspace. See verbose log for requested sizes.)
[11/17/2022-14:22:31] [E] Error[2]: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[11/17/2022-14:22:31] [E] Engine could not be created from network
[11/17/2022-14:22:31] [E] Building engine failed
[11/17/2022-14:22:31] [E] Failed to create engine from model or file.
[11/17/2022-14:22:31] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --onnx=c3dv2.3.k4.onnx --saveEngine=c3d_best.engine --best --allowGPUFallback --memPoolSize=workspace:2600

======workspace:26000 ============================
=== at/after crash, mem diff is only 3G+ ===
11-17-2022 14:06:13 RAM 8131/30536MB (lfb 4884x4MB) SWAP 0/15268MB (cached 0MB) CPU [100%@2201,0%@2201,0%@2201,0%@2201,26%@729,2%@729,22%@729,18%@729,9%@2201,48%@2201,3%@2201,0%@2201] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@49.531C Tboard@37C SOC2@44.906C Tdiode@38.75C SOC0@45.718C CV1@-256C GPU@44.156C tj@49.75C SOC1@44.187C CV2@-256C
11-17-2022 14:06:14 RAM 8130/30536MB (lfb 4884x4MB) SWAP 0/15268MB (cached 0MB) CPU [96%@2201,0%@2201,0%@2201,0%@2201,19%@729,10%@729,10%@729,19%@729,0%@729,48%@729,5%@729,0%@729] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@49.218C Tboard@37C SOC2@44.687C Tdiode@39C SOC0@45.656C CV1@-256C GPU@44.187C tj@49.218C SOC1@44.218C CV2@-256C
11-17-2022 14:06:15 RAM 4905/30536MB (lfb 5358x4MB) SWAP 0/15268MB (cached 0MB) CPU [8%@729,0%@729,0%@729,26%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@48.531C Tboard@37C SOC2@44.468C Tdiode@39C SOC0@45.437C CV1@-256C GPU@-256C tj@48.812C SOC1@44.281C CV2@-256C
11-17-2022 14:06:16 RAM 4905/30536MB (lfb 5358x4MB) SWAP 0/15268MB (cached 0MB) CPU [0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@48.562C Tboard@37C SOC2@44.343C Tdiode@39C SOC0@45.375C CV1@-256C GPU@-256C tj@48.562C SOC1@44.281C CV2@-256C

=== trtexec log ===
[11/17/2022-14:06:14] [W] [TRT] [0xaaab5c6c46d0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 735130 time: 2.56e-07
[11/17/2022-14:06:14] [W] [TRT] [0xaaab5b03e980]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 735139 time: 1.28e-07
[11/17/2022-14:06:14] [W] [TRT] [0xaaab0f29c220]:4 :: weight zero-point in internalAllocate: at runtime/common/weightsPtr.cpp: 102 idx: 727 time: 2.56e-07
[11/17/2022-14:06:14] [W] [TRT] [0xaaab56e3e410]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 735142 time: 6.4e-08
[11/17/2022-14:06:14] [W] [TRT] [0xaaab5afc46b0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 499319 time: 1.92e-07
[11/17/2022-14:06:14] [W] [TRT] -------------- The current device memory allocations dump as below --------------
[11/17/2022-14:06:14] [W] [TRT] [0]:89088 :CASK device reserved buffer in initCaskDeviceBuffer: at /_src/build/aarch64-gnu/release/runtime/gpu/cask/caskUtils.h: 459 idx: 244093 time: 0.0098853
[11/17/2022-14:06:14] [W] [TRT] Requested amount of GPU memory (89088 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[11/17/2022-14:06:14] [W] [TRT] Skipping tactic 21 due to insufficient memory on requested size of 89088 detected for tactic 0xff4d370e229c1e8e.
[11/17/2022-14:06:14] [E] Error[10]: [optimizer.cpp::computeCosts::3628] Error Code 10: Internal Error (Could not find any implementation for node node_of_772.)
[11/17/2022-14:06:14] [E] Error[2]: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[11/17/2022-14:06:14] [E] Engine could not be created from network
[11/17/2022-14:06:14] [E] Building engine failed
[11/17/2022-14:06:14] [E] Failed to create engine from model or file.
[11/17/2022-14:06:14] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --onnx=c3dv2.3.k4.onnx --saveEngine=c3d_best.engine --best --allowGPUFallback --memPoolSize=workspace:26000

Hi,

Sorry for the incorrect message about the amount.
I have updated the comment above.

Could you share the model with us so we can test it in our environment as well?
Thanks.

Sorry, I can’t give out the model. It just has lots of CNN layers. In the last few layers, there are nearly 30 parallel conv layers, whose results are then concatenated.

Hi,

Just want to clarify the issue first.

There is only one model, but the model contains several layers that are expected to run in parallel.
Is that correct? Or do you have multiple models that you want to run concurrently?

To get more information about the crash, could you add --verbose on both the Orin 32GB and the Orin 64GB and attach the output?
We want to compare the success and failure logs to find the possible cause.
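Since verbose logs can run to hundreds of megabytes, a small script can reduce each one to the lines that matter before comparing them. The sketch below is only an illustration: the two regular expressions are derived from the log lines quoted in this thread, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns derived from the trtexec log lines quoted in this thread.
SKIP = re.compile(r"Skipping tactic (\S+) due to insufficient memory on requested size of (\d+)")
ERROR = re.compile(r"\[E\] Error\[(\d+)\]: (.*)")


def summarize(lines):
    """Count skipped-tactic request sizes and collect error messages from a log."""
    skipped_sizes = Counter()
    errors = []
    for line in lines:  # any iterable of lines, so a large file can be streamed
        match = SKIP.search(line)
        if match:
            skipped_sizes[int(match.group(2))] += 1
            continue
        match = ERROR.search(line)
        if match:
            errors.append((int(match.group(1)), match.group(2).strip()))
    return skipped_sizes, errors


sample = [
    "[11/17/2022-14:22:31] [W] [TRT] Skipping tactic 21 due to insufficient memory "
    "on requested size of 89088 detected for tactic 0xff4d370e229c1e8e.",
    "[11/17/2022-14:22:31] [E] Error[2]: [builder.cpp::buildSerializedNetwork::636] "
    "Error Code 2: Internal Error (Assertion engine != nullptr failed. )",
]
sizes, errors = summarize(sample)
print(dict(sizes))   # {89088: 1}
print(errors[0][0])  # 2
```

To use it on a real log, pass an open file object, for example `with open("orin32.log") as f: summarize(f)`, and compare the two summaries side by side.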

Thanks.

Just one model. We would like to run it without specifying the AI core, to boost fps (or qps).
I’m attaching 2 files:
orin32_1dla.log (960.3 KB)
orin64.log (10.3 MB)

orin32_1dla.log: succeeded with --useDLACore=0, but with low qps (only 23.5)
orin64.log: succeeded without --useDLACore and with the --best option, with very high qps (344)

orin32.log: uses the same command as orin64, but failed. The log file is too big (over 300MB) for your website to accept, so I put it on Google Drive, and you can access it here:

Thanks.

Hi,

Thanks for sharing.

Based on your logs, the memory configuration on both boards is set to ~30xxx MiB.
So the behavior is expected to be similar, and the model should work on both boards.

However, it seems that the TensorRT version on the 64GB board is 8.4.0.
Did you set up the device with JetPack 5.0.2? JetPack 5.0.2 should contain 8.4.1 instead.
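One quick way to read the version from a log is the `[TensorRT vNNNN]` tag that trtexec prints on its final line. The sketch below assumes the TensorRT 8.x encoding, in which the number is major*1000 + minor*100 + patch; that encoding is an assumption about this release family and may differ in other major versions. `decode_trt_banner` is a hypothetical helper name.

```python
import re


def decode_trt_banner(line):
    """Decode '[TensorRT vNNNN]', assuming NNNN = major*1000 + minor*100 + patch."""
    match = re.search(r"\[TensorRT v(\d+)\]", line)
    if match is None:
        raise ValueError("no TensorRT version banner found")
    number = int(match.group(1))
    major, rest = divmod(number, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


# Final line of the failing Orin32 log quoted earlier in the thread (shortened).
banner = "&&&& FAILED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --onnx=c3dv2.3.k4.onnx"
print(decode_trt_banner(banner))  # (8, 4, 1)
```

Under this assumption a board running 8.4.0 would print `v8400`, so the two logs can be told apart at a glance.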

Thanks.

My issue is with the Orin32 board. Someone else owns the Orin64 board, and I don’t know the details of its setup. Could you please let me know how to fix the Orin32 setup?

Orin32 TensorRT version
dpkg -l | grep nvinfer
ii libnvinfer-bin 8.4.1-1+cuda11.4 arm64 TensorRT binaries
ii libnvinfer-dev 8.4.1-1+cuda11.4 arm64 TensorRT development libraries and headers

Orin32 JetPack version
sudo apt show nvidia-jetpack -a
Package: nvidia-jetpack
Version: 5.0.2-b231

I attached a subset of the model. Somehow it always fails at the node circled in red.
If I remove both the circled node and the concat layer after it, everything works.
Or, if I keep the circled node and delete only the concat after it, everything still works.
So odd; it doesn’t seem to be related to model size or GPU memory size.

Hi,

Is the layer listed below the one named node 772?

[11/23/2022-17:12:13] [V] [TRT] Parsing node: node_of_772 [Conv]
[11/23/2022-17:12:13] [V] [TRT] Searching for input: 771
[11/23/2022-17:12:13] [V] [TRT] Searching for input: hm_gcp.2.weight
[11/23/2022-17:12:13] [V] [TRT] Searching for input: hm_gcp.2.bias
[11/23/2022-17:12:13] [V] [TRT] node_of_772 [Conv] inputs: [771 -> (1, 64, 88, 168)[FLOAT]], [hm_gcp.2.weight -> (4, 64, 1, 1)[FLOAT]], [hm_gcp.2.bias -> (4)[FLOAT]], 
[11/23/2022-17:12:13] [V] [TRT] Convolution input dimensions: (1, 64, 88, 168)
[11/23/2022-17:12:13] [V] [TRT] Registering layer: node_of_772 for ONNX node: node_of_772
[11/23/2022-17:12:13] [V] [TRT] Using kernel: (1, 1), strides: (1, 1), prepadding: (0, 0), postpadding: (0, 0), dilations: (1, 1), numOutputs: 4
[11/23/2022-17:12:13] [V] [TRT] Convolution output dimensions: (1, 4, 88, 168)
[11/23/2022-17:12:13] [V] [TRT] Registering tensor: 772 for ONNX tensor: 772
[11/23/2022-17:12:13] [V] [TRT] node_of_772 [Conv] outputs: [772 -> (1, 4, 88, 168)[FLOAT]], 

We found some cuDNN errors in your network.
Could you try running inference on the model without cuDNN to see if it works?

For example:

$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --tacticSources=-CUDNN
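To apply the same flag to the model from this thread, the command line can be assembled programmatically so that the workspace size and the disabled tactic sources are easy to vary between runs. This is a minimal sketch: `trtexec_command` and its parameters are hypothetical, the file names are the ones used earlier in the thread, and nothing is executed.

```python
import shlex


def trtexec_command(onnx, engine=None, workspace_mib=None,
                    disable_sources=(), extra=()):
    """Assemble a trtexec argument list; nothing is executed here."""
    args = ["/usr/src/tensorrt/bin/trtexec", f"--onnx={onnx}"]
    if engine:
        args.append(f"--saveEngine={engine}")
    args.extend(extra)
    if workspace_mib is not None:
        args.append(f"--memPoolSize=workspace:{workspace_mib}")
    if disable_sources:
        # A leading '-' removes a source from the default set, e.g. -CUDNN.
        args.append("--tacticSources=" + ",".join(f"-{s}" for s in disable_sources))
    return args


cmd = trtexec_command(
    "c3dv2.3.k4.onnx",
    engine="c3d_best.engine",
    workspace_mib=26000,
    disable_sources=["CUDNN"],
    extra=["--best", "--allowGPUFallback"],
)
print(shlex.join(cmd))
```

On the device, the resulting list could be passed to `subprocess.run(cmd, check=True)`.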

Thanks.

Thanks. I can’t test this for now, since I upgraded to TensorRT 8.5.1 and got a new problem :(
It’s still GPU memory, though. It seems all my problems are related to this magic GPU memory.