trtexec model conversion crashed with insufficient GPU memory

I tried the approach. It still crashed. At the time of the crash, tegrastats showed memory usage was very low, only around 8 GB.

I set the workspace to 2600, 2700, 26000, and 27000, all with the same result. I rebooted the Orin box after each try.

BTW, your unit is MB, so 26 GB should be 26000, not 2600. Anyway, I tried them all.

I attached the results for both 2600 and 26000: exactly the same error.
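
For reference, the trtexec flag takes the size in MiB, while IBuilderConfig::setMemoryPoolLimit (mentioned in the error hint) takes bytes. Below is a minimal sketch of setting the same limit through the TensorRT 8.4 Python API; this is just an illustration of the setting, not what I actually ran:

import tensorrt as trt

# Minimal sketch (assumption: standard TensorRT 8.4 Python API): set the
# workspace pool limit via the builder API. set_memory_pool_limit() takes
# bytes, while trtexec's --memPoolSize=workspace:N takes MiB.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("c3dv2.3.k4.onnx", "rb") as f:
    assert parser.parse(f.read()), parser.get_error(0)
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 26000 << 20)  # 26000 MiB, in bytes
engine_bytes = builder.build_serialized_network(network, config)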

======= workspace:2600 ============================================

11-17-2022 14:22:29 RAM 8116/30536MB (lfb 4915x4MB) SWAP 0/15268MB (cached 0MB) CPU [100%@2201,0%@2201,0%@2201,0%@2201,3%@729,24%@729,22%@729,18%@729,60%@2201,2%@2201,0%@2201,0%@2201] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@49.5C Tboard@37C SOC2@45.093C Tdiode@39C SOC0@45.562C CV1@-256C GPU@44.375C tj@49.5C SOC1@44.218C CV2@-256C
11-17-2022 14:22:30 RAM 8116/30536MB (lfb 4915x4MB) SWAP 0/15268MB (cached 0MB) CPU [100%@2201,0%@2201,0%@2201,0%@2201,4%@729,19%@729,25%@729,19%@729,55%@2201,8%@2201,0%@2201,0%@2201] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@50.125C Tboard@37C SOC2@44.968C Tdiode@39.25C SOC0@45.562C CV1@-256C GPU@44.031C tj@50.125C SOC1@44.187C CV2@-256C
11-17-2022 14:22:31 RAM 8116/30536MB (lfb 4915x4MB) SWAP 0/15268MB (cached 0MB) CPU [100%@2201,0%@2201,0%@2201,0%@2201,7%@2201,22%@2201,20%@2201,50%@2201,9%@729,2%@729,0%@729,0%@729] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@50.343C Tboard@37C SOC2@45.093C Tdiode@39C SOC0@45.531C CV1@-256C GPU@44.062C tj@50.343C SOC1@44.218C CV2@-256C
11-17-2022 14:22:32 RAM 6142/30536MB (lfb 5180x4MB) SWAP 0/15268MB (cached 0MB) CPU [75%@2201,0%@2201,0%@2201,18%@2201,7%@729,14%@729,23%@729,11%@729,0%@729,0%@729,0%@729,0%@729] EMC_FREQ 0% GR3D_FREQ 5% CV0@-256C CPU@49.875C Tboard@37C SOC2@44.812C Tdiode@39C SOC0@45.625C CV1@-256C GPU@44.125C tj@49.281C SOC1@44.218C CV2@-256C
11-17-2022 14:22:33 RAM 4898/30536MB (lfb 5408x4MB) SWAP 0/15268MB (cached 0MB) CPU [0%@729,0%@729,0%@729,6%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@48.5C Tboard@37C SOC2@44.437C Tdiode@39C SOC0@45.718C CV1@-256C GPU@-256C tj@48.906C SOC1@44.218C CV2@-256C

=== trtexec log ===
[11/17/2022-14:22:31] [W] [TRT] [0xaaab3ff24180]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 735118 time: 1.28e-07
[11/17/2022-14:22:31] [W] [TRT] [0xaaaaf570aef0]:4 :: weight zero-point in internalAllocate: at runtime/common/weightsPtr.cpp: 102 idx: 155 time: 1.6e-07
[11/17/2022-14:22:31] [W] [TRT] [0xaaab3fe4c9c0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 563346 time: 9.6e-08
[11/17/2022-14:22:31] [W] [TRT] [0xaaab41d09900]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 735130 time: 2.56e-07
[11/17/2022-14:22:31] [W] [TRT] [0xaaab41cf9ff0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 735142 time: 6.4e-08
[11/17/2022-14:22:31] [W] [TRT] -------------- The current device memory allocations dump as below --------------
[11/17/2022-14:22:31] [W] [TRT] [0]:89088 :CASK device reserved buffer in initCaskDeviceBuffer: at /_src/build/aarch64-gnu/release/runtime/gpu/cask/caskUtils.h: 459 idx: 244073 time: 0.0101787
[11/17/2022-14:22:31] [W] [TRT] Requested amount of GPU memory (89088 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[11/17/2022-14:22:31] [W] [TRT] Skipping tactic 21 due to insufficient memory on requested size of 89088 detected for tactic 0xff4d370e229c1e8e.
[11/17/2022-14:22:31] [E] Error[4]: [optimizer.cpp::computeCosts::3625] Error Code 4: Internal Error (Could not find any implementation for node node_of_772 due to insufficient workspace. See verbose log for requested sizes.)
[11/17/2022-14:22:31] [E] Error[2]: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[11/17/2022-14:22:31] [E] Engine could not be created from network
[11/17/2022-14:22:31] [E] Building engine failed
[11/17/2022-14:22:31] [E] Failed to create engine from model or file.
[11/17/2022-14:22:31] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --onnx=c3dv2.3.k4.onnx --saveEngine=c3d_best.engine --best --allowGPUFallback --memPoolSize=workspace:2600

======workspace:26000 ============================
=== at/after the crash, the memory difference is only ~3 GB ===
11-17-2022 14:06:13 RAM 8131/30536MB (lfb 4884x4MB) SWAP 0/15268MB (cached 0MB) CPU [100%@2201,0%@2201,0%@2201,0%@2201,26%@729,2%@729,22%@729,18%@729,9%@2201,48%@2201,3%@2201,0%@2201] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@49.531C Tboard@37C SOC2@44.906C Tdiode@38.75C SOC0@45.718C CV1@-256C GPU@44.156C tj@49.75C SOC1@44.187C CV2@-256C
11-17-2022 14:06:14 RAM 8130/30536MB (lfb 4884x4MB) SWAP 0/15268MB (cached 0MB) CPU [96%@2201,0%@2201,0%@2201,0%@2201,19%@729,10%@729,10%@729,19%@729,0%@729,48%@729,5%@729,0%@729] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@49.218C Tboard@37C SOC2@44.687C Tdiode@39C SOC0@45.656C CV1@-256C GPU@44.187C tj@49.218C SOC1@44.218C CV2@-256C
11-17-2022 14:06:15 RAM 4905/30536MB (lfb 5358x4MB) SWAP 0/15268MB (cached 0MB) CPU [8%@729,0%@729,0%@729,26%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@48.531C Tboard@37C SOC2@44.468C Tdiode@39C SOC0@45.437C CV1@-256C GPU@-256C tj@48.812C SOC1@44.281C CV2@-256C
11-17-2022 14:06:16 RAM 4905/30536MB (lfb 5358x4MB) SWAP 0/15268MB (cached 0MB) CPU [0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729] EMC_FREQ 0% GR3D_FREQ 0% CV0@-256C CPU@48.562C Tboard@37C SOC2@44.343C Tdiode@39C SOC0@45.375C CV1@-256C GPU@-256C tj@48.562C SOC1@44.281C CV2@-256C

=== trtexec log ===
[11/17/2022-14:06:14] [W] [TRT] [0xaaab5c6c46d0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 735130 time: 2.56e-07
[11/17/2022-14:06:14] [W] [TRT] [0xaaab5b03e980]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 735139 time: 1.28e-07
[11/17/2022-14:06:14] [W] [TRT] [0xaaab0f29c220]:4 :: weight zero-point in internalAllocate: at runtime/common/weightsPtr.cpp: 102 idx: 727 time: 2.56e-07
[11/17/2022-14:06:14] [W] [TRT] [0xaaab56e3e410]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 735142 time: 6.4e-08
[11/17/2022-14:06:14] [W] [TRT] [0xaaab5afc46b0]:151 :ScratchObject in storeCachedObject: at optimizer/gpu/cudnn/convolutionBuilder.cpp: 170 idx: 499319 time: 1.92e-07
[11/17/2022-14:06:14] [W] [TRT] -------------- The current device memory allocations dump as below --------------
[11/17/2022-14:06:14] [W] [TRT] [0]:89088 :CASK device reserved buffer in initCaskDeviceBuffer: at /_src/build/aarch64-gnu/release/runtime/gpu/cask/caskUtils.h: 459 idx: 244093 time: 0.0098853
[11/17/2022-14:06:14] [W] [TRT] Requested amount of GPU memory (89088 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[11/17/2022-14:06:14] [W] [TRT] Skipping tactic 21 due to insufficient memory on requested size of 89088 detected for tactic 0xff4d370e229c1e8e.
[11/17/2022-14:06:14] [E] Error[10]: [optimizer.cpp::computeCosts::3628] Error Code 10: Internal Error (Could not find any implementation for node node_of_772.)
[11/17/2022-14:06:14] [E] Error[2]: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[11/17/2022-14:06:14] [E] Engine could not be created from network
[11/17/2022-14:06:14] [E] Building engine failed
[11/17/2022-14:06:14] [E] Failed to create engine from model or file.
[11/17/2022-14:06:14] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --onnx=c3dv2.3.k4.onnx --saveEngine=c3d_best.engine --best --allowGPUFallback --memPoolSize=workspace:26000

Hi,

Sorry for the incorrect message about the amount.
I have updated the comment above.

Could you share the model with us so we can test it in our environment as well?
Thanks.

Sorry, I can’t give out the model. It just has lots of CNN layers. In the last few layers, there are nearly 30 parallel conv layers, and then the results are concatenated.

Hi,

Just want to clarify the issue first.

There is only one model, but the model contains several layers that are expected to run in parallel.
Is that correct? Or do you have multiple models that you want to run concurrently?

To get more information about the crash, could you add --verbose for both the Orin 32GB and Orin 64GB runs and attach the output?
We want to compare the success and failure logs to find the possible cause.

Thanks.

Just one model. We would like to run it without specifying the AI core, to boost fps (or qps).
I’m attaching 2 files:
orin32_1dla.log (960.3 KB)
orin64.log (10.3 MB)

orin32_1dla.log: succeeded with --useDLACore=0, but low qps (only 23.5)
orin64.log: succeeded without --useDLACore, using the --best option, very high qps (344)

orin32.log: uses the same command as orin64, but failed. (The log file is too big, over 300 MB, and your website doesn’t allow uploading it.) I put it on Google Drive, and you can access it here:
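
For reference, here is a rough sketch of the builder-side settings the two successful runs above correspond to (an approximation assuming the TensorRT 8.4 Python API; the actual runs used the trtexec flags described above):

import tensorrt as trt

# Rough mapping of the two runs onto builder-config settings.
# Assumption: "--best" roughly corresponds to enabling FP16 + INT8;
# a real INT8 build also needs calibration data or dynamic ranges.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
config.set_flag(trt.BuilderFlag.INT8)

# orin32_1dla.log run: --useDLACore=0 with GPU fallback
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)

# orin64.log run: GPU only -- leave default_device_type at its default (GPU)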

Thanks.

Hi,

Thanks for sharing.

Based on your log, the memory configuration on both boards is set to ~30xxx MiB.
So it’s expected that the behavior will be similar and the model should work on both boards.

But it seems that the TensorRT version is 8.4.0 on the 64GB board.
Did you set up the device with JetPack 5.0.2? JetPack 5.0.2 should contain 8.4.1 instead.

Thanks.

My issue is on the Orin32 board. Someone else owns the Orin64 board, and I don’t know the details of its setup. Could you please let me know how to fix the Orin32 setup?

Orin32 TensorRT version
dpkg -l | grep nvinfer
ii libnvinfer-bin 8.4.1-1+cuda11.4 arm64 TensorRT binaries
ii libnvinfer-dev 8.4.1-1+cuda11.4 arm64 TensorRT development libraries and headers

Orin32 JetPack version
sudo apt show nvidia-jetpack -a
Package: nvidia-jetpack
Version: 5.0.2-b231

I attached a subset of the model. Somehow it always fails at the red-circled node.
If I remove both the red-circled node and the concat layer after it, everything works.
Or, if I keep the red-circled node and only delete the concat after it, everything still works.
So odd; it seems unrelated to model size or GPU memory size.
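
In case it helps to isolate it further, here is a minimal sketch of cutting out just the region around the failing node with onnx.utils.extract_model (the tensor names '771'/'772' are assumptions based on the node_of_772 error; adjust them to the real graph):

import onnx.utils

# Minimal sketch: carve out the subgraph around the failing node so it can
# be built and tested in isolation. Tensor names are assumptions; add the
# concat's output tensor if you want to include the concat as well.
onnx.utils.extract_model(
    "c3dv2.3.k4.onnx",     # full model
    "c3d_subset.onnx",     # extracted subgraph
    input_names=["771"],
    output_names=["772"],
)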

Hi,

Is the layer causing the issue the one listed below, named node_of_772?

[11/23/2022-17:12:13] [V] [TRT] Parsing node: node_of_772 [Conv]
[11/23/2022-17:12:13] [V] [TRT] Searching for input: 771
[11/23/2022-17:12:13] [V] [TRT] Searching for input: hm_gcp.2.weight
[11/23/2022-17:12:13] [V] [TRT] Searching for input: hm_gcp.2.bias
[11/23/2022-17:12:13] [V] [TRT] node_of_772 [Conv] inputs: [771 -> (1, 64, 88, 168)[FLOAT]], [hm_gcp.2.weight -> (4, 64, 1, 1)[FLOAT]], [hm_gcp.2.bias -> (4)[FLOAT]], 
[11/23/2022-17:12:13] [V] [TRT] Convolution input dimensions: (1, 64, 88, 168)
[11/23/2022-17:12:13] [V] [TRT] Registering layer: node_of_772 for ONNX node: node_of_772
[11/23/2022-17:12:13] [V] [TRT] Using kernel: (1, 1), strides: (1, 1), prepadding: (0, 0), postpadding: (0, 0), dilations: (1, 1), numOutputs: 4
[11/23/2022-17:12:13] [V] [TRT] Convolution output dimensions: (1, 4, 88, 168)
[11/23/2022-17:12:13] [V] [TRT] Registering tensor: 772 for ONNX tensor: 772
[11/23/2022-17:12:13] [V] [TRT] node_of_772 [Conv] outputs: [772 -> (1, 4, 88, 168)[FLOAT]], 

We found there are some cuDNN errors in your network.
Could you try running the model without using cuDNN to see if it works?

For example:

$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --tacticSources=-CUDNN
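
If you build through the API rather than trtexec, the rough equivalent is to clear the cuDNN bit from the tactic sources. A minimal sketch, assuming the TensorRT 8.4 Python API:

import tensorrt as trt

# Sketch: the API-side equivalent of "--tacticSources=-CUDNN" is clearing
# the cuDNN bit from the builder config's tactic-source mask.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()
sources = config.get_tactic_sources() & ~(1 << int(trt.TacticSource.CUDNN))
config.set_tactic_sources(sources)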

Thanks.

Thanks. I can’t test for now, since I upgraded to TensorRT 8.5.1 and got a new problem :(
Still GPU memory, though. It seems all my problems are related to this magic GPU memory.

Hi,

Is it possible to share the model so we can get more information about the issue?
You can send it through a private message directly.

Also, how did you upgrade to TensorRT 8.5?
Please note that you will need to use the packages included in JetPack for stability.

Thanks.

All models have issues now. I followed the steps in this link:

I got the TensorRT 8.5.1 deb package from your website, then ran “dpkg -i xxx”, then apt update, then apt install.

Then I noticed the problem with trtexec:
[12/06/2022-23:46:22] [I] TensorRT version: 8.5.1
[12/06/2022-23:46:23] [W] [TRT] Unable to determine GPU memory usage
[12/06/2022-23:46:23] [W] [TRT] Unable to determine GPU memory usage
[12/06/2022-23:46:23] [I] [TRT] [MemUsageChange] Init CUDA: CPU +8, GPU +0, now: CPU 20, GPU 0 (MiB)
[12/06/2022-23:46:23] [W] [TRT] CUDA initialization failure with error: 222. Please check your CUDA installation: Installation Guide Linux :: CUDA

It suggested the CUDA toolkit. Then I tried to install CUDA; still broken, so today I installed a newer cuDNN. All in vain.

What do you mean by the packages in JetPack? I didn’t change JetPack this time. Are you suggesting I shouldn’t install TensorRT myself, and that installing JetPack as a whole is safer?

BTW, we installed JetPack R35.1 in August, which has TensorRT 8.4.1. But nsys-ui didn’t show any DLA activity, even with “nsys profile --accelerator-mode=nvmedia”. But your company link (Developer Guide :: NVIDIA Deep Learning TensorRT Documentation) shows DLA activity. Since yours is TensorRT 8.5.1, I was hoping upgrading TensorRT would help, but regretfully everything went backward.

What’s your latest JetPack version? Maybe I should install JetPack instead of TensorRT/CUDA/cuDNN separately? Is there any need to upgrade the graphics driver?

BTW, all my attempts are logged here:

Many thanks!

OK. I removed the newly installed packages and went back to JetPack. I’m able to run again. With your suggestion, the engine build still failed:

[12/08/2022-15:23:56] [W] [TRT] Skipping tactic 21 due to insufficient memory on requested size of 89088 detected for tactic 0xff4d370e229c1e8e.
Try decreasing the workspace size with IBuilderConfig::setMemoryPoolLimit().
[12/08/2022-15:23:56] [E] Error[10]: [optimizer.cpp::computeCosts::3628] Error Code 10: Internal Error (Could not find any implementation for node node_of_772.)
[12/08/2022-15:23:56] [E] Error[2]: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[12/08/2022-15:23:56] [E] Engine could not be created from network
[12/08/2022-15:23:56] [E] Building engine failed
[12/08/2022-15:23:56] [E] Failed to create engine from model or file.
[12/08/2022-15:23:56] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8401] # /usr/src/tensorrt/bin/trtexec --onnx=c3dv2.3.k4.onnx --saveEngine=c3d_best_gpu.engine --best --allowGPUFallback --tacticSources=-CUDNN

Sorry, I’m not allowed to give the model out. I can describe the last few layers, though. As you can see in the picture above, the relu output is 24x88x168.

Then it fans out into 24 3x3 convs, each with a relu, then a 1x1 conv, then concat.
These 24 outputs are in 4 groups, i.e. concatenated into 4 heads (7 inputs, 6 inputs, 4 inputs, 7 inputs). The issue is the last conv in the 3rd group.

The odd thing is, this conv’s params are exactly the same as the other 1x1 convs.

Here are the details of the nearby nodes. The params are the same as other nodes in the model. Very odd. We have several similar models, and they all fail at the same place.
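
In case a synthetic reproducer is useful, here is a rough PyTorch sketch of the head structure described above. The 88x168 spatial size and the 64->4 1x1 conv come from the verbose log; the 3x3 conv width and the per-branch output channels are assumptions:

import torch
import torch.nn as nn

class Branch(nn.Module):
    # One of the 24 parallel branches: 3x3 conv + relu, then 1x1 conv.
    # The 64 and 4 channel counts match the verbose log for node_of_772;
    # the 24-channel input and 64-channel hidden width are assumptions.
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(24, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 4, kernel_size=1),
        )

    def forward(self, x):
        return self.body(x)

class Heads(nn.Module):
    # 24 branches grouped into 4 concatenated heads (7, 6, 4, 7 inputs).
    def __init__(self, group_sizes=(7, 6, 4, 7)):
        super().__init__()
        self.groups = nn.ModuleList(
            nn.ModuleList(Branch() for _ in range(n)) for n in group_sizes)

    def forward(self, x):
        return [torch.cat([b(x) for b in g], dim=1) for g in self.groups]

# Export a standalone ONNX file with the same fan-out/concat pattern.
model = Heads().eval()
dummy = torch.randn(1, 24, 88, 168)   # relu output shape described above
torch.onnx.export(model, dummy, "heads_repro.onnx", opset_version=13)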

Hi,

Jetson packages and libraries have dependencies on each other.
So you will need to use the combination that comes from the same JetPack version.

For the model, is there any public model that uses a similar architecture to yours?
If yes, we can try to reproduce the issue with that model.

Thanks.

Our model is CenterNet-based. I don’t have an open ONNX model at hand, though.

Hi,

Is the CenterNet below the one you mentioned?

If yes, which variant do you use?
Thanks.

There are many changes that I don’t fully understand.

BTW, another vendor has two CenterNet ONNX models, one ResNet50-based and the other ResNet101-based. Both run well on Orin32 in GPU-only mode. Since they are the private property of another vendor, I can’t give them to you either.

Since our model with a tiny change can run well on Orin32, I think it would be too hard for you to reproduce the issue.

Given we already have a workaround, I will let the other team in my company deal with the problem from now on. (My job is to evaluate the performance of Orin32, for which I already have answers: very good :)

Really appreciate your help!

Happy holidays!

Thanks for sharing this.
It’s good to know the performance meets your requirements.
