Trtexec fails with null pointer exception when useDLACore enabled

AGX Orin
TensorRT 8.5.1.7
Linux Artax 5.10.104-tegra #1 SMP PREEMPT Wed Aug 10 20:17:07 PDT 2022 aarch64 aarch64 aarch64 GNU/Linux
Ubuntu “20.04.5 LTS (Focal Fossa)”
Jetpack 5.0.2 - L4T 35.1.0

Full error: [03/02/2023-09:19:38] [W] --workspace flag has been deprecated by --memPoolSize - Pastebin.com

To reproduce:
using this source model:
rvm_mobilenetv3_fp32_input.onnx (14.3 MB)

execute command:
trtexec --onnx=rvm_mobilenetv3_fp32_input.onnx --workspace=8000 --saveEngine=rvm_mobilenetv3_fp32_output.engine --verbose --useDLACore=0 --allowGPUFallback
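
(Note: the log linked above warns that --workspace has been deprecated in favour of --memPoolSize. On newer trtexec builds, an equivalent invocation would be the following sketch, with the workspace pool size given in MiB:)

trtexec --onnx=rvm_mobilenetv3_fp32_input.onnx --memPoolSize=workspace:8000 --saveEngine=rvm_mobilenetv3_fp32_output.engine --verbose --useDLACore=0 --allowGPUFallback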

result:

[03/02/2023-09:19:46] [E] Error[2]: [eglUtils.cpp::operator()::72] Error Code 2: Internal Error (Assertion (eglCreateStreamKHR) != nullptr failed. )
[03/02/2023-09:19:46] [E] Error[2]: [builder.cpp::buildSerializedNetwork::636] Error Code 2: Internal Error (Assertion engine != nullptr failed. )
[03/02/2023-09:19:46] [E] Engine could not be created from network
[03/02/2023-09:19:46] [E] Building engine failed
[03/02/2023-09:19:46] [E] Failed to create engine from model or file.
[03/02/2023-09:19:46] [E] Engine set up failed

Hi,
Please refer to the installation steps in the link below in case you are missing anything.

Also, we suggest using the TRT NGC containers to avoid any system-dependency-related issues.
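
For reference, pulling and running the container could look like the sketch below. The image tag here is only an example; the --runtime nvidia flag and the bind mount are assumptions about a typical Jetson Docker setup, so adjust them to your environment.

# Sketch only: pick a tag that matches your JetPack / CUDA version.
docker pull nvcr.io/nvidia/tensorrt:23.02-py3
# On a Jetson host the NVIDIA container runtime exposes the integrated GPU;
# Jetson-specific builds are also published under nvcr.io/nvidia/l4t-tensorrt.
docker run -it --rm --runtime nvidia -v $(pwd):/workspace nvcr.io/nvidia/tensorrt:23.02-py3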

Thanks!

Hmmm, trtexec fails with

[03/04/2023-21:58:47] [W] [TRT] Unable to determine GPU memory usage
[03/04/2023-21:58:47] [W] [TRT] Unable to determine GPU memory usage
[03/04/2023-21:58:47] [I] [TRT] [MemUsageChange] Init CUDA: CPU +5, GPU +0, now: CPU 17, GPU 0 (MiB)
[03/04/2023-21:58:47] [W] [TRT] CUDA initialization failure with error: 222. Please check your CUDA installation:  http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
[03/04/2023-21:58:47] [E] Builder creation failed
[03/04/2023-21:58:47] [E] Failed to create engine from model or file.
[03/04/2023-21:58:47] [E] Engine set up failed

when I attempt to build using the container nvcr.io/nvidia/tensorrt:23.02-py3.

Even attempting to build the bundled Python deps using /opt/tensorrt/python/python_setup.sh fails (full log: tech@Artax:/opt/metamirror/src/resources/builder$ ./test.sh - Pastebin.com).

I resolved the build failure by adding the tegra-gl folder to the LD_LIBRARY_PATH; however, the resulting model's inference performance is dismal: 6.58 FPS with the DLA core enabled versus 42.71 FPS without. Also, with the DLA core enabled, TensorRT appears to create its own GL context, which prevents our business logic from creating one and crashes the application.
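
For anyone hitting the same eglCreateStreamKHR assertion: the fix was simply making the Tegra GL/EGL libraries visible to the dynamic linker before running trtexec. A rough sketch follows; the exact library paths vary by L4T release and container image, so treat them as assumptions.

# Point the dynamic linker at the Tegra GL/EGL libraries.
# These paths are typical for JetPack 5.x images; yours may differ.
export LD_LIBRARY_PATH=/usr/lib/aarch64-linux-gnu/tegra:/usr/lib/aarch64-linux-gnu/tegra-egl:$LD_LIBRARY_PATH
# Then re-run the trtexec command from the first post.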

Correct me if I am wrong, but I was under the impression that running mixed precision across the deep learning accelerator cores with GPU fallback was expected to increase inference performance. (Puzzled.)

What I am seeing seems to mirror what’s mentioned here: Run pure conv2d node on DLA makes GPU get slower - #10 by AastaLLL

Hi @manbehindthemadness,

Are you still facing the issue?

Thanks

The issue itself (the null pointer) was resolved by including the GL libraries in LD_LIBRARY_PATH; however, running small-batch real-time inference is still painfully slow with mixed precision. Are the DLAs designed for training acceleration only?

@manbehindthemadness
Please check out the DLA GitHub page for samples and resources: Recipes and tools for running deep learning workloads on NVIDIA DLA cores for inference applications.

We have a FAQ page that addresses some common questions that we see developers run into: Deep-Learning-Accelerator-SW/FAQ
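
One point from those resources that is directly relevant to the FPS numbers above: the DLA executes layers only in FP16 or INT8 precision, so an engine built from an FP32 ONNX model needs --fp16 (or --int8 plus calibration) for any layers to be placed on the DLA; otherwise they fall back to the GPU. A rough sketch based on the command from the first post (the output engine name is just an example):

trtexec --onnx=rvm_mobilenetv3_fp32_input.onnx --memPoolSize=workspace:8000 --fp16 --useDLACore=0 --allowGPUFallback --saveEngine=rvm_mobilenetv3_dla_fp16.engine --verbose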

Thanks, I will read these resources and see what I can come up with.
Out of curiosity, where does the DLA exist physically? Is it part of the existing GPU/Tensor core arrangement, or does it have its own silicon?

It has its own silicon.

Oh, wow. So the 200-and-change TOPS rating for the Orin module is derived from the DLA cores? That is, without optimizing the engines to make use of them, a vast portion of the potential performance would remain untapped… Definitely good to know.

One final question, will optimizing my engines to make use of the DLA cores provide meaningful computation advantages when used in a real-time single-batch inference, or are they designed exclusively for multi-batch training applications?

Sorry for the confusion - the GPU/Tensor cores provide most of the AI compute (about 2/3 on the Jetson AGX Orin SoC) and the two DLA cores provide the remaining 1/3. So, in cases where there is no DLA core, like the Nano, all of the AI compute comes from the GPU.

DLA cores are optimized for real-time inference.

Thanks for the clarification

Hello, I have run into the same problem. Can you please describe the solution in more detail? I cannot find the tegra-gl folder.