Multiple issues running nets on DLA

cogwheel42 · November 8, 2019, 11:49pm

TL;DR:

Getting Internal DLA error when using convolution with dilation of 12-15
Engine build succeeds only intermittently, even when all inputs seem correct

We’re trying to load a variation of ERFNet from an ONNX file and run it on the Xavier DLA core. We’ve made a few changes to fit in the stated limits of the DLA:

Reduced the maximum dilation on convolution layers from 16 to 15
Used symmetric padding on the deconvolution layers

After making these changes, the initial parsing of the ONNX file succeeds, and all the layers return true for canRunOnDLA.

However, when attempting to build the engine, the builder emits a couple of error messages for each dilation-15 layer, and later aborts with an assertion:

E1108 15:35:00.808091 13500 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 74) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1108 15:35:00.809891 13500 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 74) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1108 15:35:00.850761 13500 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 102) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1108 15:35:00.852488 13500 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 102) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
cuda_experiment: ../builder/cudnnBuilder2.cpp:761: nvinfer1::builder::buildSingleLayer(nvinfer1::rt::EngineBuildContext&, nvinfer1::builder::Node&, const RegionDict&, nvinfer1::cudnn::RegionScales*, bool)::<lambda(const nvinfer1::rt::EngineTensor&)>: Assertion `et.region->getType() == RegionType::kNVM' failed.

Thread 1 "cuda_experiment" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

So I tried further reducing the maximum dilation in the noted layers. I had to go all the way down to 11 in order to avoid the Internal DLA errors. But even after doing that, I still usually end up seeing the assertion failure from cudnnBuilder2. I say usually, because the second time I tried running this under GDB, it succeeded. After further testing, I found that it occasionally works even outside of GDB.

It looks like a bunch of threads are spun up before the assertion, so I wonder if there is a race condition somewhere in the builder?

Here is what an unsuccessful run looks like:

Building engine
[New Thread 0x7f839931c0 (LWP 13605)]
[New Thread 0x7f831921c0 (LWP 13606)]
[New Thread 0x7f829911c0 (LWP 13607)]
[New Thread 0x7f821901c0 (LWP 13608)]
[New Thread 0x7f8198f1c0 (LWP 13609)]
[New Thread 0x7f8118e1c0 (LWP 13610)]
[New Thread 0x7f8098d1c0 (LWP 13611)]
cuda_experiment: ../builder/cudnnBuilder2.cpp:761: nvinfer1::builder::buildSingleLayer(nvinfer1::rt::EngineBuildContext&, nvinfer1::builder::Node&, const RegionDict&, nvinfer1::cudnn::RegionScales*, bool)::<lambda(const nvinfer1::rt::EngineTensor&)>: Assertion `et.region->getType() == RegionType::kNVM' failed.

Thread 1 "cuda_experiment" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

And here’s a successful run:

Building engine
[New Thread 0x7f839931c0 (LWP 13619)]
[New Thread 0x7f831921c0 (LWP 13620)]
[New Thread 0x7f829911c0 (LWP 13621)]
[New Thread 0x7f821901c0 (LWP 13622)]
[New Thread 0x7f8198f1c0 (LWP 13623)]
[New Thread 0x7f8118e1c0 (LWP 13624)]
[New Thread 0x7f8098d1c0 (LWP 13625)]
[Thread 0x7f829911c0 (LWP 13621) exited]
[Thread 0x7f8118e1c0 (LWP 13624) exited]
[Thread 0x7f821901c0 (LWP 13622) exited]
[Thread 0x7f8198f1c0 (LWP 13623) exited]
[Thread 0x7f831921c0 (LWP 13620) exited]
[Thread 0x7f839931c0 (LWP 13619) exited]
[New Thread 0x7f8118e1c0 (LWP 13629)]
[New Thread 0x7f8198f1c0 (LWP 13630)]
[New Thread 0x7f821901c0 (LWP 13631)]
[New Thread 0x7f829911c0 (LWP 13632)]
[New Thread 0x7f839931c0 (LWP 13633)]
[New Thread 0x7f831921c0 (LWP 13634)]
Running inference
Time for 100 runs: 1185
[Thread 0x7f8198f1c0 (LWP 13630) exited]
[Thread 0x7f8118e1c0 (LWP 13629) exited]
[Thread 0x7f831921c0 (LWP 13634) exited]
[Thread 0x7f839931c0 (LWP 13633) exited]
[Thread 0x7f829911c0 (LWP 13632) exited]
[Thread 0x7f821901c0 (LWP 13631) exited]
[Thread 0x7f8098d1c0 (LWP 13625) exited]
[Thread 0x7f8a5601c0 (LWP 13616) exited]
[Thread 0x7f91529010 (LWP 13614) exited]
[Inferior 1 (process 13614) exited normally]
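
For anyone trying to pin down an intermittent failure like this one, a minimal sketch of a repeat-run harness is below. It is not part of the original setup: the function name `count_failures` and the idea of invoking the engine-building binary as a subprocess are assumptions for illustration. It only counts how often the process exits abnormally, which gives a failure rate to compare across JetPack versions or dilation settings.

```python
import subprocess
import sys


def count_failures(cmd, runs=20, timeout=600):
    """Run `cmd` repeatedly and return (failure count, list of return codes)."""
    return_codes = []
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            return_codes.append(result.returncode)
        except subprocess.TimeoutExpired:
            # A hung build is recorded as None and counted as a failure.
            return_codes.append(None)
    failures = sum(1 for rc in return_codes if rc != 0)
    return failures, return_codes


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python flaky.py ./cuda_experiment <args...>
    failures, codes = count_failures(sys.argv[1:], runs=20)
    print(f"{failures}/{len(codes)} runs failed; return codes: {codes}")
```

On POSIX systems a process killed by a signal reports a negative return code, so an abort from a failed assertion (SIGABRT, signal 6) would show up as -6 in the list.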

AastaLLL · November 11, 2019, 9:01am

Hi,

Which JetPack version are you using?
Could you first try to reproduce this issue with TensorRT v6.0?
https://developer.nvidia.com/jetpack-4_3_DP

Thanks.

cogwheel42 · November 12, 2019, 1:27am

We just upgraded to JetPack 4.2.2. Prior to that it was at least successfully falling back to the GPU.

I upgraded one of our units to 4.3 today and in general things are running (a lot faster too).

However, I’m still getting the Internal DLA error for layers with dilation 15 (which is supposedly within the limit).

AastaLLL · November 12, 2019, 3:40am

Hi,

Just want to confirm first:

1. Is the error below fixed in JetPack 4.3?

Assertion `et.region->getType() == RegionType::kNVM' failed.

2. What kind of internal error do you see when using dilation 15?
Is it the same as above?

Thanks.

cogwheel42 · November 12, 2019, 5:10pm

1. Yes, the RegionType assertion is fixed in 4.3.
2. It is the same error as above:

E1112 09:05:14.576885 19947 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 74) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1112 09:05:14.578706 19947 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 74) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1112 09:05:14.625368 19947 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 102) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1112 09:05:14.627032 19947 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 102) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
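
When a build log is long, the affected layers can be pulled out with a short script. The sketch below is an illustration only: the regular expression is derived from the message format in the log above, and the helper name `failing_layers` is hypothetical. It assumes layer names do not themselves contain a closing parenthesis.

```python
import re

# Matches the message format seen in the log above, e.g.
# "Internal DLA error for layer (Unnamed Layer* 74) [Convolution]."
DLA_ERROR = re.compile(
    r"Internal DLA error for layer \((?P<name>[^)]*)\) \[(?P<kind>[^\]]+)\]"
)

SAMPLE = """\
E1112 09:05:14.576885 19947 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 74) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1112 09:05:14.578706 19947 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 74) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1112 09:05:14.625368 19947 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 102) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1112 09:05:14.627032 19947 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 102) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
"""


def failing_layers(log_text):
    """Return {layer name: (layer type, occurrence count)} in first-seen order."""
    seen = {}
    for match in DLA_ERROR.finditer(log_text):
        name, kind = match.group("name"), match.group("kind")
        _, count = seen.get(name, (kind, 0))
        seen[name] = (kind, count + 1)
    return seen


for name, (kind, count) in failing_layers(SAMPLE).items():
    print(f"{name} [{kind}] reported {count} time(s)")
```

For the log above this lists two convolution layers, each reported twice, which makes it easier to map the failures back to the dilated layers in the ONNX graph.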

cogwheel42 · November 12, 2019, 5:12pm

Note that I have to turn the dilation to 11 in order to get it to work. 12-15 all fail with the same error.
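
To make the numbers concrete, the sketch below computes how many input pixels a dilated kernel spans along one axis, using the standard formula dilation * (kernel_size - 1) + 1. The 3-tap kernel size is an assumption (ERFNet's factorized 3x1 / 1x3 convolutions); the thread does not state the kernel size. This is arithmetic only and does not explain why the DLA rejects dilations 12-15; whether the failure relates to the span at all is a guess.

```python
def effective_kernel_extent(kernel_size, dilation):
    """Span, in input pixels along one axis, covered by a dilated kernel."""
    return dilation * (kernel_size - 1) + 1


# Assumption: 3-tap kernels. Dilations 11 through 16 cover the values
# discussed in this thread.
for dilation in range(11, 17):
    extent = effective_kernel_extent(3, dilation)
    print(f"dilation {dilation}: kernel spans {extent} input pixels")
```

Under that assumption, dilation 11 spans 23 pixels, dilation 12 spans 25, dilation 15 spans 31, and the original dilation of 16 spans 33.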

cogwheel42 · November 15, 2019, 6:48pm

Any idea why dilations of 12-15 won’t work?

AastaLLL · November 26, 2019, 9:04am

Hi,

Sorry for the late update.

Could you share a simple model with dilation of 12-15 that reproduces the issue?
We want to report this issue to our internal DLA team.

Thanks.

cogwheel42 · December 5, 2019, 11:22pm

After spending some time on other tasks, I exported an untrained model with dilation 15 to share. But when I tested it on the AGX, it worked. I don’t know why it suddenly started working… maybe something was wrong with the performance mode? Maybe something got fixed in an update? shrug

It looks like everything’s working properly on the DLA now.

cogwheel42 · December 10, 2019, 7:52pm

It turns out things didn’t really improve. I had our project running reliably on one of our AGXs so we set up a new master for cloning. The project ran successfully on that, so we cloned it to 3 AGXs, one of which was the original one it was working on.

Now we are getting an assortment of errors on the clones and none of them seem to be running successfully. For example:

Which suggests that the API thinks there aren’t any DLA cores on the device.

Or this one:

None of these errors are the same as the ones I was seeing before, but the fact that the last one is repeated twice makes me wonder if it’s related to the dilation 15 layers that seemed to be working.

Any thoughts?

cogwheel42 · December 10, 2019, 9:26pm

This is happening even after I revert our network to the one with dilation 11, so it doesn’t seem to be specific to the network.

cogwheel42 · December 12, 2019, 12:20am

It looks like the “loadBare failed” error was due to an issue with our clones. I’m no longer experiencing that. However, I still intermittently get the assertion that there are no DLA cores available.

AastaLLL · December 23, 2019, 9:38am

Hi,

It is good to know that the model works now.

For the DLA core issue, which L4T version are you using?
We cannot reproduce this issue on our side, so it’s hard for us to debug.

Can this issue be reproduced on all of your Xavier devices, or just some of them?

Thanks.

cogwheel42 · December 23, 2019, 6:06pm

It was originally working on L4T 32.2.2 from the JetPack 4.3 DP (JetPack 4.3 Developer Preview | NVIDIA Developer).

Then we built a custom driver and source_sync.sh pulled down 32.2.3. We cloned from that, and that’s when we started getting the “loadBare failed” error.

So we downgraded to 32.2.2 and rebuilt the kernel, and now it’s working again (aside from the intermittent getNbDLACores error noted here and in other threads).

AastaLLL · December 24, 2019, 5:48am

Hi,

There are some dependencies between different L4T versions, especially in the GPU/DLA drivers.
So it’s recommended to build a customized kernel from sources that match the installed OS version.
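
One way to catch a version mismatch before building is to read the installed L4T version and compare it with the tag of the kernel sources. The sketch below is an illustration under assumptions: it expects the first line of /etc/nv_tegra_release to start with something like "# R32 (release), REVISION: 2.2", and the helper names are hypothetical. Check the actual file on your device before relying on the pattern.

```python
import re

# Assumed format of the first line of /etc/nv_tegra_release:
#   "# R32 (release), REVISION: 2.2, ..."
RELEASE_LINE = re.compile(
    r"#\s*R(?P<major>\d+)\s*\(release\),\s*REVISION:\s*(?P<revision>[\d.]+)"
)


def parse_l4t_version(text):
    """Return an L4T version string such as '32.2.2', or None if unrecognised."""
    match = RELEASE_LINE.search(text)
    if match is None:
        return None
    return f"{match.group('major')}.{match.group('revision')}"


def read_l4t_version(path="/etc/nv_tegra_release"):
    """Read the installed L4T version, or None if the file is missing."""
    try:
        with open(path) as handle:
            return parse_l4t_version(handle.readline())
    except OSError:
        return None


def versions_match(installed, source_tag):
    """True only when both versions are known and identical."""
    return installed is not None and installed == source_tag
```

For example, `versions_match(read_l4t_version(), "32.2.3")` would return False on a device running 32.2.2, flagging the kind of mismatch described earlier in this thread.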

Thanks.
