Multiple issues running nets on DLA

TL;DR:

  • Getting Internal DLA error when using convolution with dilation of 12-15
  • Engine build succeeds only intermittently, even when all inputs seem correct

We’re trying to load a variation of ERFNet from an ONNX file and run it on the Xavier DLA core. We’ve made a few changes to fit in the stated limits of the DLA:

  • Reduced the maximum dilation on convolution layers from 16 to 15
  • Use symmetric padding on the deconvolution layers

After making these changes, the initial parsing of the ONNX file succeeds, and all the layers return true for canRunOnDLA.

However, when attempting to build the engine, TensorRT emits a couple of error messages for each dilation-15 layer and later aborts with an assertion:

E1108 15:35:00.808091 13500 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 74) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1108 15:35:00.809891 13500 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 74) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1108 15:35:00.850761 13500 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 102) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1108 15:35:00.852488 13500 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 102) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
cuda_experiment: ../builder/cudnnBuilder2.cpp:761: nvinfer1::builder::buildSingleLayer(nvinfer1::rt::EngineBuildContext&, nvinfer1::builder::Node&, const RegionDict&, nvinfer1::cudnn::RegionScales*, bool)::<lambda(const nvinfer1::rt::EngineTensor&)>: Assertion `et.region->getType() == RegionType::kNVM' failed.

Thread 1 "cuda_experiment" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
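Because each failing layer is reported more than once in the output above, it can help to reduce the builder log to the distinct layer names. The following is a minimal C++ sketch. `failingDlaLayers` is a hypothetical helper, and the regular expression is inferred only from the log lines quoted in this thread, not from any documented TensorRT log format.

```cpp
#include <regex>
#include <set>
#include <sstream>
#include <string>

// Collect the distinct layer names that appear in "Internal DLA error" lines.
// ASSUMPTION: the line layout matches the log excerpt quoted in this thread.
std::set<std::string> failingDlaLayers(const std::string& log) {
    static const std::regex pattern(
        R"re(Internal DLA error for layer (.+?)\. Use allowGPUFallback)re");
    std::set<std::string> layers;
    std::istringstream lines(log);
    std::string line;
    while (std::getline(lines, line)) {
        std::smatch match;
        if (std::regex_search(line, match, pattern)) {
            // match[1] is the layer description between "layer " and ". Use".
            layers.insert(match[1].str());
        }
    }
    return layers;
}
```

Given the four error lines above, this would return two entries, `(Unnamed Layer* 74) [Convolution]` and `(Unnamed Layer* 102) [Convolution]`, which makes it easier to see whether the set of failing layers changes as the dilation is varied.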

So I tried further reducing the maximum dilation in the noted layers. I had to go all the way down to 11 in order to avoid the Internal DLA errors. But even after doing that, I still usually end up seeing the assertion failure from cudnnBuilder2. I say usually, because the second time I tried running this under GDB, it succeeded. After further testing, I found that it occasionally works even outside of GDB.

It looks like a bunch of threads are spun up before the assertion, so I wonder if there is a race condition somewhere in the builder?

Here is what an unsuccessful run looks like:

Building engine
[New Thread 0x7f839931c0 (LWP 13605)]
[New Thread 0x7f831921c0 (LWP 13606)]
[New Thread 0x7f829911c0 (LWP 13607)]
[New Thread 0x7f821901c0 (LWP 13608)]
[New Thread 0x7f8198f1c0 (LWP 13609)]
[New Thread 0x7f8118e1c0 (LWP 13610)]
[New Thread 0x7f8098d1c0 (LWP 13611)]
cuda_experiment: ../builder/cudnnBuilder2.cpp:761: nvinfer1::builder::buildSingleLayer(nvinfer1::rt::EngineBuildContext&, nvinfer1::builder::Node&, const RegionDict&, nvinfer1::cudnn::RegionScales*, bool)::<lambda(const nvinfer1::rt::EngineTensor&)>: Assertion `et.region->getType() == RegionType::kNVM' failed.

Thread 1 "cuda_experiment" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

And here’s a successful run:

Building engine
[New Thread 0x7f839931c0 (LWP 13619)]
[New Thread 0x7f831921c0 (LWP 13620)]
[New Thread 0x7f829911c0 (LWP 13621)]
[New Thread 0x7f821901c0 (LWP 13622)]
[New Thread 0x7f8198f1c0 (LWP 13623)]
[New Thread 0x7f8118e1c0 (LWP 13624)]
[New Thread 0x7f8098d1c0 (LWP 13625)]
[Thread 0x7f829911c0 (LWP 13621) exited]
[Thread 0x7f8118e1c0 (LWP 13624) exited]
[Thread 0x7f821901c0 (LWP 13622) exited]
[Thread 0x7f8198f1c0 (LWP 13623) exited]
[Thread 0x7f831921c0 (LWP 13620) exited]
[Thread 0x7f839931c0 (LWP 13619) exited]
[New Thread 0x7f8118e1c0 (LWP 13629)]
[New Thread 0x7f8198f1c0 (LWP 13630)]
[New Thread 0x7f821901c0 (LWP 13631)]
[New Thread 0x7f829911c0 (LWP 13632)]
[New Thread 0x7f839931c0 (LWP 13633)]
[New Thread 0x7f831921c0 (LWP 13634)]
Running inference
Time for 100 runs: 1185
[Thread 0x7f8198f1c0 (LWP 13630) exited]
[Thread 0x7f8118e1c0 (LWP 13629) exited]
[Thread 0x7f831921c0 (LWP 13634) exited]
[Thread 0x7f839931c0 (LWP 13633) exited]
[Thread 0x7f829911c0 (LWP 13632) exited]
[Thread 0x7f821901c0 (LWP 13631) exited]
[Thread 0x7f8098d1c0 (LWP 13625) exited]
[Thread 0x7f8a5601c0 (LWP 13616) exited]
[Thread 0x7f91529010 (LWP 13614) exited]
[Inferior 1 (process 13614) exited normally]

Hi,

Which JetPack version are you using?
Could you first check whether the issue reproduces with TensorRT v6.0?
https://developer.nvidia.com/jetpack-4_3_DP

Thanks.

We just upgraded to JetPack 4.2.2. Prior to that it was at least successfully falling back to the GPU.

I upgraded one of our units to 4.3 today and in general things are running (a lot faster too).

However, I’m still getting the Internal DLA error for layers with dilation 15 (which is supposedly within the limit).

Hi,

Just want to confirm first:

1. Can you confirm that this error is fixed in JetPack 4.3?

Assertion `et.region->getType() == RegionType::kNVM' failed.

2. What kind of internal error do you get when using dilation 15?
Is it the same as above?

Thanks.

  1. Yes, the RegionType assertion is fixed in 4.3.

  2. It is the same error as above:

E1112 09:05:14.576885 19947 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 74) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1112 09:05:14.578706 19947 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 74) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1112 09:05:14.625368 19947 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 102) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1112 09:05:14.627032 19947 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 102) [Convolution]. Use allowGPUFallback() to enable GPU fallback.

Note that I have to reduce the dilation to 11 to get it to work. Dilations of 12-15 all fail with the same error.

Any idea why dilations of 12-15 won’t work?
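One quantity that may be worth comparing against the documented DLA limits is the effective extent of a dilated kernel along one axis, (k - 1) * d + 1 for kernel size k and dilation d. The sketch below only does that arithmetic and does not identify the cause of the error. The kernel size of 3 is an assumption based on the factorized 3x1 and 1x3 convolutions of the original ERFNet, and `effectiveKernelExtent` is a hypothetical helper name.

```cpp
// Effective extent of a dilated kernel along one axis: (k - 1) * d + 1.
constexpr int effectiveKernelExtent(int kernelSize, int dilation) {
    return (kernelSize - 1) * dilation + 1;
}

// Worked values for an ASSUMED kernel size of 3 (not confirmed in this thread).
static_assert(effectiveKernelExtent(3, 11) == 23, "dilation 11: builds");
static_assert(effectiveKernelExtent(3, 12) == 25, "dilation 12: Internal DLA error");
static_assert(effectiveKernelExtent(3, 15) == 31, "dilation 15: Internal DLA error");
static_assert(effectiveKernelExtent(3, 16) == 33, "dilation 16: original network");
```

If the real kernel size differs, substituting it into the same function gives the corresponding extents for the working and failing dilations.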

Hi,

Sorry for the late update.

Could you share a simple model that reproduces the issue with a dilation of 12-15?
We would like to report this issue to our internal DLA team.

Thanks.

After spending some time on other tasks, I exported an untrained model with dilation 15 to share. But when I tested it on the AGX, it worked. I don’t know why it suddenly started working… maybe something was wrong with the performance mode? Maybe something got fixed in an update? shrug

It looks like everything’s working properly on the DLA now.

It turns out things didn’t really improve. I had our project running reliably on one of our AGXs so we set up a new master for cloning. The project ran successfully on that, so we cloned it to 3 AGXs, one of which was the original one it was working on.

Now we are getting an assortment of errors on the clones and none of them seem to be running successfully. For example:

Which suggests that the API thinks there aren’t any DLA cores on the device.

Or this one:

None of these errors are the same as the ones I was seeing before, but the fact that the last one is repeated twice makes me wonder if it’s related to the dilation 15 layers that seemed to be working.

Any thoughts?

This is happening even after I revert our network to the one with dilation 11, so it doesn’t seem to be specific to the network.

It looks like the “loadBare failed” error was due to an issue with our clones. I’m no longer experiencing that. However, I still intermittently get the assertion that there are no DLA cores available.

Hi,

It is good to know that the model works now.

For the DLA core issue, which L4T version are you using?
We cannot reproduce this issue on our side, so it’s hard for us to debug.

Can this issue be reproduced on all of your Xavier devices, or just on some of them?

Thanks.

It was originally working on L4T 32.2.2 from the JetPack 4.3 DP here: https://developer.nvidia.com/jetpack-4_3_DP

Then we built a custom driver and source_sync.sh pulled down 32.2.3. We cloned from that, and that’s when we started getting the “loadBare failed” error.

So we downgraded to 32.2.2 and rebuilt the kernel, and now it’s working again (aside from the intermittent getNbDLACores error noted here and in other threads).

Hi,

There are some dependencies between different L4T versions, especially in the GPU/DLA drivers.
So it’s recommended to match the OS version exactly if you need to build a customized kernel.

Thanks.
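A quick way to catch the kind of version mismatch described above is to compare the L4T version reported on each cloned unit before debugging anything else. The following C++ sketch parses a release line and compares two of them. It assumes the line looks like `# R32 (release), REVISION: 2.2, ...`, as commonly found in /etc/nv_tegra_release. That format and the helper names `l4tVersionFromReleaseLine` and `sameL4tVersion` are assumptions for illustration, not something confirmed in this thread.

```cpp
#include <regex>
#include <string>

// Turn a release line such as "# R32 (release), REVISION: 2.2, ..." into
// "32.2.2". Returns an empty string if the line does not match.
// ASSUMPTION: the line format shown here; verify against the actual file.
std::string l4tVersionFromReleaseLine(const std::string& line) {
    static const std::regex pattern(
        R"re(R(\d+) \(release\), REVISION: (\d+)\.(\d+))re");
    std::smatch match;
    if (!std::regex_search(line, match, pattern)) {
        return "";
    }
    return match[1].str() + "." + match[2].str() + "." + match[3].str();
}

// True only if both lines parse and report the same L4T version.
bool sameL4tVersion(const std::string& lineA, const std::string& lineB) {
    const std::string versionA = l4tVersionFromReleaseLine(lineA);
    return !versionA.empty() && versionA == l4tVersionFromReleaseLine(lineB);
}
```

Under the assumed format, a unit reporting `REVISION: 2.2` and a kernel source tree synced at 32.2.3 would compare as different, which matches the situation that preceded the “loadBare failed” error.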