TL;DR:
- Getting "Internal DLA error" on convolution layers with dilation of 12-15
- Engine builds succeed only intermittently, even when all inputs seem correct
We’re trying to load a variation of ERFNet from an ONNX file and run it on the Xavier DLA core. We’ve made a few changes to fit within the stated limits of the DLA (a rough sketch of how these adjustments can be applied follows the list):
- Reduced the maximum dilation on convolution layers from 16 to 15
- Use symmetric padding on the deconvolution layers
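For reference, here is a minimal sketch of the dilation adjustment, written against the parsed INetworkDefinition with the TensorRT 5.x API. The helper name and the idea of patching the parsed network (rather than the ONNX graph itself) are just for illustration, and the deconvolution padding change is not shown since that was made symmetric at export time:

// Illustrative only: clamp convolution dilation to the DLA limit (15)
// on an already-parsed INetworkDefinition.
#include <NvInfer.h>
#include <algorithm>

void clampDilationForDla(nvinfer1::INetworkDefinition& network, int maxDilation = 15)
{
    using namespace nvinfer1;
    for (int i = 0; i < network.getNbLayers(); ++i)
    {
        ILayer* layer = network.getLayer(i);
        if (layer->getType() != LayerType::kCONVOLUTION)
            continue;

        auto* conv = static_cast<IConvolutionLayer*>(layer);
        DimsHW d = conv->getDilation();
        d.h() = std::min(d.h(), maxDilation);
        d.w() = std::min(d.w(), maxDilation);
        conv->setDilation(d);
    }
}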
After making these changes, the initial parsing of the ONNX file succeeds, and every layer returns true from canRunOnDLA().
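For context, the DLA placement and the per-layer check look roughly like this (TensorRT 5.x IBuilder API; the function name and the core index are placeholders rather than our exact code):

#include <NvInfer.h>
#include <iostream>

void configureDla(nvinfer1::IBuilder& builder, nvinfer1::INetworkDefinition& network)
{
    using namespace nvinfer1;
    builder.setFp16Mode(true);                       // DLA requires FP16 (or INT8)
    builder.setDefaultDeviceType(DeviceType::kDLA);  // place layers on the DLA by default
    builder.setDLACore(0);                           // first DLA core on Xavier

    // This is the check that currently reports true for every layer.
    for (int i = 0; i < network.getNbLayers(); ++i)
    {
        ILayer* layer = network.getLayer(i);
        if (!builder.canRunOnDLA(layer))
        {
            std::cout << "Layer " << i << " (" << layer->getName()
                      << ") cannot run on the DLA\n";
        }
    }
}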
However, when attempting to build the engine, the builder emits a pair of error messages for each dilation-15 layer and later aborts with an assertion failure:
E1108 15:35:00.808091 13500 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 74) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1108 15:35:00.809891 13500 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 74) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1108 15:35:00.850761 13500 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 102) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
E1108 15:35:00.852488 13500 nvinfer.cpp:48] nvinfer1 error: Internal DLA error for layer (Unnamed Layer* 102) [Convolution]. Use allowGPUFallback() to enable GPU fallback.
cuda_experiment: ../builder/cudnnBuilder2.cpp:761: nvinfer1::builder::buildSingleLayer(nvinfer1::rt::EngineBuildContext&, nvinfer1::builder::Node&, const RegionDict&, nvinfer1::cudnn::RegionScales*, bool)::<lambda(const nvinfer1::rt::EngineTensor&)>: Assertion `et.region->getType() == RegionType::kNVM' failed.
Thread 1 "cuda_experiment" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
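For reference, the fallback the error message points at would be enabled like this, continuing from the configureDla() sketch above (the goal here is to keep the whole network on the DLA, so this is noted only as a diagnostic, and whether it also avoids the later assertion is untested):

// Workaround suggested by the error message: let layers the DLA rejects
// fall back to the GPU at build time (sketch only).
builder.allowGPUFallback(true);
nvinfer1::ICudaEngine* engine = builder.buildCudaEngine(network);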
So I tried further reducing the maximum dilation in the layers flagged by those errors. I had to go all the way down to 11 before the Internal DLA errors stopped. But even after doing that, I still usually end up seeing the assertion failure from cudnnBuilder2. I say "usually" because the second time I ran this under GDB it succeeded, and after further testing I found that it occasionally works outside of GDB as well.
It looks like a bunch of threads are spun up before the assertion, so I wonder if there is a race condition somewhere in the builder?
Here is what an unsuccessful run looks like:
Building engine
[New Thread 0x7f839931c0 (LWP 13605)]
[New Thread 0x7f831921c0 (LWP 13606)]
[New Thread 0x7f829911c0 (LWP 13607)]
[New Thread 0x7f821901c0 (LWP 13608)]
[New Thread 0x7f8198f1c0 (LWP 13609)]
[New Thread 0x7f8118e1c0 (LWP 13610)]
[New Thread 0x7f8098d1c0 (LWP 13611)]
cuda_experiment: ../builder/cudnnBuilder2.cpp:761: nvinfer1::builder::buildSingleLayer(nvinfer1::rt::EngineBuildContext&, nvinfer1::builder::Node&, const RegionDict&, nvinfer1::cudnn::RegionScales*, bool)::<lambda(const nvinfer1::rt::EngineTensor&)>: Assertion `et.region->getType() == RegionType::kNVM' failed.
Thread 1 "cuda_experiment" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
And here’s a successful run:
Building engine
[New Thread 0x7f839931c0 (LWP 13619)]
[New Thread 0x7f831921c0 (LWP 13620)]
[New Thread 0x7f829911c0 (LWP 13621)]
[New Thread 0x7f821901c0 (LWP 13622)]
[New Thread 0x7f8198f1c0 (LWP 13623)]
[New Thread 0x7f8118e1c0 (LWP 13624)]
[New Thread 0x7f8098d1c0 (LWP 13625)]
[Thread 0x7f829911c0 (LWP 13621) exited]
[Thread 0x7f8118e1c0 (LWP 13624) exited]
[Thread 0x7f821901c0 (LWP 13622) exited]
[Thread 0x7f8198f1c0 (LWP 13623) exited]
[Thread 0x7f831921c0 (LWP 13620) exited]
[Thread 0x7f839931c0 (LWP 13619) exited]
[New Thread 0x7f8118e1c0 (LWP 13629)]
[New Thread 0x7f8198f1c0 (LWP 13630)]
[New Thread 0x7f821901c0 (LWP 13631)]
[New Thread 0x7f829911c0 (LWP 13632)]
[New Thread 0x7f839931c0 (LWP 13633)]
[New Thread 0x7f831921c0 (LWP 13634)]
Running inference
Time for 100 runs: 1185
[Thread 0x7f8198f1c0 (LWP 13630) exited]
[Thread 0x7f8118e1c0 (LWP 13629) exited]
[Thread 0x7f831921c0 (LWP 13634) exited]
[Thread 0x7f839931c0 (LWP 13633) exited]
[Thread 0x7f829911c0 (LWP 13632) exited]
[Thread 0x7f821901c0 (LWP 13631) exited]
[Thread 0x7f8098d1c0 (LWP 13625) exited]
[Thread 0x7f8a5601c0 (LWP 13616) exited]
[Thread 0x7f91529010 (LWP 13614) exited]
[Inferior 1 (process 13614) exited normally]