Wrong results when running network on DLA instead of GPU

I posted this in another area of the forum. A moderator moved it to this one, but it’s not showing up in the list of recent posts, and no one has responded, so I’m reposting.

We have a segmentation network that we originally ran on the GPU. Over the past few months we’ve adjusted it to run on the DLA (see this other thread for that saga).

Everything looks like it’s working now: we get no errors when building the network or running inference. However, the results are incorrect. The inferred segmentation is essentially a single value across the entire output, except for a few pixels in the upper-left corner.

However, if we disable these lines:

builder->setDefaultDeviceType(DeviceType::kDLA);
builder->allowGPUFallback(false);

and run on the GPU instead, everything works. The only difference is whether the network runs on the DLA or not.

This happens with both the JetPack 4.3 developer preview and the 4.3 release.
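For reference, here is roughly how we configure the build for DLA. This is only a sketch using the TensorRT 7 Python API and a one-layer placeholder network for illustration, not our actual C++ build code; the shapes and workspace size are arbitrary:

# Minimal sketch of the DLA build configuration (TensorRT 7 Python API).
# The placeholder network, shapes, and workspace size are illustrative only;
# this needs a Jetson board with a DLA (e.g. Xavier) to actually build.
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Tiny placeholder network: one convolution, just so there is something to build.
inp = network.add_input("input", trt.float32, (1, 3, 64, 64))
kernel = np.zeros((16, 3, 3, 3), dtype=np.float32)
bias = np.zeros(16, dtype=np.float32)
conv = network.add_convolution(inp, 16, (3, 3), kernel, bias)
network.mark_output(conv.get_output(0))

config = builder.create_builder_config()
config.max_workspace_size = 1 << 28
config.set_flag(trt.BuilderFlag.FP16)            # DLA only runs FP16 or INT8
config.default_device_type = trt.DeviceType.DLA  # same as setDefaultDeviceType(kDLA)
config.DLA_core = 0
# Not setting trt.BuilderFlag.GPU_FALLBACK corresponds to allowGPUFallback(false):
# the build fails instead of silently moving unsupported layers to the GPU.
engine = builder.build_engine(network, config)
print("engine built:", engine is not None)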

Hi,

We want to reproduce this issue on our side.
Would you mind sharing your network with us for debugging?
Thanks.

We would prefer not to share a trained network on a public forum. If we can reproduce this result with an untrained network, I’ll post it here. Otherwise, is there a way to provide it privately? (preferably under NDA)

Hi,

You can share it with us via a private message.

Thanks.

So the mystery deepens. Remember from my original thread that I was getting “internal DLA error” for the layers with a dilation of 15, and that those errors suddenly went away? It turns out something about training the network causes it to behave differently. When I exported an untrained version, we started seeing those errors again.

I’m attaching a minimally trained network. We have verified that it produces different results when run on the DLA with no GPU fallback than when run on the GPU, and that it also shows the internal DLA error.

The attachment has said “SCANNING… PLEASE WAIT” for over an hour. Here it is on Google Drive:
[deleted link]

Hello, I am wondering if there is any update on this. Is the problem reproducible on your end? Is there any other information we can provide to help troubleshoot?

Hi, I am a coworker of cogwheel42. Can we get an update on this? We want to commit to Xavier for our product, but this is a serious problem.

Hi,

We are really sorry to keep you waiting.
We will share more information with you as soon as possible today.

Thanks.

Hi,
This issue has been passed to our internal DLA team.
We will share more information with you once we get their feedback.

Thanks.

Hi, I’m just wondering if there is any new information to share. Has the issue been reproduced?

Hi,

Sorry, this issue is still under investigation.

However, there is one thing worth trying.
We announced a new JetPack 4.4 DP last week, which contains TensorRT 7.1.
Would you mind checking whether this issue still occurs with TensorRT 7.1?

Thanks.

Hi,

I am a coworker of cogwheel42, and I have a three-layer reproducible example that generates incorrect results. The PyTorch code attached below outputs an ONNX file that upsamples an image by 4x using a deconvolution with stride=4 and no padding, which according to the DLA restrictions (Developer Guide :: NVIDIA Deep Learning TensorRT Documentation) should be supported.

The ONNX model runs on the DLA, but the results are incorrect (mainly darker) when the number of output channels from the deconv is more than 16.

deconv_test.txt (2.5 KB)

Environment: JetPack 4.4 DP, TensorRT 7.1, ONNX IR version 0.0.4.
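For anyone who cannot open the attachment, the reproducer looks roughly like this. This is an illustrative sketch based on the description above, not the exact attached script; the layer shapes, kernel size, input size, and names are placeholders:

# Illustrative sketch of the deconv reproducer (not the exact attached script).
# The final layer is a ConvTranspose2d with stride=4 and no padding (4x upsample);
# giving it more than 16 output channels is what triggers the darker DLA results.
import torch
import torch.nn as nn

class DeconvTest(nn.Module):
    def __init__(self, out_channels=32):  # > 16 output channels reproduces the issue
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 16, kernel_size=3, padding=1)
        # kernel_size=4 is an assumption here; with stride=4 and no padding it
        # gives an exact 4x spatial upsample.
        self.deconv = nn.ConvTranspose2d(16, out_channels, kernel_size=4,
                                         stride=4, padding=0)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        return self.deconv(x)

model = DeconvTest().eval()
dummy = torch.randn(1, 3, 64, 64)
torch.onnx.export(model, dummy, "deconv_test.onnx",
                  input_names=["input"], output_names=["output"],
                  opset_version=9)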

This is a duplicate topic for the DLA bugs with a DeepLabV3-style network.
Please check the new topic for the latest status.

Thanks.