Xavier issue with TensorRT & DLA cores

1: The fixed-point (quantization) tables produced by TensorRT 5 and TensorRT 3 are inconsistent (we found in testing that the fixed-point table generated with TensorRT 5 does not match the one from TensorRT 3).

2: The number of available DLA cores returned by the API IRuntime::getNbDLACores is unstable, occasionally returning 0 or 1 when the actual number of cores is 2.

Hi,

1.
There were many updates from TensorRT 3 to TensorRT 5.
Mainly, the cuDNN algorithms were updated.

2.
Is any other application using the DLA at the same time?

Thanks.

Yes, we are running the Tensor Cores and the DLA at the same time.
We are concerned about whether unstable hardware or some other issue could be causing this. Any idea what makes this happen?

Hi denvend,

Please provide the steps so we can try to reproduce it, and be sure to use the latest JetPack.

Thanks

Hitting the same issue with the DLA core query. I’m using JetPack 4.2.2.

my test code:

#include <iostream>
#include <memory>

#include "NvInfer.h"
#include "common.h"  // samplesCommon::InferDeleter and the sample Logger

static Logger gLogger;

int main(int argc, char** argv)
{
    auto builder = std::unique_ptr<nvinfer1::IBuilder, samplesCommon::InferDeleter>(nvinfer1::createInferBuilder(gLogger));
    if (!builder)
        return 1; // main returns int, not bool
    std::cout << "Num DLA cores: " << builder->getNbDLACores() << std::endl;

    return 0;
}

result:

user@host:/user/host/dla$ while true; do ./query_dla_cores; done
Num DLA cores: 0
Num DLA cores: 0
Num DLA cores: 0
Num DLA cores: 2

Am I doing anything wrong?

Hi,

Thanks for your question.

We are verifying this issue internally with our latest software.
We will share more information with you once we have an update.

Thanks.

Hi, shengliang.xu

We cannot reproduce this issue on our board.
Are you using a customized Xavier board?

Thanks.

haha, this is for DRIVE AGX :)

DRIVE AGX Xavier™ Developer Kit (SKU 2000)

Hi AastaLLL, I was inside a Docker image. Could that be the cause? By the way, I retried running the simple query program outside of Docker, and it does indeed seem to work well.

I think I’m having a similar issue in this thread https://devtalk.nvidia.com/default/topic/1066291/jetson-agx-xavier/multiple-issues-running-nets-on-dla/post/5410797/#5410797

Loading an ONNX file with the default device set to DLA occasionally fails with:

I tell it to use core 0, so mCore.numEngines() must be returning 0 for some reason

There are no other applications running on the DLA. Is it possible something didn’t get cleaned up from a previous run and the hardware still thinks it’s occupied?

Some more details in case it helps:

Most of the time I’ve seen this failure, it’s in the DLANativeRunner as mentioned above. However, we call builder->getNbDLACores() to make sure it’s >= 1 before we even try creating the network. So between the time getNbDLACores() returns 2 and the time the network is being built, the “mCore” thinks it doesn’t have any engines available.

I was running a bunch of tests yesterday, and I actually had one case where getNbDLACores() failed before we built the network, so we had it fall back to the GPU.

I’ve also created an artificial test which can intermittently show this behavior:

#include <cstdint>
#include <iostream>

#include "NvInfer.h"

// Minimal logger that forwards TensorRT messages to stdout.
class CoutLogger : public nvinfer1::ILogger {
  void log(Severity severity, const char* msg) noexcept override {
    std::cout << msg << std::endl;
  }
} coutLogger;

int main(int argc, const char* argv[]) {
  int64_t count = 0;
  while (true) {
    auto builder = nvinfer1::createInferBuilder(coutLogger);
    auto num = builder->getNbDLACores();
    num = builder->getNbDLACores();
    num = builder->getNbDLACores();
    if (num == 0) {
      std::cout << count << " good before failure\n";
      return 1;
    } else {
      ++count;
    }
    builder->destroy();
  }
}

To get it to fail, I started a tmux session over ssh and ran this. It would run successfully (never exit). Then I would create a new tmux window and run it there. It would usually succeed, so I would kill the process, exit the second window, and create another. After a few rounds of this, one of the tmux windows would start failing (the process exits and it says “0 good before failure”). But it wouldn’t fail every time, maybe only 50% of the time. When it did fail, though, it always said “0 good”, and if it didn’t fail immediately, it would succeed indefinitely.