Xavier issue with TensorRT & DLA cores

1: The fixed-point (quantization) tables produced by TensorRT 5 and TensorRT 3 are inconsistent (we found in testing that the fixed-point table generated with TensorRT 5 does not match the one from TensorRT 3).

2: The number of available DLA cores returned by the API IRuntime::getNbDLACores is unstable, occasionally returning 0 or 1 when the actual number of cores is 2.

Hi,

1.
There were many updates from TensorRT 3 to TensorRT 5.
Mainly, the cuDNN algorithms were updated.

2.
Is any other application using the DLA at the same time?

Thanks.

Yes, we are running the Tensor Cores and the DLA at the same time.
We are concerned about whether unstable hardware or some other issue could be causing this. Any idea what makes this happen?

Hi denvend,

Please provide the steps so we can try to reproduce it, and be sure to use the latest JetPack.

Thanks

Hitting the same issue with the DLA core query. I’m using JetPack 4.2.2.

my test code:

#include <iostream>
#include <memory>

#include "NvInfer.h"
#include "common.h"  // samplesCommon::InferDeleter and the sample Logger

static Logger gLogger;

int main(int argc, char** argv)
{
    auto builder = std::unique_ptr<nvinfer1::IBuilder, samplesCommon::InferDeleter>(nvinfer1::createInferBuilder(gLogger));
    if (!builder)
        return 1; // main returns int, not bool
    std::cout << "Num DLA cores: " << builder->getNbDLACores() << std::endl;

    return 0;
}

result:

user@host:/user/host/dla$ while true; do ./query_dla_cores; done
Num DLA cores: 0
Num DLA cores: 0
Num DLA cores: 0
Num DLA cores: 2

Am I doing anything wrong?

Hi,

Thanks for your question.

We are verifying this issue internally with our latest software.
We will share more information with you once we have an update.

Thanks.

Hi, shengliang.xu

We cannot reproduce this issue on our board.
Are you using a customized Xavier board?

Thanks.

haha, this is for DRIVE AGX :)

DRIVE AGX Xavier™ Developer Kit (SKU 2000)

Hi AastaLLL, I was inside a Docker image. Could that be the cause? By the way, I retried running the simple query program outside of Docker, and it does indeed seem to work well.

I think I’m having a similar issue in this thread https://devtalk.nvidia.com/default/topic/1066291/jetson-agx-xavier/multiple-issues-running-nets-on-dla/post/5410797/#5410797

Loading an ONNX file with the default device set to DLA occasionally fails with:

I tell it to use core 0, so mCore.numEngines() must be returning 0 for some reason

There are no other applications running on the DLA. Is it possible something didn’t get cleaned up from a previous run and the hardware still thinks it’s occupied?

Some more details in case it helps:

Most of the time I’ve seen this failure, it’s in the DLANativeRunner as mentioned above. However, we call builder->getNbDLACores() to make sure it’s >= 1 before we even try creating the network. So between the time getNbDLACores() returns 2 and the time the network is being built, the “mCore” thinks it doesn’t have any engines available.

I was running a bunch of tests yesterday, and I actually had one case where getNbDLACores() failed before we built the network, so we had it fall back to the GPU.

I’ve also created an artificial test which can intermittently show this behavior:

#include <cstdint>
#include <iostream>

#include "NvInfer.h"

// Minimal logger that forwards TensorRT messages to stdout.
class CoutLogger : public nvinfer1::ILogger {
  void log(Severity severity, const char* msg) noexcept override {
    std::cout << msg << std::endl;
  }
} coutLogger;

int main(int argc, const char* argv[]) {
  int64_t count = 0;
  while (true) {
    auto builder = nvinfer1::createInferBuilder(coutLogger);
    auto num = builder->getNbDLACores();
    num = builder->getNbDLACores();
    num = builder->getNbDLACores();
    if (num == 0) {
      std::cout << count << " good before failure\n";
      return 1;
    } else {
      ++count;
    }
    builder->destroy();
  }
}

To get it to fail, I started a tmux session over ssh and ran this. It would run successfully (never exit). Then I would create a new tmux window and run it there. It would usually succeed, so I would kill the process, exit the second window, and create another. After a few rounds of this, one of the tmux windows would start failing (the process exits and it says “0 good before failure”). But it wouldn’t fail every time, maybe only 50% of the time. When it did fail, though, it always said “0 good”, and if it didn’t fail immediately, it would succeed indefinitely.