1: The fixed-point behavior differs between TensorRT 5 and TensorRT 3 (our tests found that the fixed-point tables produced by TensorRT 5 and TensorRT 3 are inconsistent).
2: The number of available DLA cores returned by the IRuntime::getNbDLACores API is unstable, occasionally returning 0 or 1 when the actual number of cores is 2.
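For reference, a minimal sketch of the kind of query that shows the problem (the logger class and names here are illustrative, not our production code):

#include <iostream>
#include <NvInfer.h>

// Illustrative logger only; any ILogger implementation would do here.
class StdoutLogger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) override { std::cout << msg << std::endl; }
} gLogger;

int main() {
    auto runtime = nvinfer1::createInferRuntime(gLogger);
    // Expected to print 2 on this device; we occasionally see 0 or 1 instead.
    std::cout << "DLA cores: " << runtime->getNbDLACores() << std::endl;
    runtime->destroy();
    return 0;
}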
Yes, we are running the Tensor Cores and the DLA at the same time.
We are wondering whether unstable hardware or some other issue could be making this happen.
Hi AastaLLL, I was inside a Docker image. Could that be the cause? By the way, I’ve retried running the simple query program outside of Docker, and it does indeed seem to work well.
Loading an ONNX file with the default device set to DLA occasionally fails with:
I tell it to use core 0, so mCore.numEngines() must be returning 0 for some reason.
There are no other applications running on the DLA. Is it possible something didn’t get cleaned up from a previous run and the hardware still thinks it’s occupied?
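For context, this is roughly how we point the builder at DLA core 0 before building (a sketch using TensorRT 5.x-style IBuilder calls; our actual loader wraps this differently):

#include <NvInfer.h>

// Sketch only: configure a TensorRT 5.x builder to target a DLA core,
// with GPU fallback for layers the DLA cannot run.
void configureForDLA(nvinfer1::IBuilder* builder, int core)
{
    builder->setFp16Mode(true);                                 // DLA needs FP16 (or INT8)
    builder->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);  // default all layers to DLA
    builder->setDLACore(core);                                  // e.g. core 0
    builder->allowGPUFallback(true);                            // unsupported layers go to the GPU
}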
Most of the time I’ve seen this failure, it’s in the DLANativeRunner as mentioned above. However, we call builder->getNbDLACores() to make sure it’s >= 1 before we even try creating the network. So between the time getNbDLACores() returns 2 and the time the network is being built, the “mCore” decides it doesn’t have any engines available.
I was running a bunch of tests yesterday, and I actually had one case where getNbDLACores() failed before we built the network, so we had it fall back to the GPU.
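The check-and-fallback logic is roughly this (an illustrative sketch, not our real code):

#include <NvInfer.h>

// Illustrative: pick DLA only if the builder reports at least one core,
// otherwise fall back to the GPU as described above.
nvinfer1::DeviceType chooseDevice(nvinfer1::IBuilder* builder)
{
    if (builder->getNbDLACores() >= 1) {
        return nvinfer1::DeviceType::kDLA;  // normally 2 cores on this device
    }
    return nvinfer1::DeviceType::kGPU;      // intermittently: 0 cores reported
}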
I’ve also created an artificial test which can intermittently show this behavior:
#include <cstdint>
#include <iostream>

#include <NvInfer.h>

// Minimal logger that prints TensorRT messages to stdout
// (stands in for the "coutLogger" referenced but not shown above).
class CoutLogger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) override {
        std::cout << msg << std::endl;
    }
} coutLogger;

int main(int argc, const char* argv[]) {
    int64_t count = 0;
    while (true) {
        auto builder = nvinfer1::createInferBuilder(coutLogger);
        // Query the DLA core count several times; it should always be 2 on this board.
        auto num = builder->getNbDLACores();
        num = builder->getNbDLACores();
        num = builder->getNbDLACores();
        if (num == 0) {
            // The DLA cores "disappeared": report how many iterations succeeded first.
            std::cout << count << " good before failure\n";
            return 1;
        }
        ++count;
        builder->destroy();
    }
}
To get it to fail, I started a tmux session over ssh and ran this. It would run successfully (never exit). Then I would create a new tmux window and run it there too. It would usually succeed, so I would kill the process, exit the second window, and create another. After a few rounds of this, one of the tmux windows would start failing (the process exits and prints “0 good before failure”). But it wouldn’t fail every time, maybe only 50% of the time. Every time it failed it would say “0 good”, though, and if it didn’t fail immediately, it would succeed indefinitely.