Intermittent internal CASK error when using TF-TRT

I am running a Python script on a Jetson Nano using TF-TRT. If I start the script 10 times, it runs without problems only once or twice. Most of the time I get the following error:

python3: cask/shaderlist_impl.h:50: void cask::ShaderList<ShaderType, OperationType>::sortHandles() const [with ShaderType = cask::ConvolutionShader; OperationType = cask::Convolution]: Assertion `((*i)->handle != (*prevI)->handle) && "Internal error: CASK: all shaders must have unique names"' failed.
Aborted (core dumped)

Or:

*** Received signal 11 ***
*** BEGIN MANGLED STACK TRACE ***
python3: cask/shaderlist_impl.h:50: void cask::ShaderList<ShaderType, OperationType>::sortHandles() const [with ShaderType = cask::ConvolutionShader; OperationType = cask::Convolution]: Assertion `((*i)->handle != (*prevI)->handle) && "Internal error: CASK: all shaders must have unique names"' failed.
*** Received signal 6 ***
*** BEGIN MANGLED STACK TRACE ***
/usr/local/lib/python3.6/dist-packages/tensorflow/python/…/libtensorflow_framework.so(+0x72e368)[0x7f7005a368]
*** END MANGLED STACK TRACE ***

/usr/local/lib/python3.6/dist-packages/tensorflow/python/…/libtensorflow_framework.so(+0x72e368)[0x7f7005a368]
*** END MANGLED STACK TRACE ***

*** Begin stack trace ***
tensorflow::CurrentStackTrace[abi:cxx11]()
*** End stack trace ***
Aborted (core dumped)

My environment is:

  • NVIDIA Jetson NANO
  • Jetpack: 4.2 [L4T 32.1.0]
  • CUDA: 10.0.166
  • cuDNN: 7.3.1.28-1+cuda10.0
  • TensorRT: 5.0.6.3-1+cuda10.0
  • Tensorflow: 1.13.1

Creating the TF-TRT graph on the Nano produces no errors. The error occurs when calling sess.run().
How can I solve this? Thank you.

Hi,

We haven’t seen this before.
Would you mind sharing your Python script so we can reproduce this internally?

Thanks.

@AastaLLL Thanks for your reply. Here is the test file. https://www.dropbox.com/s/bubxh9v5bsu0saf/crash.zip?dl=0

Hi AastaLLL,

Any update on this? Is there a workaround?

Thanks

Hi,

We cannot deserialize the .pb file in our environment.
The model file is very large and hard to fit on the Jetson Nano.

Do you run into any issues during deserialization?

Thanks.

Hi AastaLLL,

I don’t have any issues with deserialization. Parsing and importing the graph from the .pb file takes around 20 seconds.
BTW, I added a 6 GB swap file on my Jetson Nano.

Thanks

Hi,

I suspect the swap memory is causing this issue.

Please note that swap memory cannot be used as GPU memory.
This would also explain why the failure is intermittent.

If the swap is used only for CPU allocations, the application works well.
The app fails when swap would have to serve as GPU memory.
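
To check whether a given run is in this situation, one option is to look at how much physical RAM is actually available just before calling sess.run(). Below is a minimal sketch that parses the text of /proc/meminfo (a standard Linux file). The helper names and the `needed_mb` threshold are hypothetical and only for illustration; `MemAvailable` is a kernel estimate, so treat the result as a hint rather than a guarantee.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for sized entries)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Key: value" shape
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_report(info):
    """Summarise available physical RAM and swap currently in use, in MB."""
    swap_used_kb = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return {
        "ram_available_mb": info.get("MemAvailable", 0) // 1024,
        "swap_used_mb": swap_used_kb // 1024,
    }


def fits_in_ram(info, needed_mb):
    """Hypothetical check: is the estimated available RAM at least needed_mb?

    needed_mb is an assumption supplied by the caller (for example, a rough
    estimate of what the model needs on the GPU); swap is deliberately ignored
    because it cannot back GPU memory.
    """
    return memory_report(info)["ram_available_mb"] >= needed_mb
```

On the Nano this could be called as `memory_report(parse_meminfo(open("/proc/meminfo").read()))` right before sess.run(). A low `ram_available_mb` together with a large `swap_used_mb` would be consistent with the explanation above.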

We recommend running inference with pure TensorRT, which saves a lot of memory.
Thanks.

Hi AastaLLL,

Got it. Thanks for the explanation. I will try pure TensorRT.

Thanks.