tensorNet::LoadNetwork Internal error: CASK: all shaders must have unique names

Hello,

I am using JetPack 3.2.1 on a TX2, with all the related libraries, and the jetson-inference project to run inference with a custom model.

My application runs one thread per camera and starts one CUDA engine per camera, on the corresponding thread.
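
Roughly, the structure is like this (a simplified sketch with placeholder names, not the real application code):

#include <string>
#include <thread>
#include <vector>

// Placeholder worker: each camera thread builds its own engine through
// jetson-inference and then runs inference on frames from that camera.
void cameraWorker(const std::string& modelPath)
{
    // net = SomeNet::Create(modelPath, maxBatchSize);  // the create() call that sometimes aborts
    // while (running) { net->Process(frame); }
    (void)modelPath;
}

int main()
{
    std::vector<std::thread> workers;
    for (int cam = 0; cam < 4; ++cam)               // one thread per camera
        workers.emplace_back(cameraWorker, std::string("my_model"));

    for (auto& w : workers)
        w.join();
    return 0;
}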

The problem is that sporadically, in the create() function (the stack trace later points to tensorNet::LoadNetwork), the program throws an exception and aborts. If the create() function finishes fine, the program starts inference and everything goes well.

The complete text of the exception is as follows:

tensorNet::LoadNetwork shaderList_impl.h:50: void cask::ShaderList<ShaderType, OperationType>::sortHandles() const [with ShaderType = cask::ConvolutionShader; OperationType = cask::Convolution]: Assertion `((*i)->handle != (*prevI)->handle) && "Internal error: CASK: all shaders must have unique names"' failed

One detail that could be relevant is that I am cross compiling. Should I add some flag to nvcc?

This is an extract from the CMakeLists.txt file:

# setup CUDA
find_package(CUDA)

set(CUDA_GEN_CODE
  "arch=compute_${CUDA_CAP},code=sm_${CUDA_CAP}"
)

set(
  CUDA_NVCC_FLAGS
  ${CUDA_NVCC_FLAGS}; -O3 -gencode ${CUDA_GEN_CODE}
  -Xlinker --unresolved-symbols=ignore-in-shared-libs
)


where I added the unresolved-symbols option because I saw in a topic on the NVIDIA forum that it was needed for cross compiling.

Please, any insight will be helpful.

Hi pellejero.nicolas, I’m not familiar with this error, but do you ever get it when using one of the built-in models, or only your custom model?

Also, I haven’t cross-compiled the project before - do you ever get the error when the project is compiled natively?

BTW here’s another thread I found with this error: https://devtalk.nvidia.com/default/topic/1044659/tensorrt/internal-error-cask-all-shaders-must-have-unique-names/

Not sure if it has a similar cause or not, but it's mentioned that “the error is usually due to dependency issues among different libraries”. In your case, perhaps that is related to the cross compiling, or to how the engine was generated, depending on the type of custom model you are using.

Hi dusty,

It has been a long time since I tried the built-in models, so I will check this.

Also, I think the issue could be related to cross compiling, so I will double-check what happens if the project is compiled natively.

Yes, I have seen that thread, and I don't rule out some incompatibility among libraries, but I am using a complete JetPack without mixing any versions.

So, I will check the built-in models and the custom model compiled natively, and get back to you.

Thanks

Hi pellejero.nicolas,

Have you managed to resolve the issue? Are there any results you can share?

The issue is not resolved. Regarding the tests dusty suggested, I tried the network compiled natively, and in the long run (after many tries) the problem appeared again.

I couldn’t test the example models of jetson-inference, because we forked the project and it is now far from its origins…

But I think it is more related to multithreading, now that cross compilation has been ruled out.

Hi,

You can find jetson-inference for rel-28.2.1 (JetPack 3.2.1) here:
https://github.com/dusty-nv/jetson-inference/tree/L4T-R28.2

We want to reproduce this issue in our environment.
Could we reproduce it by running several jetson-inference apps at the same time?

Thanks.

Hi,

Also, based on this comment:
https://devtalk.nvidia.com/default/topic/1044659/tensorrt/internal-error-cask-all-shaders-must-have-unique-names/post/5315706/#5315706

Have you tried adding a mutex mechanism before deserializing the TensorRT engine?
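
For example, something along these lines (just a rough sketch, not the exact jetson-inference code; the three-argument deserializeCudaEngine() is the TensorRT 3.x signature from JetPack 3.2.1):

#include <mutex>
#include <NvInfer.h>

// One mutex shared by every thread that loads an engine.
static std::mutex gEngineLoadMutex;

nvinfer1::ICudaEngine* deserializeLocked(nvinfer1::IRuntime* runtime,
                                         const void* blob, size_t size)
{
    // Only one thread deserializes an engine at a time.
    std::lock_guard<std::mutex> lock(gEngineLoadMutex);
    return runtime->deserializeCudaEngine(blob, size, nullptr);
}
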
Thanks.

Hi AastaLLL, I use jetson-inference as a library, and I create one thread per camera and one inference engine per thread. So, in the application code, when I call the create function I use a lock, like this:

// Create a new one (this may take a while)
this->_mutex.lock();
this->_net = convGAN::Create(this->_model, this->_max_batch_size);
this->_mutex.unlock();

So, convGAN::Create calls convGAN::init, which calls tensorNet::LoadNetwork, which does the work. The lock is placed higher than it needs to be, but I think it should work.
However, I tested it and the problem persists.

What do you think? Should I change the lock placement?

My other idea was to try sharing the inference engine between threads and locking that class. What do you think about this one?
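
Roughly what I have in mind for that second option (a sketch with placeholder types, not tested):

#include <mutex>

// Placeholder interface standing in for the real network class;
// only here to illustrate the locking idea.
struct InferenceNet
{
    virtual bool Process(float* frame) = 0;
    virtual ~InferenceNet() {}
};

// One network instance shared by all camera threads, with every
// inference call serialized by the same mutex.
class SharedNet
{
public:
    explicit SharedNet(InferenceNet* net) : _net(net) {}

    bool Process(float* frame)
    {
        std::lock_guard<std::mutex> lock(_mutex);   // one inference at a time
        return _net->Process(frame);
    }

private:
    InferenceNet* _net;
    std::mutex    _mutex;
};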

Thanks.

Good news: I undid the previous lock, which was at a very high level, and added another one in the two functions mentioned in:

https://devtalk.nvidia.com/default/topic/1044659/tensorrt/internal-error-cask-all-shaders-must-have-unique-names/post/5315706/#5315706

And the problem seems to be gone. So far I have restarted the network more than 200 times and it hasn't happened again.
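
In case it helps someone else hitting the same assertion, the shape of the fix is roughly this (a sketch, not our exact code; engine deserialization and execution-context creation are shown as assumed examples of the guarded calls, both protected by the same process-wide mutex instead of locking around the whole Create() call):

#include <mutex>
#include <NvInfer.h>

// One process-wide mutex shared by all camera threads.
static std::mutex gTRTMutex;

nvinfer1::ICudaEngine* deserializeGuarded(nvinfer1::IRuntime* runtime,
                                          const void* blob, size_t size)
{
    std::lock_guard<std::mutex> lock(gTRTMutex);    // guard engine deserialization
    return runtime->deserializeCudaEngine(blob, size, nullptr);
}

nvinfer1::IExecutionContext* createContextGuarded(nvinfer1::ICudaEngine* engine)
{
    std::lock_guard<std::mutex> lock(gTRTMutex);    // same mutex for context creation
    return engine->createExecutionContext();
}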