INTERNAL_ERROR: Assertion failed: eglCreateStreamKHR != nullptr

I’m following this guide: https://devblogs.nvidia.com/speed-up-inference-tensorrt/

My goal is to run the TensorRT C++ sample code on a Xavier as a proof of concept, but I’m hitting this assertion: INTERNAL_ERROR: Assertion failed: eglCreateStreamKHR != nullptr

I’m running the following command from the guide linked above:

./simpleOnnx_1 resnet50v2/resnet50v2.onnx resnet50v2/test_data_set_0/input_0.pb

Looking at the source code for this example, the assertion fires inside IBuilder::buildCudaEngine():

ICudaEngine* createCudaEngine(string const& onnxModelPath, int batchSize)
{
    unique_ptr<IBuilder, Destroy<IBuilder>> builder{createInferBuilder(gLogger)};
    unique_ptr<INetworkDefinition, Destroy<INetworkDefinition>> network{builder->createNetwork()};
    unique_ptr<nvonnxparser::IParser, Destroy<nvonnxparser::IParser>> parser{nvonnxparser::createParser(*network, gLogger)};

    if (!parser->parseFromFile(onnxModelPath.c_str(), static_cast<int>(ILogger::Severity::kINFO)))
    {
        cout << "ERROR: could not parse input engine." << endl;
        return nullptr;
    }

    return builder->buildCudaEngine(*network); // Build and return TensorRT engine.
}

Here’s the console output:

<snip>
INFO: Fusing (Unnamed Layer* 164) [Convolution] with (Unnamed Layer* 166) [Activation]
INFO: Fusing (Unnamed Layer* 167) [Convolution] with (Unnamed Layer* 168) [ElementWise]
INFO: Fusing (Unnamed Layer* 169) [Scale] with (Unnamed Layer* 170) [Activation]
INFO: Fusing (Unnamed Layer* 172) [Shuffle] with (Unnamed Layer* 173) [Shuffle]
INFO: After vertical fusions: 75 layers
INFO: After swap: 75 layers
INFO: After final dead-layer removal: 75 layers
INFO: After tensor merging: 75 layers
INFO: After concat removal: 75 layers
INFO: Graph construction and optimization completed in 0.0620393 seconds.
INTERNAL_ERROR: Assertion failed: eglCreateStreamKHR != nullptr
dla/eglUtils.cpp:56
Aborting...

Aborted

I have installed the following on this system:

apt-get install -y cuda-toolkit-10-0 libgomp1 libfreeimage-dev libopenmpi-dev openmpi-bin
dpkg -i libcudnn7_7.3.1.20-1+cuda10.0_arm64.deb
dpkg -i libcudnn7-dev_7.3.1.20-1+cuda10.0_arm64.deb

dpkg -i libnvinfer5_5.0.3-1+cuda10.0_arm64.deb
dpkg -i libnvinfer-dev_5.0.3-1+cuda10.0_arm64.deb
dpkg -i libnvinfer-samples_5.0.3-1+cuda10.0_all.deb
dpkg -i libgie-dev_5.0.3-1+cuda10.0_all.deb
dpkg -i tensorrt_5.0.3.2-1+cuda10.0_arm64.deb

dpkg -i libopencv_3.3.1_arm64.deb
dpkg -i libopencv-dev_3.3.1_arm64.deb
dpkg -i libopencv-python_3.3.1_arm64.deb

What is the reason for this assertion? Is there a way I can work around it? Thanks.

I also hit the same issue using the trtexec utility:

$ /usr/src/tensorrt/bin/trtexec --onnx=./resnet50v2/resnet50v2.onnx --input=./resnet50v2/test_data_set_0/input_0.pb 
onnx: ./resnet50v2/resnet50v2.onnx
input: ./resnet50v2/test_data_set_0/input_0.pb
----------------------------------------------------------------
Input filename:   ./resnet50v2/resnet50v2.onnx
ONNX IR version:  0.0.3
Opset version:    7
Producer name:    
Producer version: 
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
 ----- Parsing of ONNX model ./resnet50v2/resnet50v2.onnx is Done ---- 
Assertion failed: eglCreateStreamKHR != nullptr
dla/eglUtils.cpp:56
Aborting...

Aborted (core dumped)

Here’s some more information…

I attempted to run the following:

/usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/mnist/mnist.prototxt --output=prob --useDLACore=0 --fp16 --allowGPUFallback

This produced a segmentation fault rather than an assertion failure. I reran it in the debugger, and here’s the backtrace:

#0  0x0000007f94254ae8 in ?? () from /usr/lib/aarch64-linux-gnu/tegra/libnvdla_compiler.so
#1  0x0000007fb09b5640 in nvinfer1::utility::dla::TmpWisdom::compile(int, int) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#2  0x0000007fb09c1840 in nvinfer1::builder::dla::validateGraphNode(std::unique_ptr<nvinfer1::builder::Node, std::default_delete<nvinfer1::builder::Node> > const&) ()
   from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#3  0x0000007fb09312ac in nvinfer1::builder::createForeignNodes(nvinfer1::builder::Graph&, nvinfer1::builder::ForeignNode* (*)(nvinfer1::Backend, std::string const&), nvinfer1::CudaEngineBuildConfig const&) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#4  0x0000007fb097d504 in nvinfer1::builder::applyGenericOptimizations(nvinfer1::builder::Graph&, nvinfer1::CpuMemoryGroup&, nvinfer1::CudaEngineBuildConfig const&) ()
   from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#5  0x0000007fb094542c in nvinfer1::builder::buildEngine(nvinfer1::CudaEngineBuildConfig&, nvinfer1::rt::HardwareContext const&, nvinfer1::Network const&) ()
   from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#6  0x0000007fb09b02ec in nvinfer1::builder::Builder::buildCudaEngine(nvinfer1::INetworkDefinition&) () from /usr/lib/aarch64-linux-gnu/libnvinfer.so.5
#7  0x0000000000403958 in caffeToTRTModel() ()
#8  0x0000000000405bc4 in createEngine() ()
#9  0x0000000000406088 in main ()

Hi,

Did you manually install the cuDNN and TensorRT packages for Xavier?
If so, could you try using JetPack to set up the environment?
https://developer.nvidia.com/embedded/downloads#?search=JetPack%204.1.1

This should keep you from running into dependency issues.
Thanks.

I just used JetPack to set up the environment, and I was able to run the trtexec command mentioned in my previous post without any crash.

Now that we know it works when the system is installed through JetPack, can you help me understand what could be missing from my custom environment that is causing this? I mentioned which deb packages I am installing in the initial post. I am using the same packages used by JetPack.

I will report back if I manage to find the discrepancy myself.

Hi behrooze.sirang,

Could you take a look at logs under [JetPack Installation Directory]/_installer/logs/Xavier/ as a reference?

I took a look at the logs and I’m basically running the same commands to install the various .deb packages.

After bringing up an Ubuntu 18.04 environment in a Docker container, I still got the same failure.

I then analyzed the dynamic loader’s behavior with the LD_DEBUG=all option and found the following discrepancy.

I ran the following command in both environments to dump the LD_DEBUG contents to a file for analysis…

LD_DEBUG=all /usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/mnist/mnist.prototxt --output=prob --useDLACore=0 --fp16 --allowGPUFallback 2> ~/ld_debug.out

In the JetPack installed environment the following was being loaded:

trying file=/usr/lib/aarch64-linux-gnu/tegra-egl/libEGL_nvidia.so.0

In my custom 18.04 environment, the following was being loaded:

trying file=/usr/lib/aarch64-linux-gnu/libEGL_mesa.so.0
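
For anyone repeating this comparison, here is a minimal sketch of how the relevant lines could be pulled out of each ld_debug.out file. The helper name egl_libs_tried is hypothetical, and note that a "trying file=" line only records a path the loader attempted, so a listed path is a candidate rather than proof of a successful load.

```shell
# egl_libs_tried: hypothetical helper, not part of any NVIDIA or glibc tool.
# Prints each distinct libEGL_* path that appears on a "trying file=" line
# in a log captured with: LD_DEBUG=all <command> 2> <log file>
egl_libs_tried() {
    local log_file="$1"
    # Keep only the path part of lines such as
    #   "  1234:   trying file=/usr/lib/aarch64-linux-gnu/libEGL_mesa.so.0"
    sed -n 's/.*trying file=\([^[:space:]]*libEGL_[^[:space:]]*\).*/\1/p' "$log_file" \
        | sort -u
}

# Example (run in both environments and diff the results):
#   egl_libs_tried ~/ld_debug.out
```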

I tracked this down to the vendor-neutral dispatch library (glvnd): I needed to do the following to register NVIDIA as an EGL vendor.

cd /usr/share/glvnd/egl_vendor.d
ln -s ../../../../usr/lib/aarch64-linux-gnu/tegra-egl/nvidia.json 10_nvidia.json
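
To check the result of a change like this, it can help to list which vendor files glvnd will see and which library each one names. The following is a rough sketch, assuming the vendor files use the usual "library_path" key on a single line; list_egl_vendors is a hypothetical helper name, and the sed expression is a shortcut rather than a real JSON parser.

```shell
# list_egl_vendors: hypothetical helper for inspecting an egl_vendor.d directory.
# Prints "<json file> -> <library_path>" for each vendor file, in the shell's
# sorted glob order (so 10_nvidia.json is listed before 50_mesa.json).
list_egl_vendors() {
    local dir="${1:-/usr/share/glvnd/egl_vendor.d}"
    local f lib
    for f in "$dir"/*.json; do
        # Skips the unexpanded glob (empty directory) and dangling symlinks.
        [ -e "$f" ] || continue
        # Assumption: "library_path" and its value sit on the same line.
        lib=$(sed -n 's/.*"library_path"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$f")
        printf '%s -> %s\n' "$(basename "$f")" "$lib"
    done
}

# Example:
#   list_egl_vendors /usr/share/glvnd/egl_vendor.d
```

If the NVIDIA entry is missing from the output, or the symlink is dangling, the loader has no way to pick the NVIDIA EGL library.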

And now I can run the trtexec command in my custom 18.04 environment…

Now I am trying to get a working solution on Ubuntu Xenial (16.04), and I still have not found one. The glvnd library in the JetPack environment comes from the libegl1 and libglvnd0 packages, which are only available for Ubuntu Bionic (18.04). I found that NVIDIA has a glvnd library (https://github.com/NVIDIA/libglvnd) and installed it, but I am still getting the same assertion. I assume the problem is still roughly the same (the NVIDIA vendor-specific libraries are not being loaded properly), but I am having trouble verifying that.

Do you have any next troubleshooting steps?

You could try replacing “/usr/lib/aarch64-linux-gnu/libEGL_mesa.so.0” with a copy of “/usr/lib/aarch64-linux-gnu/tegra-egl/libEGL_nvidia.so.0”. Mesa is incompatible with the NVIDIA hardware-accelerated libraries in a few places, and in some cases a package update can replace the NVIDIA version with the Mesa one. If this is the case, ssh access and the serial console should still work (or even CTRL-ALT-F2). Save the original Mesa file under a new name first (I tend to gzip a file to change its name, and gunzip it later to revert).

Also make sure every file shows “OK” in the output of “sha1sum -c /etc/nv_tegra_release”.

I didn’t figure out how to configure NVIDIA’s glvnd library. The readme mentioned the following:

“In order to find the available vendor libraries, each vendor provides a JSON file in a well-known directory, similar to how Vulkan ICD’s are loaded.”

I didn’t find actual, detailed instructions related to the above quote.
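
For reference, here is a sketch of what such a vendor JSON file looks like and one way to point glvnd at it without touching system directories. It assumes the libglvnd EGL ICD file layout and the __EGL_VENDOR_LIBRARY_FILENAMES environment variable as I understand them, so it is worth checking against the documentation shipped with the libglvnd version in use. A libglvnd built from source may also search a different prefix than /usr/share. The helper name write_egl_vendor_json is hypothetical.

```shell
# write_egl_vendor_json: hypothetical helper that writes a glvnd EGL vendor
# file naming the given vendor library.
# Assumption: the ICD layout below matches the installed libglvnd version.
write_egl_vendor_json() {
    local out_file="$1" library="$2"
    cat > "$out_file" <<EOF
{
    "file_format_version" : "1.0.0",
    "ICD" : {
        "library_path" : "$library"
    }
}
EOF
}

# Example: try the NVIDIA vendor library for a single run only.
#   write_egl_vendor_json /tmp/10_nvidia.json libEGL_nvidia.so.0
#   __EGL_VENDOR_LIBRARY_FILENAMES=/tmp/10_nvidia.json \
#       /usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/mnist/mnist.prototxt --output=prob
```

Because the library path has no directory component here, the dynamic loader still has to find libEGL_nvidia.so.0 on its normal search path.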

Anyway, I managed to get something functional in my 16.04 environment. I ended up installing a few Bionic packages related to glvnd and doing away with NVIDIA’s glvnd, and I managed to get it to run… It’s not the most kosher solution, but it’ll do for me for now.

#!/bin/bash

WORK_DIR=/tmp/glvnd

mkdir -p $WORK_DIR
cd $WORK_DIR

LIBGLVND0_URL=http://ports.ubuntu.com/ubuntu-ports/pool/main/libg/libglvnd/libglvnd0_1.0.0-2ubuntu2.2_arm64.deb
LIBEGL1MESA_URL=http://ports.ubuntu.com/ubuntu-ports/pool/main/m/mesa/libegl1-mesa_18.0.5-0ubuntu0~18.04.1_arm64.deb
LIBEGL1_URL=http://ports.ubuntu.com/ubuntu-ports/pool/main/libg/libglvnd/libegl1_1.0.0-2ubuntu2.2_arm64.deb

wget $LIBGLVND0_URL
wget $LIBEGL1MESA_URL
wget $LIBEGL1_URL

dpkg --install $(basename $LIBGLVND0_URL)
dpkg --unpack  $(basename $LIBEGL1MESA_URL)
dpkg --unpack  $(basename $LIBEGL1_URL)

ldconfig

rm -r $WORK_DIR

I have the same problem when running my script inside a Docker container on a Jetson Xavier.

I am able to run the script outside the Docker container, and I am also able to run it inside the container without TensorRT. But when I try to run the script inside Docker with TensorRT, I get:

2019-06-19 20:02:53.976468: F tensorflow/contrib/tensorrt/log/trt_logger.cc:42] DefaultLogger Assertion failed: eglCreateStreamKHR != nullptr
dla/eglUtils.cpp:61
Aborting...

Please help, thank you.

UPDATE:
I ran

dpkg -l | grep TensorRT

inside the container and nothing shows up, which suggests that CUDA was passed through to the container but TensorRT was not.
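
A package query is only one way to check. Another is to ask the dynamic linker cache whether the TensorRT libraries are visible at all. The sketch below filters `ldconfig -p` style output; find_in_ldcache is a hypothetical helper name, and libnvinfer is taken from the package and library names mentioned earlier in this thread.

```shell
# find_in_ldcache: hypothetical helper. Reads `ldconfig -p` style output on
# stdin and prints the resolved path of every library whose name starts with
# the given prefix. Prints nothing if there is no match.
find_in_ldcache() {
    local name="$1"
    # Cache lines look like:
    #   "libnvinfer.so.5 (libc6,AArch64) => /usr/lib/aarch64-linux-gnu/libnvinfer.so.5"
    awk -v n="$name" 'index($1, n) == 1 { sub(/.* => /, ""); print }'
}

# Example (run inside the container):
#   ldconfig -p | find_in_ldcache libnvinfer
# A case-insensitive package query may also catch more than the original one:
#   dpkg -l | grep -i -E 'nvinfer|tensorrt'
```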

Hi,

Do you have TensorRT installed inside your container?

You will need an aarch64 package, and you will also need to enable some hardware configuration.
Here is a good example for your reference:
https://github.com/Technica-Corporation/Tegra-Docker

Thanks.