Unable to Initialize EGL

Hello

I am trying to run the PoseCNN algorithm on an RTX 4090 based system with NVIDIA driver 525.85 (installed using the .run file), using the Docker image cuda:11.7.0-devel-ubuntu20.04 from Docker Hub.

When "python3 setup.py install" is run from the Dockerfile during image creation, the build fails:

[13/14] RUN cd /deps/PoseCNN/lib/layers && python3 setup.py install && cd /deps/PoseCNN/lib/utils && python3 setup.py build_ext --inplace && cd /deps/PoseCNN/ycb_render && python3 setup.py develop && cd …/ && ./build.sh:
#0 1.346 No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
#0 1.351 running install
#0 1.389 running bdist_egg…

#0 1.414 Traceback (most recent call last):
#0 1.414 File "setup.py", line 8, in <module>
#0 1.414 setup(
#0 1.414 File "/usr/lib/python3/dist-packages/setuptools/__init__.py", line 144, in setup
#0 1.414 return distutils.core.setup(**attrs)
#0 1.414 File "/usr/lib/python3.8/distutils/core.py", line 148, in setup
#0 1.414 dist.run_commands()
#0 1.414 File "/usr/lib/python3.8/distutils/dist.py", line 966, in run_commands
#0 1.414 self.run_command(cmd)
#0 1.414 File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
#0 1.414 cmd_obj.run()
#0 1.414 File "/usr/lib/python3/dist-packages/setuptools/command/install.py", line 67, in run
#0 1.414 self.do_egg_install()
#0 1.414 File "/usr/lib/python3/dist-packages/setuptools/command/install.py", line 109, in do_egg_install
#0 1.414 self.run_command('bdist_egg')
#0 1.414 File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
#0 1.414 self.distribution.run_command(command)
#0 1.414 File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
#0 1.414 cmd_obj.run()
#0 1.414 File "/usr/lib/python3/dist-packages/setuptools/command/bdist_egg.py", line 172, in run
#0 1.414 cmd = self.call_command('install_lib', warn_dir=0)
#0 1.414 File "/usr/lib/python3/dist-packages/setuptools/command/bdist_egg.py", line 158, in call_command
#0 1.414 self.run_command(cmdname)
#0 1.414 File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
#0 1.414 self.distribution.run_command(command)
#0 1.414 File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
#0 1.414 cmd_obj.run()
#0 1.414 File "/usr/lib/python3/dist-packages/setuptools/command/install_lib.py", line 23, in run
#0 1.414 self.build()
#0 1.414 File "/usr/lib/python3.8/distutils/command/install_lib.py", line 109, in build
#0 1.414 self.run_command('build_ext')
#0 1.414 File "/usr/lib/python3.8/distutils/cmd.py", line 313, in run_command
#0 1.415 self.distribution.run_command(command)
#0 1.415 File "/usr/lib/python3.8/distutils/dist.py", line 985, in run_command
#0 1.415 cmd_obj.run()
#0 1.415 File "/usr/lib/python3/dist-packages/setuptools/command/build_ext.py", line 87, in run
#0 1.415 _build_ext.run(self)
#0 1.415 File "/usr/local/lib/python3.8/dist-packages/Cython/Distutils/old_build_ext.py", line 186, in run
#0 1.415 _build_ext.build_ext.run(self)
#0 1.415 File "/usr/lib/python3.8/distutils/command/build_ext.py", line 340, in run
#0 1.415 self.build_extensions()
#0 1.415 File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 843, in build_extensions
#0 1.415 build_ext.build_extensions(self)
#0 1.415 File "/usr/local/lib/python3.8/dist-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
#0 1.415 _build_ext.build_ext.build_extensions(self)
#0 1.415 File "/usr/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
#0 1.415 self._build_extensions_serial()
#0 1.415 File "/usr/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
#0 1.415 self.build_extension(ext)
#0 1.415 File "/usr/lib/python3/dist-packages/setuptools/command/build_ext.py", line 208, in build_extension
#0 1.415 _build_ext.build_extension(self, ext)
#0 1.415 File "/usr/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
#0 1.415 objects = self.compiler.compile(sources,
#0 1.415 File "/usr/lib/python3.8/distutils/ccompiler.py", line 574, in compile
#0 1.415 self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
#0 1.415 File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 581, in unix_wrap_single_compile
#0 1.415 cflags = unix_cuda_flags(cflags)
#0 1.415 File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 548, in unix_cuda_flags
#0 1.415 cflags + _get_cuda_arch_flags(cflags))
#0 1.415 File "/usr/local/lib/python3.8/dist-packages/torch/utils/cpp_extension.py", line 1780, in _get_cuda_arch_flags
#0 1.416 arch_list[-1] += '+PTX'
#0 1.416 IndexError: list index out of range

However, entering the same command inside a running container, instead of during the image build, works, and I get no CUDA error.
But when the code then proceeds, I face this error:

Let's use 2 GPUs! # That means it is detecting 2 GPUs in the system
loading 3D models
libEGL warning: DRI2: failed to create dri screen
libEGL warning: DRI2: failed to create dri screen
Unable to initialize EGL
Command '['/deps/PoseCNN/tools/…/ycb_render/build/test_device', '0']' returned non-zero exit status 1.
libEGL warning: DRI2: failed to create dri screen
libEGL warning: DRI2: failed to create dri screen
Unable to initialize EGL
Command '['/deps/PoseCNN/tools/…/ycb_render/build/test_device', '1']' returned non-zero exit status 1.
Traceback (most recent call last):
File "./tools/train_net.py", line 141, in <module>
cfg.renderer = YCBRenderer(width=cfg.TRAIN.SYN_WIDTH, height=cfg.TRAIN.SYN_HEIGHT, render_marker=False)
File "/deps/PoseCNN/tools/…/ycb_render/ycb_renderer.py", line 88, in __init__
self.r = CppYCBRenderer.CppYCBRenderer(width, height, get_available_devices()[gpu_id])
IndexError: list index out of range

This means it is not able to identify devices during the build stage of the project: the get_available_devices() function does not detect the GPUs.
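
In case it helps, here is the kind of check I can run inside the container to see whether the NVIDIA EGL pieces are present at all (the paths below are just the usual libglvnd locations, so treat this as a sketch):

ls /usr/share/glvnd/egl_vendor.d/        # expect 10_nvidia.json in a GL-enabled image
ldconfig -p | grep -i libEGL_nvidia      # NVIDIA's EGL vendor library
echo "$NVIDIA_DRIVER_CAPABILITIES"       # should include "graphics" (or "all") for EGL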

  • Inside the Docker container, the output of nvidia-smi and nvcc -V is:

NVIDIA-SMI 525.85.05 Driver Version: 525.85.05 CUDA Version: 12.0 …

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

  • Even running deviceQuery from the CUDA samples repository passes for all SMs ("50 52 60 61 70 75 80 86").

Please share what could be the problem here. I have tried multiple images and installed multiple libraries, but there still seems to be a problem with CUDA or OpenGL.

Thank you

Hi there dheeraj.singh,

A couple of notes that might help you along.

First of all, the standard CUDA containers like the one you are using do not support EGL and will not support GLX, both of which I think are necessary for the rendering part of the PoseCNN code from NVLabs. That means you should look for the cudagl Docker images. Their development is on hold at the moment, so the latest is CUDA 11.4 on Ubuntu 20.04.
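
As a rough sketch only (please check which tag is actually available on Docker Hub/NGC), switching the base image would look something like this; note that EGL/GLX also need the "graphics" driver capability:

FROM nvidia/cudagl:11.4.2-devel-ubuntu20.04
# EGL/GLX need the "graphics" capability in addition to "compute" and "utility"
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility,graphics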

Regarding CUDA not being found in the first part, that depends on how you built your Docker image. I am not a Docker expert, so I cannot help much there, but rebuilding the image might require additional configuration for the new image to correctly reference CUDA.
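
One approach I have seen mentioned for that (I have not verified it myself) is to make the NVIDIA runtime the default in /etc/docker/daemon.json, so that RUN steps during docker build also go through nvidia-container-runtime and can see CUDA:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}

followed by a restart of the Docker daemon.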

I suppose you know all our resources on Container images?

I hope this helps!

Hello Markus

Thank you very much for replying. I did not realise there are GL-based images. I have now used a cudagl image and the problem seems to be solved, although it needed a reload:

Let's use 2 GPUs!
loading 3D models
libEGL warning: DRI2: failed to create dri screen
libEGL warning: DRI2: failed to create dri screen
Unable to initialize EGL
Command '['/deps/PoseCNN/tools/…/ycb_render/build/test_device', '1']' returned non-zero exit status 1.
libEGL warning: DRI2: failed to create dri screen
libEGL warning: DRI2: failed to create dri screen
Unable to initialize EGL
Command '['/deps/PoseCNN/tools/…/ycb_render/build/test_device', '2']' returned non-zero exit status 1.
number of devices found 4
Loaded EGL 1.5 after reload.

And then the code proceeds.

Regarding the first problem, I found the cause: when the setup.py file runs inside the Dockerfile, the system does not have any information about the CUDA architecture, so we need to specify the compute capability (CC) of the GPU in use. For me that is the RTX 4090, which was not present in the list given in the link provided in the solution of this GitHub issue: cuda does not install · Issue #71 · pytorch/extension-cpp · GitHub
However, I simply picked 8.6 (the most recent CC in that list) and it worked.
So I added this line to the Dockerfile:

ARG TORCH_CUDA_ARCH_LIST="8.6+PTX"
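
For context, the relevant part of my Dockerfile now looks roughly like this (the RUN line is shortened, so treat it as a sketch):

ARG TORCH_CUDA_ARCH_LIST="8.6+PTX"
# torch.utils.cpp_extension reads TORCH_CUDA_ARCH_LIST from the environment at build time,
# so the architecture list is no longer empty even though no GPU is visible during docker build.
RUN cd /deps/PoseCNN/lib/layers && python3 setup.py install

As far as I understand, the +PTX part is also why 8.6 works on the 4090: the driver can JIT-compile the embedded PTX for the newer architecture.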

Thank you for your help!


Great to hear that you addressed this and thank you for sharing your solution!
