Hello,
I use the GPU Operator to deploy OpenGL applications in a Kubernetes cluster. It works fine: my pod (a basic glxgears) starts, and when I run nvidia-smi from the nvidia-driver pod I can see that glxgears is using the GPU (and the metrics are consistent with hardware acceleration being used).
However, when I look inside the container, some things are unclear.
When I inspect the content of the container on my PC, i.e. without the GPU Operator, I can see that installing the glx-utils package brings in some default OpenGL libraries (the Mesa implementation):
[root@9417bac3962a lib64]# ls -al /usr/lib64/ |grep libGL
lrwxrwxrwx 1 root root 14 Nov 11 2022 libGL.so.1 -> libGL.so.1.7.0
-rwxr-xr-x 1 root root 558944 Nov 11 2022 libGL.so.1.7.0
lrwxrwxrwx 1 root root 15 Nov 11 2022 libGLX.so.0 -> libGLX.so.0.0.0
-rwxr-xr-x 1 root root 141256 Nov 11 2022 libGLX.so.0.0.0
lrwxrwxrwx 1 root root 20 Nov 11 2022 libGLX_mesa.so.0 -> libGLX_mesa.so.0.0.0
-rwxr-xr-x 1 root root 502032 Nov 11 2022 libGLX_mesa.so.0.0.0
lrwxrwxrwx 1 root root 27 Nov 11 2022 libGLX_system.so.0 -> /usr/lib64/libGLX_mesa.so.0
lrwxrwxrwx 1 root root 22 Nov 11 2022 libGLdispatch.so.0 -> libGLdispatch.so.0.0.0
-rwxr-xr-x 1 root root 769048 Nov 11 2022 libGLdispatch.so.0.0.0
When I run the same command in the container deployed in my cluster with the GPU Operator, I get the following result:
[root@glxgears-glxgears-deployment-694bc49445-87kzr lib64]# ls -al /usr/lib64/ | grep libGL
lrwxrwxrwx. 1 root root 14 Nov 11 2022 libGL.so.1 -> libGL.so.1.7.0
-rwxr-xr-x. 1 root root 558944 Nov 11 2022 libGL.so.1.7.0
lrwxrwxrwx. 1 root root 33 Sep 9 15:40 libGLESv1_CM_nvidia.so.1 -> libGLESv1_CM_nvidia.so.550.107.02
-rwxr-xr-x. 1 root root 68000 Sep 6 13:36 libGLESv1_CM_nvidia.so.550.107.02
lrwxrwxrwx. 1 root root 30 Sep 9 15:40 libGLESv2_nvidia.so.2 -> libGLESv2_nvidia.so.550.107.02
-rwxr-xr-x. 1 root root 117144 Sep 6 13:36 libGLESv2_nvidia.so.550.107.02
lrwxrwxrwx. 1 root root 15 Nov 11 2022 libGLX.so.0 -> libGLX.so.0.0.0
-rwxr-xr-x. 1 root root 141256 Nov 11 2022 libGLX.so.0.0.0
lrwxrwxrwx. 1 root root 27 Sep 9 15:40 libGLX_indirect.so.0 -> libGLX_nvidia.so.550.107.02
lrwxrwxrwx. 1 root root 20 Nov 11 2022 libGLX_mesa.so.0 -> libGLX_mesa.so.0.0.0
-rwxr-xr-x. 1 root root 502032 Nov 11 2022 libGLX_mesa.so.0.0.0
lrwxrwxrwx. 1 root root 27 Sep 9 15:40 libGLX_nvidia.so.0 -> libGLX_nvidia.so.550.107.02
-rwxr-xr-x. 1 root root 1203776 Sep 6 13:36 libGLX_nvidia.so.550.107.02
lrwxrwxrwx. 1 root root 27 Nov 11 2022 libGLX_system.so.0 -> /usr/lib64/libGLX_mesa.so.0
lrwxrwxrwx. 1 root root 22 Nov 11 2022 libGLdispatch.so.0 -> libGLdispatch.so.0.0.0
-rwxr-xr-x. 1 root root 769048 Nov 11 2022 libGLdispatch.so.0.0.0
First observations:
- some new libraries are present (I guess mounted by the nvidia-container-toolkit)
- the libraries that were already present are unchanged (size is identical)
If I now look at the libraries present in the same container, but in the volume shared with the host where the driver libraries are installed:
[root@glxgears-glxgears-deployment-694bc49445-87kzr lib64]# ls -al /run/driver/lib/x86_64-linux-gnu/ | grep libGL
lrwxrwxrwx. 1 root root 10 Sep 6 13:36 libGL.so -> libGL.so.1
lrwxrwxrwx. 1 root root 14 Sep 6 13:36 libGL.so.1 -> libGL.so.1.7.0
-rwxr-xr-x. 1 root root 649416 Sep 6 13:36 libGL.so.1.7.0
lrwxrwxrwx. 1 root root 17 Sep 6 13:36 libGLESv1_CM.so -> libGLESv1_CM.so.1
lrwxrwxrwx. 1 root root 21 Sep 6 13:36 libGLESv1_CM.so.1 -> libGLESv1_CM.so.1.2.0
-rwxr-xr-x. 1 root root 43208 Sep 6 13:36 libGLESv1_CM.so.1.2.0
lrwxrwxrwx. 1 root root 33 Sep 6 13:36 libGLESv1_CM_nvidia.so.1 -> libGLESv1_CM_nvidia.so.550.107.02
-rwxr-xr-x. 1 root root 68000 Sep 6 13:36 libGLESv1_CM_nvidia.so.550.107.02
lrwxrwxrwx. 1 root root 14 Sep 6 13:36 libGLESv2.so -> libGLESv2.so.2
lrwxrwxrwx. 1 root root 18 Sep 6 13:36 libGLESv2.so.2 -> libGLESv2.so.2.1.0
-rwxr-xr-x. 1 root root 80064 Sep 6 13:36 libGLESv2.so.2.1.0
lrwxrwxrwx. 1 root root 30 Sep 6 13:36 libGLESv2_nvidia.so.2 -> libGLESv2_nvidia.so.550.107.02
-rwxr-xr-x. 1 root root 117144 Sep 6 13:36 libGLESv2_nvidia.so.550.107.02
lrwxrwxrwx. 1 root root 11 Sep 6 13:36 libGLX.so -> libGLX.so.0
-rwxr-xr-x. 1 root root 137616 Sep 6 13:36 libGLX.so.0
lrwxrwxrwx. 1 root root 27 Sep 6 13:36 libGLX_nvidia.so.0 -> libGLX_nvidia.so.550.107.02
-rwxr-xr-x. 1 root root 1203776 Sep 6 13:36 libGLX_nvidia.so.550.107.02
-rwxr-xr-x. 1 root root 952576 Sep 6 13:36 libGLdispatch.so.0
Notice that for the following libraries:
- libGL
- libGLX
- libGLdispatch
the version present in /usr/lib64 is the one originally installed in the container, not the one mounted by the NVIDIA stack.
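Comparing sizes is only a hint, so a byte-level comparison could confirm this. A minimal sketch, assuming cmp (from diffutils) is available in the container; compare_libs is a hypothetical helper name introduced here for illustration:

```shell
#!/bin/sh
# Report, for each library name, whether the copies found in two
# directories are byte-identical. cmp follows symlinks, so the
# comparison is made on the real files behind e.g. libGL.so.1.
# Usage: compare_libs DIR_A DIR_B NAME...
compare_libs() {
    dir_a=$1
    dir_b=$2
    shift 2
    for name in "$@"; do
        if cmp -s "$dir_a/$name" "$dir_b/$name" 2>/dev/null; then
            echo "$name: identical"
        else
            # cmp exits non-zero both when the files differ and when
            # one of them does not exist.
            echo "$name: differs or missing"
        fi
    done
}

# Example call with the paths from this post:
#   compare_libs /usr/lib64 /run/driver/lib/x86_64-linux-gnu \
#       libGL.so.1 libGLX.so.0 libGLdispatch.so.0
```

Given the sizes listed above (558944 vs 649416 bytes for libGL.so.1.7.0, for instance), the three libraries would be expected to be reported as different.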
Looking at an excerpt of the libraries the application is linked against:
[root@glxgears-glxgears-deployment-694bc49445-87kzr lib64]# ldd /usr/bin/glxgears
libGL.so.1 => /usr/lib64/libGL.so.1 (0x00007fa1142e7000)
libX11.so.6 => /usr/lib64/libX11.so.6 (0x00007fa113c22000)
libGLX.so.0 => /usr/lib64/libGLX.so.0 (0x00007fa11362b000)
libGLdispatch.so.0 => /usr/lib64/libGLdispatch.so.0 (0x00007fa113162000)
We can also see that the application is not linked with the libraries built with my version of the driver.
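Note that ldd only lists the shared objects the binary declares as dependencies; anything the process opens later with dlopen does not show up there. A minimal sketch of a runtime check, assuming glxgears is running in the same container and that pidof is installed (mapped_gl_libs is a hypothetical helper name, not part of any NVIDIA tooling):

```shell
#!/bin/sh
# List the unique GL-related shared objects mapped into a process.
# $1 is a maps file such as /proc/<pid>/maps.
mapped_gl_libs() {
    # In a maps line, field 6 is the backing file path (when there is one).
    # Keep paths that look like GL/EGL libraries or NVIDIA libraries.
    awk '$6 ~ /lib(GL|EGL)|nvidia/ { print $6 }' "$1" | LC_ALL=C sort -u
}

# Example call for a running glxgears:
#   mapped_gl_libs "/proc/$(pidof glxgears)/maps"
```

If an NVIDIA library such as libGLX_nvidia shows up in that output while being absent from the ldd listing, it was loaded at runtime rather than linked directly.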
The setup of my X server is much the same: it is started in another pod, and I noticed exactly the same thing there.
So my question is pretty basic: how can this work? It seems that my application loads the Mesa versions of libGL, libGLX and libGLdispatch, yet the display is correctly rendered by the GPU. Am I missing something? It would be great if I could find in-depth documentation of these mechanisms.
In case it is relevant, I'm using the following versions:
GPU Operator : 23.6.1
Container toolkit : 1.13.4-ubuntu20.04
Thanks!
Regards,