OK, it's worse than that.
microk8s has a similar problem with a base image from NVIDIA
Docker is OK (the trailing "broken pipe" message is just vulkaninfo being cut off by `head`):
docker run --rm --gpus 1 nvidia/vulkan:1.2.133-450 vulkaninfo | head
'DISPLAY' environment variable not set... skipping surface info
error: XDG_RUNTIME_DIR not set in the environment.
==========
VULKANINFO
==========
Vulkan Instance Version: 1.2.131
Instance Extensions: count = 19
====================
VK_EXT_acquire_xlib_display : extension revision 1
write /dev/stdout: broken pipe
microk8s is OK(ish). From this input:
microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "k8s.gcr.io/cuda-vector-add:v0.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
I get this output, which says the NVIDIA driver is working, since CUDA has executed kernel code in the microk8s environment:
microk8s kubectl logs cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
but from this:
microk8s kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: awesome
spec:
  ttlSecondsAfterFinished: 900
  template:
    spec:
      restartPolicy: Never
      volumes:
      - name: shared-memory-hack
        emptyDir:
          medium: Memory
      containers:
      - name: omniverse-replicator
        image: nvidia/vulkan:1.2.133-450
        command: [ vulkaninfo ]
        imagePullPolicy: Always
        env:
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 20Gi
            cpu: 4000m
          requests:
            memory: 20Gi
            cpu: 4000m
EOF
I get:
ERROR: [Loader Message] Code 0 : libnvidia-gpucomp.so.545.29.06: cannot open shared object file: No such file or directory
Cannot create Vulkan instance.
This problem is often caused by a faulty installation of the Vulkan driver or attempting to use a GPU that does not support Vulkan.
/root/sdk-build/1.2.131.2/source/Vulkan-Tools/vulkaninfo/vulkaninfo.h:371: failed with ERROR_INCOMPATIBLE_DRIVER
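So Docker's nvidia runtime apparently injects the full set of driver libraries, while the toolkit inside microk8s seems to inject only a subset. One way to confirm that (untested sketch; the pod name "libcheck" is my own) is to run a listing inside a pod with the same image and compare it against the host:

```shell
# Hypothetical one-off pod to list which libnvidia-* files the container
# toolkit actually injects into the container's library directory.
microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: libcheck
spec:
  restartPolicy: Never
  containers:
  - name: libcheck
    image: nvidia/vulkan:1.2.133-450
    command: [ "sh", "-c", "ls /usr/lib/x86_64-linux-gnu | grep nvidia" ]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# Once the pod has completed:
microk8s kubectl logs pod/libcheck
```

If libnvidia-gpucomp.so.545.29.06 is absent from that listing but present on the host, the toolkit's injection is the gap.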
And from this:
microk8s kubectl describe pods/nvidia-container-toolkit-daemonset-${PODNUM} -n gpu-operator-resources
I get the output:
Name:                 nvidia-container-toolkit-daemonset-fqsgx
Namespace:            gpu-operator-resources
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-container-toolkit
Node:                 skylab/10.90.184.153
Start Time:           Thu, 28 Mar 2024 15:53:11 +1030
Labels:               app=nvidia-container-toolkit-daemonset
                      controller-revision-hash=588df5b4d
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: d1a9b8cdad3912d9209a22a6b40e92b0ba9b230330408ca5247b5952d5b2c972
                      cni.projectcalico.org/podIP: 10.1.16.219/32
                      cni.projectcalico.org/podIPs: 10.1.16.219/32
Status:               Running
IP:                   10.1.16.219
IPs:
  IP:           10.1.16.219
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  containerd://bc62340a7e51b140207537ee8059fa4f733bc51c36086146e1e16196303ee766
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:18c9ea88ae06d479e6657b8a4126a8ee3f4300a40c16ddc29fb7ab3763d46005
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 28 Mar 2024 16:15:14 +1030
      Finished:     Thu, 28 Mar 2024 16:15:14 +1030
    Ready:          True
    Restart Count:  1
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /host from host-root (ro)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9fqjh (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  containerd://bcc06b6954fb25750d5d40c038a11120c736cfb0528d9ace429d2745f89706ba
    Image:         nvcr.io/nvidia/k8s/container-toolkit:v1.11.0-ubuntu20.04
    Image ID:      nvcr.io/nvidia/k8s/container-toolkit@sha256:c4ed1e0345f6d3e2eec37601f97495778ae54dc407db08ebead22a580ec96542
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -c
    Args:
      [[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; sleep 5; exec nvidia-toolkit /usr/local/nvidia
    State:          Running
      Started:      Thu, 28 Mar 2024 16:15:15 +1030
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Thu, 28 Mar 2024 15:53:29 +1030
      Finished:     Thu, 28 Mar 2024 16:15:03 +1030
    Ready:          True
    Restart Count:  1
    Environment:
      RUNTIME_ARGS:               --socket /runtime/sock-dir/containerd.sock --config /runtime/config-dir/containerd-template.toml
      CONTAINERD_CONFIG:          /var/snap/microk8s/current/args/containerd-template.toml
      CONTAINERD_SOCKET:          /var/snap/microk8s/common/run/containerd.sock
      CONTAINERD_SET_AS_DEFAULT:  1
      RUNTIME:                    containerd
      CONTAINERD_RUNTIME_CLASS:   nvidia
    Mounts:
      /host from host-root (ro)
      /run/nvidia from nvidia-run-path (rw)
      /runtime/config-dir/ from containerd-config (rw)
      /runtime/sock-dir/ from containerd-socket (rw)
      /usr/local/nvidia from toolkit-install-dir (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9fqjh (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            True
  ContainersReady  True
  PodScheduled     True
Volumes:
  nvidia-run-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  toolkit-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/nvidia
    HostPathType:
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:
  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /var/snap/microk8s/current/args
    HostPathType:
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /var/snap/microk8s/common/run
    HostPathType:
  kube-api-access-9fqjh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:       BestEffort
Node-Selectors:  nvidia.com/gpu.deploy.container-toolkit=true
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
                 nvidia.com/gpu:NoSchedule op=Exists
Events:          <none>
It feels like there are no validations for Vulkan.
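For what it's worth, the validations the toolkit container gates on are just marker files on the host (the /run/nvidia/validations mount in the describe output above, e.g. host-driver-ready in its Args), so you can at least see which checks have run:

```shell
# The toolkit container's startup script checks for marker files under
# /run/nvidia/validations; listing the directory on the host shows
# which validations have completed. No Vulkan-specific marker appears.
ls /run/nvidia/validations
```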
Also, this snippet:
  host-root:
    Type: HostPath (bare host directory volume)
    Path: /
leaves me feeling cold, because I can see libnvidia-gpucomp.so.545.29.06 sitting right there:
ls /usr/lib/x86_64-linux-gnu/*.so* | grep nvidia
/usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.545.29.06
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.0
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.515.105.01
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.545.29.06
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.1
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.515.105.01
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.545.29.06
/usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.2
/usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.515.105.01
/usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.545.29.06
/usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.0
/usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.515.105.01
/usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-api.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.14.6
/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.14.6
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.0
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so
/usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.4
/usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-vulkan-producer.so.515.105.01
So I guess this really becomes a microk8s support question:
how can I get that hostPath to be /usr/lib/x86_64-linux-gnu when running `microk8s enable gpu`, which installs all the Helm charts for the GPU Operator?
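In the meantime, the blunt workaround I'm considering (untested sketch; the volume name "gpucomp-hack" is my own) is to bypass the toolkit for this one file and hostPath-mount the missing library into the job directly:

```shell
# Untested sketch: mount the single missing driver library from the
# host into the container at the path where Vulkan's loader expects it.
microk8s kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: awesome
spec:
  template:
    spec:
      restartPolicy: Never
      volumes:
      - name: gpucomp-hack
        hostPath:
          path: /usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.545.29.06
          type: File
      containers:
      - name: omniverse-replicator
        image: nvidia/vulkan:1.2.133-450
        command: [ vulkaninfo ]
        env:
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
        volumeMounts:
        - name: gpucomp-hack
          mountPath: /usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.545.29.06
          readOnly: true
        resources:
          limits:
            nvidia.com/gpu: 1
EOF
```

That obviously doesn't fix the operator, and it pins the driver version into the manifest, but it would confirm whether the missing file is the only problem.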