Running Replicator 1.6.3 from inside of Kubernetes (microk8s or minikube)

So you can follow along, here is my full background setup.

First, set up minikube.

I am on Ubuntu 22.04.4 LTS.

curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube_latest_amd64.deb
sudo dpkg -i minikube_latest_amd64.deb

To use the GPU with minikube, see the "Using NVIDIA GPUs with minikube" page in the minikube docs.

NVIDIA Driver

Turn off BPF JIT hardening:

echo "net.core.bpf_jit_harden=0" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
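
You can confirm the setting took effect with

sysctl net.core.bpf_jit_harden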

Install the NVIDIA Container Toolkit:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
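
In case it saves a click, the short version on Ubuntu (assuming the NVIDIA apt repository has already been configured as described in that guide) is

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit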

Docker config

sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker
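
A quick sanity check that Docker itself can now see the GPU (the ubuntu image here is just an example; nvidia-smi gets injected by the NVIDIA runtime):

docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi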

Run minikube:

minikube start --driver docker --container-runtime docker --gpus all
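
It is also worth checking that the minikube node actually advertises the GPU resource before scheduling anything (assuming the default node name minikube):

kubectl describe node minikube | grep nvidia.com/gpu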

Then you can get Replicator 1.6.3 to run forever as a Job with this manifest:

more runforever.yml

apiVersion: batch/v1
kind: Job
metadata:
  name: awesome
spec:
  ttlSecondsAfterFinished: 900
  template:
    spec:

      restartPolicy: Never
      volumes:
        - name: shared-memory-hack
          emptyDir:
            medium: Memory
      containers:
        - name: omniverse-replicator
          image: nvcr.io/nvidia/omniverse-replicator:1.6.3
          # Just spin & wait forever
          imagePullPolicy: Never
          command: [ "/bin/bash", "-c", "--" ]
          args: [ "while true; do sleep 30; done;" ]
          stdin: true
          tty: true        
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 20Gi
              cpu: 4000m
            requests:
              memory: 20Gi
              cpu: 4000m

Then to spawn this on minikube, just do

kubectl create -f runforever.yml

You can check everything is cool with kubectl get pods -A.

Then you will see something like awesome-XyAb6, where your suffix will be different.

To get inside the container running in that pod you just need to do

kubectl exec --stdin --tty awesome-XyAb6 -- /bin/bash

substituting your pod's suffix after awesome, taken from the kubectl get pods -A listing.
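
If you do not want to copy the suffix by hand, something like this should grab it (the job-name label is added automatically by the Job controller):

POD=$(kubectl get pods -l job-name=awesome -o jsonpath='{.items[0].metadata.name}')
kubectl exec --stdin --tty "$POD" -- /bin/bash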

The command I issued is /startup.sh --/app/printConfig=true

because startup.sh is mounted at / in the container, and that should print the config.

But it fails as follows:

root@awesome-2rdtb:/# /startup.sh --/app/printConfig=true

Fatal Error: Can't find libGLX_nvidia.so.0...

Ensure running with NVIDIA runtime. (--gpus all) or (--runtime nvidia)

But to be sure, I check that the GPU is present with the trusty nvidia-smi command:

root@awesome-2rdtb:/# nvidia-smi 
Wed Mar 27 06:13:47 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:01:00.0  On |                  Off |
| 30%   40C    P8              33W / 300W |    568MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

So why does this not work?
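
One way to see what the runtime actually injected into the container (run from the shell inside the pod) is to ask the loader directly; if libGLX_nvidia.so.0 is missing from this list, that matches the startup error above:

ldconfig -p | grep nvidia
ls /usr/lib/x86_64-linux-gnu/ | grep -i nvidia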

When you are done, you can remove the Job with kubectl delete -f runforever.yml.

OK, it's worse than that.

microk8s has a similar problem with a base image from NVIDIA.

Docker is OK

docker run --rm --gpus 1 nvidia/vulkan:1.2.133-450 vulkaninfo | head
'DISPLAY' environment variable not set... skipping surface info
error: XDG_RUNTIME_DIR not set in the environment.
==========
VULKANINFO
==========

Vulkan Instance Version: 1.2.131


Instance Extensions: count = 19
====================
	VK_EXT_acquire_xlib_display            : extension revision 1
write /dev/stdout: broken pipe

microk8s is OK(ish)

From this input:

microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

I get this output, which says the NVIDIA driver is working OK, since CUDA has executed kernel code in the microk8s environment:

microk8s kubectl logs cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

But from this Job:

microk8s kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: awesome
spec:
  ttlSecondsAfterFinished: 900
  template:
    spec:
      restartPolicy: Never
      volumes:
        - name: shared-memory-hack
          emptyDir:
            medium: Memory
      containers:
        - name: omniverse-replicator
          image: nvidia/vulkan:1.2.133-450
          command: [ vulkaninfo ]
          imagePullPolicy: Always
          env:
            - name: NVIDIA_DRIVER_CAPABILITIES
              value:  all
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 20Gi
              cpu: 4000m
            requests:
              memory: 20Gi
              cpu: 4000m
EOF

I get

ERROR: [Loader Message] Code 0 : libnvidia-gpucomp.so.545.29.06: cannot open shared object file: No such file or directory
Cannot create Vulkan instance.
This problem is often caused by a faulty installation of the Vulkan driver or attempting to use a GPU that does not support Vulkan.
/root/sdk-build/1.2.131.2/source/Vulkan-Tools/vulkaninfo/vulkaninfo.h:371: failed with ERROR_INCOMPATIBLE_DRIVER

From this command:

microk8s kubectl describe pods/nvidia-container-toolkit-daemonset-${PODNUM} -n gpu-operator-resources
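
where ${PODNUM} is the random suffix on the toolkit pod, which you can find with, for example

microk8s kubectl get pods -n gpu-operator-resources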

I get the output

Name:                 nvidia-container-toolkit-daemonset-fqsgx
Namespace:            gpu-operator-resources
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-container-toolkit
Node:                 skylab/10.90.184.153
Start Time:           Thu, 28 Mar 2024 15:53:11 +1030
Labels:               app=nvidia-container-toolkit-daemonset
                      controller-revision-hash=588df5b4d
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: d1a9b8cdad3912d9209a22a6b40e92b0ba9b230330408ca5247b5952d5b2c972
                      cni.projectcalico.org/podIP: 10.1.16.219/32
                      cni.projectcalico.org/podIPs: 10.1.16.219/32
Status:               Running
IP:                   10.1.16.219
IPs:
  IP:           10.1.16.219
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  containerd://bc62340a7e51b140207537ee8059fa4f733bc51c36086146e1e16196303ee766
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:18c9ea88ae06d479e6657b8a4126a8ee3f4300a40c16ddc29fb7ab3763d46005
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 28 Mar 2024 16:15:14 +1030
      Finished:     Thu, 28 Mar 2024 16:15:14 +1030
    Ready:          True
    Restart Count:  1
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /host from host-root (ro)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9fqjh (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:  containerd://bcc06b6954fb25750d5d40c038a11120c736cfb0528d9ace429d2745f89706ba
    Image:         nvcr.io/nvidia/k8s/container-toolkit:v1.11.0-ubuntu20.04
    Image ID:      nvcr.io/nvidia/k8s/container-toolkit@sha256:c4ed1e0345f6d3e2eec37601f97495778ae54dc407db08ebead22a580ec96542
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -c
    Args:
      [[ -f /run/nvidia/validations/host-driver-ready ]] && driver_root=/ || driver_root=/run/nvidia/driver; export NVIDIA_DRIVER_ROOT=$driver_root; sleep 5; exec nvidia-toolkit /usr/local/nvidia
    State:          Running
      Started:      Thu, 28 Mar 2024 16:15:15 +1030
    Last State:     Terminated
      Reason:       Unknown
      Exit Code:    255
      Started:      Thu, 28 Mar 2024 15:53:29 +1030
      Finished:     Thu, 28 Mar 2024 16:15:03 +1030
    Ready:          True
    Restart Count:  1
    Environment:
      RUNTIME_ARGS:               --socket /runtime/sock-dir/containerd.sock --config /runtime/config-dir/containerd-template.toml
      CONTAINERD_CONFIG:          /var/snap/microk8s/current/args/containerd-template.toml
      CONTAINERD_SOCKET:          /var/snap/microk8s/common/run/containerd.sock
      CONTAINERD_SET_AS_DEFAULT:  1
      RUNTIME:                    containerd
      CONTAINERD_RUNTIME_CLASS:   nvidia
    Mounts:
      /host from host-root (ro)
      /run/nvidia from nvidia-run-path (rw)
      /runtime/config-dir/ from containerd-config (rw)
      /runtime/sock-dir/ from containerd-socket (rw)
      /usr/local/nvidia from toolkit-install-dir (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9fqjh (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  nvidia-run-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:  
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  toolkit-install-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/nvidia
    HostPathType:  
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:  
  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /var/snap/microk8s/current/args
    HostPathType:  
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /var/snap/microk8s/common/run
    HostPathType:  
  kube-api-access-9fqjh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.container-toolkit=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:                      <none>

It feels like there are no validations for Vulkan.

Also this snippet

  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /

leaves me feeling cold, because I can see libnvidia-gpucomp.so.545.29.06 sitting right there on the host at

/usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.545.29.06

in the full output of

ls /usr/lib/x86_64-linux-gnu/*.so* | grep nvidia

/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.0
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.515.105.01
/usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.545.29.06
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.1
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.515.105.01
/usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.545.29.06
/usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.2
/usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.515.105.01
/usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.545.29.06
/usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.0
/usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.515.105.01
/usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-api.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1.14.6
/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-container.so.1.14.6
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.0
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so
/usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.4
/usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.105.01
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.545.29.06
/usr/lib/x86_64-linux-gnu/libnvidia-vulkan-producer.so.515.105.01

So I guess this becomes a microk8s support question really.

How can I get the HostPath to include /usr/lib/x86_64-linux-gnu when running microk8s enable gpu, which installs all the Helm charts for the GPU Operator?

Looks like the answer is hidden somewhere in

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#pre-installed-nvidia-gpu-drivers

and

https://microk8s.io/docs/addon-gpu#use-host-drivers-and-runtime-4

which hopefully have it covered. The microk8s page is written for Ubuntu 20.04 rather than 22.04, but other than that it fits the bill.

This was the missing piece; the pivotal parts were as follows.

When using microk8s on a local machine for development, it is important to use the host's driver and the host's nvidia-container-runtime, and to tell microk8s to pass these down to the containerd that microk8s uses.

This is actually documented here:
https://microk8s.io/docs/addon-gpu#use-host-drivers-and-runtime-4

but if you have already tried something else, you first need to run

microk8s disable gpu

to uninstall previous attempts.
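
You can confirm the addon really is off before re-enabling it, for example with

microk8s status | grep gpu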

From this point you can make sure that the local install of the NVIDIA driver and NVIDIA container runtime is working as expected, and that the

/var/snap/microk8s/current/args/containerd-template.toml

contains


        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          runtime_type = "io.containerd.runc.v2"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

and also check that nvidia-container-runtime is indeed at

/usr/bin/nvidia-container-runtime
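
You can sanity-check that with, for example

ls -l /usr/bin/nvidia-container-runtime
nvidia-container-runtime --version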

From here you can do

sudo microk8s stop
sudo microk8s start
microk8s disable gpu
sudo microk8s stop
sudo microk8s start
microk8s enable gpu --driver host --set toolkit.enabled=false
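
After that, give the GPU Operator pods a minute or two and check that the node advertises the GPU resource again; something like

microk8s kubectl get pods -n gpu-operator-resources
microk8s kubectl describe node | grep nvidia.com/gpu

should show the operator pods Running and an nvidia.com/gpu count on the node.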

At this point the following base CUDA example should work:

microk8s kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  runtimeClassName: nvidia
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      env:
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: compute,utility
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

The key here is

  runtimeClassName: nvidia
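
You can confirm the nvidia RuntimeClass actually exists with

microk8s kubectl get runtimeclass

and nvidia should appear in the list.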

From here you can level up to an NVIDIA Vulkan example:

microk8s kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: vulkantest
spec:
  ttlSecondsAfterFinished: 900
  template:
    spec:
      runtimeClassName: nvidia
      restartPolicy: Never
      volumes:
        - name: shared-memory-hack
          emptyDir:
            medium: Memory
      containers:
        - name: omniverse-replicator
          image: nvidia/vulkan:1.2.133-450
          command: [ vulkaninfo ]
          imagePullPolicy: Always
          env:
            - name: NVIDIA_DRIVER_CAPABILITIES
              value:  all
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 20Gi
              cpu: 4000m
            requests:
              memory: 20Gi
              cpu: 4000m
EOF
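
If everything is wired up, the vulkaninfo output lands in the Job's pod logs, for example

microk8s kubectl logs job/vulkantest | head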