Overview
We are trying to create and start a pod that includes a container that accesses the GPU, using podman. However, an error occurs at the "Error occurrence" step below. Please tell me how to resolve it.
Host machine settings
- Install CUDA Toolkit 12.2
- Install NVIDIA Container Toolkit
- Generate the CDI Specification file for Podman
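For reference, generating the CDI specification typically looks like the following (a sketch based on the NVIDIA Container Toolkit documentation; the output path is the common default and may differ on other systems):

```shell
# Generate a CDI specification describing the installed NVIDIA devices
# (default CDI path; adjust if your distribution uses a different directory).
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the device names the generated spec exposes,
# e.g. nvidia.com/gpu=0 and nvidia.com/gpu=all.
nvidia-ctk cdi list
```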
Creating the containers and starting them with podman
- Create 6 types of containers using Dockerfiles. These containers include applications that use the GPU, and they have a track record of running in a k8s environment.
- Create a .yaml file to start them with podman
Here is a sample of the YAML. All 6 types use almost the same YAML file.
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  restartPolicy: OnFailure
  containers:
  - name: test
    image: localhost/test:latest
    securityContext:
      privileged: true
    volumeMounts:
    - name: key-shm
      mountPath: /dev/shm/
    device:
    - nvidia-gpu
  volumes:
  - name: key-shm
    hostPath:
      path: /dev/shm/
      type: Directory
- Start with the podman command
I started it using the following command: podman play kube test.yaml
- Error occurrence
The error output varies depending on which container is created. Here is one example:
Pod:
3e9afb45b9047c9a0f6b0d511b59e8d480794fcf4909804c93e9df72e8d4fd06
Container:
106d1afed2092c329aa8329a6ceeb6ccb2e48a1fb58052e2cf21b39bfe9d3a4d
error starting container 106d1afed2092c329aa8329a6ceeb6ccb2e48a1fb58052e2cf21b39bfe9d3a4d: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: invalid expression: OCI runtime error
./test: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
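To isolate the problem, my understanding is that CDI injection can also be checked outside of kube play with a plain run (a sketch; `nvidia.com/gpu=all` assumes the generated spec exposes that device name, and the files injected by the spec normally include `nvidia-smi` and `libcuda.so.1`):

```shell
# If this prints the GPU table, the CDI spec and driver injection work,
# and the failure is specific to podman play kube / the YAML above.
# If libcuda.so.1 is still missing here, the CDI setup itself is the problem.
podman run --rm --device nvidia.com/gpu=all localhost/test:latest nvidia-smi
```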