Thank you for your comments.
Any comments would be appreciated.
I installed the GPU Operator on Ubuntu 20.04 by following the instructions below.
(Currently I am using AWS EC2, so this is easily reproducible.)
It seems to be an nvidia-container-toolkit problem (judging from the Docker case).
1) CUDA Setup
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda
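After the CUDA packages are in place (a reboot may be needed to load the driver), the install can be sanity-checked. This is a minimal sketch; the exact output depends on the instance type:

```shell
# Confirm the kernel driver is loaded and a GPU is visible.
nvidia-smi
# Confirm the CUDA toolkit version that was installed.
/usr/local/cuda/bin/nvcc --version
```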
2) Kubernetes Installation
In the following script, # comments were intentionally removed, since # acts as a markdown character in this comment window.
sudo swapoff -a
sudo bash -c 'cat > /etc/modules-load.d/containerd.conf <<EOF
overlay
br_netfilter
EOF'
sudo modprobe overlay
sudo modprobe br_netfilter
sudo bash -c 'cat > /etc/sysctl.d/99-kubernetes-cri.conf <<EOF
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF'
sudo sysctl --system
sudo apt install -y containerd
sudo mkdir -p /etc/containerd
sudo containerd config default > /etc/containerd/config.toml
sudo systemctl restart containerd
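One possible mismatch worth checking here (an assumption on my part, not confirmed from the logs): the default containerd config uses the cgroupfs driver, while the kubelet below is configured for systemd. Depending on the containerd version, the relevant key is either `SystemdCgroup` under the runc options or `systemd_cgroup` under the CRI plugin section:

```shell
# See which cgroup-driver key this containerd version uses.
grep -n -i 'systemd' /etc/containerd/config.toml
# Align containerd's cgroup driver with the kubelet (systemd);
# adjust the key name if your version uses systemd_cgroup instead.
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd
```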
sudo apt update && sudo apt install -y apt-transport-https curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
sudo apt update
sudo apt install -y kubelet kubeadm kubectl
cat <<EOF | sudo tee /etc/default/kubelet
KUBELET_EXTRA_ARGS=--cgroup-driver=systemd
EOF
sudo kubeadm init --pod-network-cidr=10.217.0.0/16
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl create -f https://raw.githubusercontent.com/cilium/cilium/1.9.8/install/kubernetes/quick-install.yaml
kubectl taint nodes --all node-role.kubernetes.io/master-
3) GPU Operator
helm install --wait --generate-name nvidia/gpu-operator
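For completeness, the `helm install` above assumes the NVIDIA chart repository was already added, roughly like this:

```shell
# Add and refresh the NVIDIA helm repository (assumed prerequisite).
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
```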
A) Verify
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-172-31-5-238 Ready control-plane,master 10m v1.21.3 172.31.5.238 Ubuntu 20.04.1 LTS 5.4.0-1029-aws containerd://1.3.3-0ubuntu2
kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default cuda-vectoradd 0/1 Pending 0 5m18s
default gpu-operator-1628076209-node-feature-discovery-master-5569rxbkv 1/1 Running 0 7m20s
default gpu-operator-1628076209-node-feature-discovery-worker-wb5kx 1/1 Running 0 7m20s
default gpu-operator-6b5666bb8b-m84jk 1/1 Running 0 7m20s
kube-system cilium-lffvb 1/1 Running 0 10m
kube-system cilium-operator-6bf8f5748c-sfd2z 1/1 Running 0 10m
kube-system coredns-558bd4d5db-699xp 1/1 Running 0 10m
kube-system coredns-558bd4d5db-l5gkk 1/1 Running 0 10m
kube-system etcd-ip-172-31-5-238 1/1 Running 0 11m
kube-system kube-apiserver-ip-172-31-5-238 1/1 Running 0 11m
kube-system kube-controller-manager-ip-172-31-5-238 1/1 Running 0 11m
kube-system kube-proxy-z7hlx 1/1 Running 0 10m
kube-system kube-scheduler-ip-172-31-5-238 1/1 Running 0 10m
B) Try to run a container
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources: {} # intentionally omit GPU resource limit for GPU sharing
EOF
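For comparison, the non-sharing variant would request the GPU explicitly through the device plugin; the container spec would instead carry a limit of the usual form (a sketch, indented as it would sit under the container entry):

```yaml
    resources:
      limits:
        nvidia.com/gpu: 1
```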
C) Container Check
kubectl describe pods cuda-vectoradd
Name: cuda-vectoradd
Namespace: default
Priority: 0
Node: ip-172-31-47-101/172.31.47.101
Start Time: Wed, 04 Aug 2021 12:55:24 +0000
Labels:
Annotations:
Status: Running
IP: 10.0.0.133
IPs:
IP: 10.0.0.133
Containers:
cuda-vectoradd:
Container ID: containerd://0dd4e20612490e92001baa7fab5d8c5aca861a011421c6c420197cc0e644e320
Image: nvidia/samples:vectoradd-cuda11.2.1
Image ID: docker.io/nvidia/samples@sha256:ea7e32c1552485cc0093d66a55883d1624a963c6cfaff21db4bd57b50ab27eae
Port:
Host Port:
State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 04 Aug 2021 12:55:39 +0000
Finished: Wed, 04 Aug 2021 12:55:39 +0000
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 04 Aug 2021 12:55:25 +0000
Finished: Wed, 04 Aug 2021 12:55:25 +0000
Ready: False
Restart Count: 2
Environment:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ppj4b (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-ppj4b:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
Normal Scheduled 21s default-scheduler Successfully assigned default/cuda-vectoradd to ip-172-31-47-101
Normal Pulled 7s (x3 over 20s) kubelet Container image "nvidia/samples:vectoradd-cuda11.2.1" already present on machine
Normal Created 6s (x3 over 20s) kubelet Created container cuda-vectoradd
Normal Started 6s (x3 over 20s) kubelet Started container cuda-vectoradd
Warning BackOff 6s (x3 over 19s) kubelet Back-off restarting failed container
D) Container Check 2
kubectl logs cuda-vectoradd
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]
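The error means the vectorAdd binary inside the container could not reach the host driver. Some diagnostic steps I would try next (the namespace and key names below are my assumptions; the operands usually run in the gpu-operator-resources namespace, but actual names may differ):

```shell
# Check that the operator's driver/toolkit operand pods came up.
kubectl get pods -n gpu-operator-resources
# Check whether the toolkit reconfigured containerd with the nvidia runtime.
grep -i -A3 'nvidia' /etc/containerd/config.toml
# Check which runtime the CRI plugin uses by default.
grep -i 'default_runtime_name' /etc/containerd/config.toml
```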