Is the vectoradd-cuda container for 11.4 available?

I tried to run the vectoradd-cuda 11.2.1 container on the current CUDA 11.4.1, but it failed to execute.
Is there a container that runs on CUDA 11.4.1?

$ nvidia-smi
Wed Aug  4 13:10:49 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   27C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ sudo docker run nvidia/samples:vectoradd-cuda11.2.1
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!

Reference
CUDA Toolkit 11.4 Update 1 Downloads

Before trying to run the container, verify your CUDA 11.4.1 install. Verification instructions are listed in the CUDA Linux installation guide.
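For example, a minimal verification pass on the host might look like the following; the paths are illustrative and assume the default /usr/local/cuda symlink plus a checkout of the cuda-samples repository:

nvcc --version                      # toolkit version on the host (add /usr/local/cuda/bin to PATH if needed)
nvidia-smi                          # driver version and the highest CUDA version it supports
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/vectorAdd   # sample layout may differ between releases
make && ./vectorAdd                 # should report Test PASSED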


Thank you for your suggestion.
The cuda-samples themselves work, but vectoradd-cuda:11.2.1 is still not working.

$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 470.57.02 Tue Jul 13 16:14:05 UTC 2021
GCC version: gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)

~/cuda-samples/Samples/vectorAddMMAP$ ./vectorAddMMAP
Vector Addition (Driver API)

Using CUDA Device [0]: Tesla T4
Device 0 VIRTUAL ADDRESS MANAGEMENT SUPPORTED = 1.
findModulePath found file at <./vectorAdd_kernel64.fatbin>
initCUDA loading module: <./vectorAdd_kernel64.fatbin>
Result = PASS

When running a Docker container that needs GPUs, it is usually necessary to specify which GPUs you will expose inside the container. You can use the --gpus all switch, for example, to enable this.

A docker run command might look like:

sudo docker run --rm --gpus all nvidia/samples:vectoradd-cuda11.2.1

Can you try that?


Thank you for your suggestion. It works fine with Docker.
But the same problem persists on Kubernetes (*).

$ distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
    && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
    && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt install nvidia-docker2
$ sudo systemctl restart docker
$ sudo docker run --gpus all nvidia/samples:vectoradd-cuda11.2.1
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Reference for Kubernetes (*)

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-nvidia-gpu-operator

There’s not enough information here for me to be able to help with that.


Thank you for your comments.
Any further comments would be appreciated.

I installed following the instructions, using the GPU Operator on Ubuntu 20.04.
(Currently I use an AWS EC2 instance, so this is easily reproducible.)
It seems to be an nvidia-container-toolkit problem, judging from the Docker case.
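As a sanity check (only a sketch; the exact namespace and config path depend on the GPU Operator defaults), I can look at whether the operator's container-toolkit operands are running and whether an nvidia runtime has been registered with containerd:

kubectl get pods -A | grep -i -e toolkit -e device-plugin    # operand pods the GPU Operator should deploy
sudo grep -n nvidia /etc/containerd/config.toml              # runtime entries the container toolkit would add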

1)CUDA Setup

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda

2)Kubernetes Installation
The following script intentionally omits # comment characters, since this comment window treats them as markdown.

sudo swapoff -a
sudo bash -c 'cat > /etc/modules-load.d/containerd.conf <<EOF
overlay
br_netfilter
EOF'
sudo modprobe overlay
sudo modprobe br_netfilter
sudo bash -c 'cat > /etc/sysctl.d/99-kubernetes-cri.conf <<EOF
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-ip6tables = 1
EOF'
sudo sysctl --system

sudo apt install -y containerd
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
sudo systemctl restart containerd
sudo apt update && sudo apt install -y apt-transport-https curl
sudo curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo cat <<EOF | sudo tee /etc/apt/sources.list.d/kubernetes.list
deb https://apt.kubernetes.io/ kubernetes-xenial main
EOF
sudo apt update
sudo apt install -y kubelet kubeadm kubectl
sudo cat <<EOF | sudo tee /etc/default/kubelet
KUBELET_EXTRA_ARGS=--cgroup-driver=systemd
EOF

sudo kubeadm init --pod-network-cidr=10.217.0.0/16

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl create -f https://raw.githubusercontent.com/cilium/cilium/1.9.8/install/kubernetes/quick-install.yaml
kubectl taint nodes --all node-role.kubernetes.io/master-

3)GPU Operator

helm install --wait --generate-name nvidia/gpu-operator
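(For completeness, the NVIDIA Helm repository has to be added before the install; the repository URL below is the one given in the getting-started guide I followed and may have changed since.)

helm repo add nvidia https://nvidia.github.io/gpu-operator    # repository URL as given in the guide
helm repo update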

A)Verify

kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-172-31-5-238 Ready control-plane,master 10m v1.21.3 172.31.5.238 Ubuntu 20.04.1 LTS 5.4.0-1029-aws containerd://1.3.3-0ubuntu2

kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default cuda-vectoradd 0/1 Pending 0 5m18s
default gpu-operator-1628076209-node-feature-discovery-master-5569rxbkv 1/1 Running 0 7m20s
default gpu-operator-1628076209-node-feature-discovery-worker-wb5kx 1/1 Running 0 7m20s
default gpu-operator-6b5666bb8b-m84jk 1/1 Running 0 7m20s
kube-system cilium-lffvb 1/1 Running 0 10m
kube-system cilium-operator-6bf8f5748c-sfd2z 1/1 Running 0 10m
kube-system coredns-558bd4d5db-699xp 1/1 Running 0 10m
kube-system coredns-558bd4d5db-l5gkk 1/1 Running 0 10m
kube-system etcd-ip-172-31-5-238 1/1 Running 0 11m
kube-system kube-apiserver-ip-172-31-5-238 1/1 Running 0 11m
kube-system kube-controller-manager-ip-172-31-5-238 1/1 Running 0 11m
kube-system kube-proxy-z7hlx 1/1 Running 0 10m
kube-system kube-scheduler-ip-172-31-5-238 1/1 Running 0 10m

B)Try to run container

cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources: {} # intentionally omit GPU resource limit for GPU sharing
EOF
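
(For comparison, the Kubernetes analogue of Docker's --gpus all would be to request a GPU explicitly through a resource limit; nvidia.com/gpu below is the resource name advertised by the GPU Operator's device plugin, assuming that plugin is deployed.)

cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-gpu
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1   # request one GPU from the device plugin
EOF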

C) Container Check

kubectl describe pods cuda-vectoradd
Name: cuda-vectoradd
Namespace: default
Priority: 0
Node: ip-172-31-47-101/172.31.47.101
Start Time: Wed, 04 Aug 2021 12:55:24 +0000
Labels:
Annotations:
Status: Running
IP: 10.0.0.133
IPs:
IP: 10.0.0.133
Containers:
cuda-vectoradd:
Container ID: containerd://0dd4e20612490e92001baa7fab5d8c5aca861a011421c6c420197cc0e644e320
Image: nvidia/samples:vectoradd-cuda11.2.1
Image ID: docker.io/nvidia/samples@sha256:ea7e32c1552485cc0093d66a55883d1624a963c6cfaff21db4bd57b50ab27eae
Port:
Host Port:
State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 04 Aug 2021 12:55:39 +0000
Finished: Wed, 04 Aug 2021 12:55:39 +0000
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 04 Aug 2021 12:55:25 +0000
Finished: Wed, 04 Aug 2021 12:55:25 +0000
Ready: False
Restart Count: 2
Environment:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ppj4b (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-ppj4b:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message


Normal Scheduled 21s default-scheduler Successfully assigned default/cuda-vectoradd to ip-172-31-47-101
Normal Pulled 7s (x3 over 20s) kubelet Container image "nvidia/samples:vectoradd-cuda11.2.1" already present on machine
Normal Created 6s (x3 over 20s) kubelet Created container cuda-vectoradd
Normal Started 6s (x3 over 20s) kubelet Started container cuda-vectoradd
Warning BackOff 6s (x3 over 19s) kubelet Back-off restarting failed container

D) Container Check 2

kubectl logs cuda-vectoradd
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
[Vector addition of 50000 elements]