Completely purge and reinstall the NVIDIA GPU Operator

Yes, I think so. You can leverage the setup.sh inside TAO 5.0. In its log output you can also find the helm commands used to install the nvidia-gpu-operator.
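For example, if you redirect the setup output to a file (the log file name here is just an illustration), you can grep the helm commands back out of it later:

$ bash setup.sh install 2>&1 | tee tao_setup.log
$ grep -n "helm" tao_setup.log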

Hi,
I can uninstall and reinstall the nvidia-gpu-operator with the commands below.
I tested successfully on an A40. Please try on your side.

Uninstall:

$ helm delete -n nvidia-gpu-operator $(helm list -n nvidia-gpu-operator | grep nvidia-gpu-operator | awk '{print $1}')
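A quick sanity check (not part of the original steps) to confirm the release and its pods are really gone before reinstalling:

$ helm list -n nvidia-gpu-operator
$ kubectl get pods -n nvidia-gpu-operator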

Install:

$ helm show --version=v23.3.2 values nvidia/gpu-operator > /tmp/values.yaml
$ helm install --version 23.3.2 --values /tmp/values.yaml --create-namespace --namespace nvidia-gpu-operator --devel nvidia/gpu-operator --set driver.enabled=False --set driver.repository='nvcr.io/nvidia',driver.imagePullSecrets[0]=registry-secret,driver.licensingConfig.configMapName=licensing-config,driver.version='525.85.12' --wait --generate-name

Just wondering: shouldn't this also have --set toolkit.enabled=false if we are using a DGX (because both the driver and the NVIDIA Container Toolkit are already installed on DGX systems)?

Adding --set toolkit.enabled=false on the command line results in bad pods, so I did not use it in my steps.

Interesting! Was nvidia-operator-validator-*** the "bad" pod (with Init:CrashLoopBackOff/Error)?

Not the same. They were the pods mentioned above.

Could you do a watch kubectl get pods -n nvidia-gpu-operator and see if the nvidia-operator-validator-7wgwv pod goes into an error state, please (with --set toolkit.enabled=false when installing the chart)?

I got the error again even after resetting the cluster, which is very strange (maybe I didn't reset deeply enough, i.e. I didn't delete everything and kubeadm reset alone wasn't enough?).
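For reference, a deeper reset than kubeadm reset alone would typically also clear the CNI and kubeconfig leftovers; the paths below are the common defaults, so please adapt them to your own cluster before running anything:

sudo kubeadm reset -f
sudo rm -rf /etc/cni/net.d $HOME/.kube/config
sudo systemctl restart containerd kubelet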

I confirm that the pod goes into an error state with --set toolkit.enabled=false.
If run without --set toolkit.enabled=false, there is no error.
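If it helps to dig further, describing the failing pod and pulling logs from its validation init containers usually shows which check is stuck (the init-container names below are the usual ones in recent operator releases and may differ in yours):

$ kubectl -n nvidia-gpu-operator describe pod nvidia-operator-validator-7wgwv
$ kubectl -n nvidia-gpu-operator logs nvidia-operator-validator-7wgwv -c driver-validation
$ kubectl -n nvidia-gpu-operator logs nvidia-operator-validator-7wgwv -c toolkit-validation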


Thanks a lot for this! I can also confirm that not having that line solves my problem too, but then I'm wondering if it has created problems I have yet to find out, because that is not what the documentation says, since the DGX already has the container toolkit installed (however, I'm running this helm command on the k8 master node, which is a normal Fujitsu CPU server).

Here is the long version:

helm uninstall -n gpu-operator $(helm list -n gpu-operator | grep gpu-operator | awk '{print $1}')
release "gpu-operator-1692955459" uninstalled

delete clusterpolicy

kubectl delete crd clusterpolicies.nvidia.com
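To double-check the cleanup before reinstalling, these should show nothing left behind:

kubectl get crd | grep -i nvidia
helm list -n gpu-operator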

Install commands

get values

helm show --version=v23.3.2 values nvidia/gpu-operator > gpu_operator_values.yaml

helm command (run on control plane terminal)

helm install --version 23.3.2 --values gpu_operator_values.yaml --create-namespace --namespace gpu-operator --devel nvidia/gpu-operator --set driver.enabled=False --set driver.repository='nvcr.io/nvidia',driver.imagePullSecrets[0]=registry-secret,driver.licensingConfig.configMapName=licensing-config,driver.version='525.85.12' --wait --generate-name

Then voilà, it worked!!

g@gsrv:~$ kubectl get pods -A 
NAMESPACE       NAME                                                              READY   STATUS      RESTARTS   AGE
calico-system   calico-kube-controllers-658996b7c6-9dcjr                          1/1     Running     0          19h
calico-system   calico-node-4kt6f                                                 1/1     Running     0          19h
calico-system   calico-node-rh2pg                                                 1/1     Running     0          70m
calico-system   calico-typha-767789dd9-kf9r6                                      1/1     Running     0          19h
calico-system   csi-node-driver-2sncd                                             2/2     Running     0          70m
calico-system   csi-node-driver-9cfk5                                             2/2     Running     0          19h
gpu-operator    gpu-feature-discovery-d6qm2                                       1/1     Running     0          5m11s
gpu-operator    gpu-operator-1692958626-node-feature-discovery-master-5867w7mts   1/1     Running     0          5m32s
gpu-operator    gpu-operator-1692958626-node-feature-discovery-worker-hxtb2       1/1     Running     0          5m32s
gpu-operator    gpu-operator-79766c58c4-wwbg5                                     1/1     Running     0          5m32s
gpu-operator    nvidia-container-toolkit-daemonset-d64zs                          1/1     Running     0          5m11s
gpu-operator    nvidia-cuda-validator-w5qzg                                       0/1     Completed   0          4m33s
gpu-operator    nvidia-dcgm-exporter-khnnw                                        1/1     Running     0          5m11s
gpu-operator    nvidia-device-plugin-daemonset-rhwrd                              1/1     Running     0          5m11s
gpu-operator    nvidia-device-plugin-validator-n2rg7                              0/1     Completed   0          3m7s
gpu-operator    nvidia-mig-manager-mm2pr                                          1/1     Running     0          2m17s
gpu-operator    nvidia-operator-validator-g65q4                                   1/1     Running     0          5m11s
kube-system     coredns-57575c5f89-fp9z8                                          1/1     Running     0          19h
kube-system     coredns-57575c5f89-ln2d6                                          1/1     Running     0          19h
kube-system     etcd-gsrv                                                         1/1     Running     0          19h
kube-system     kube-apiserver-gsrv                                               1/1     Running     0          19h
kube-system     kube-controller-manager-gsrv                                      1/1     Running     0          19h
kube-system     kube-proxy-n7rpf                                                  1/1     Running     0          19h
kube-system     kube-proxy-p5v5f                                                  1/1     Running     0          70m
kube-system     kube-scheduler-gsrv                                               1/1     Running     0          19h

Note: a new nvidia-container-toolkit-daemonset-*** is created by the change and took about five and a half minutes to settle down and reach the Running state; even before, it took a few minutes to get into a working state, so no real change in time IMO.
Please find the logs where all pods in the namespace are described:
logs_from_all_pods_in_gpu_operator_namespace_when_working.txt (61.3 KB)

And now this raises a question.

Can you please confirm that

--set toolkit.enabled=false

is not needed in my case? I only used it because it was in the official guidance,

and it stresses that we need to use that particular command!

These steps should be followed when using the GPU Operator v1.9+ on DGX A100 systems with DGX OS 5.1+.

And I'm sure it worked before, until for some reason it mysteriously stopped working (maybe after an update of some sort?).

My current question is: is this fine with the DGX now?
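(Side note: one way to double-check which values a release was actually installed with is helm get values; the release name comes from helm list.)

helm list -n gpu-operator
helm get values <release-name> -n gpu-operator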

EDIT:

I think that on the DGX I need to use the toolkit for the changes to /etc/containerd/config.toml to take effect, because I don't seem to be able to run GPU jobs even with the gpu-operator installed without --set toolkit.enabled=false.
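For reference, a minimal "can I run GPU jobs" check is a throwaway pod that just calls nvidia-smi; the CUDA image tag below is only an example, so swap in whatever base image you have access to:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04   # example tag
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# once the pod shows Completed:
kubectl logs gpu-smoke-test
kubectl delete pod gpu-smoke-test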

So I guess we need to fix the nvidia-operator-validator-*** issue; even reinstalling doesn't seem to fix things now. I'm not sure if this was caused by an update or if I've missed some critical bit.

Can you please try with a DGX Station A100 (where the driver and toolkit are already installed) instead of the A40s? At this point all I'm doing is installing the cluster and Calico, adding the DGX node, and trying to install the gpu-operator (which I've done quite a few times without an issue).

Pretty much I'm following this (my notes): simplified k8 install instructions | DL Docs (should be straightforward to follow and replicate).

My daemon.json and config.toml files (these are slightly customised for the local Docker repo for the serverless-function images):
daemon.json (228 Bytes)
config.toml (7.3 KB)

In the gpu-operator logs I can see the following portion repeating:

{"level":"info","ts":1692994607.3503451,"logger":"controllers.ClusterPolicy","msg":"Sandbox workloads","Enabled":false,"DefaultWorkload":"container"}
{"level":"info","ts":1692994607.3505776,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"gsrv","GpuWorkloadConfig":"container"}
{"level":"info","ts":1692994607.3506613,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"dgx","GpuWorkloadConfig":"container"}
{"level":"info","ts":1692994607.3507133,"logger":"controllers.ClusterPolicy","msg":"Checking GPU state labels on the node","NodeName":"dgx"}
{"level":"info","ts":1692994607.3507454,"logger":"controllers.ClusterPolicy","msg":"Number of nodes with GPU label","NodeCount":1}
{"level":"info","ts":1692994607.3509986,"logger":"controllers.ClusterPolicy","msg":"Using container runtime: containerd"}
{"level":"info","ts":1692994607.351055,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RuntimeClass":"nvidia"}
{"level":"info","ts":1692994607.374309,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"pre-requisites","status":"ready"}
{"level":"info","ts":1692994607.374411,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Service":"gpu-operator","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.3809211,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-operator-metrics","status":"ready"}
{"level":"info","ts":1692994607.4035978,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-driver","status":"disabled"}
{"level":"info","ts":1692994607.4213452,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-container-toolkit","status":"disabled"}
{"level":"info","ts":1692994607.4287944,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.4356089,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.4469094,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.459245,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.4696825,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.4755728,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-operator-validator","Namespace":"gpu-operator","name":"nvidia-operator-validator"}
{"level":"info","ts":1692994607.4756615,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-operator-validator"}
{"level":"info","ts":1692994607.4756808,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-operator-validation","status":"notReady"}
{"level":"info","ts":1692994607.4813743,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-device-plugin","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.4873807,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-device-plugin","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.4979827,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-device-plugin","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.509261,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-device-plugin","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.5200124,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-device-plugin","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.5315511,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"nvidia-device-plugin-entrypoint","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.5382147,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-device-plugin-daemonset","Namespace":"gpu-operator","name":"nvidia-device-plugin-daemonset"}
{"level":"info","ts":1692994607.5383413,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-device-plugin-daemonset"}
{"level":"info","ts":1692994607.5383847,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-device-plugin","status":"notReady"}
{"level":"info","ts":1692994607.5546553,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-dcgm","status":"disabled"}
{"level":"info","ts":1692994607.560827,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-dcgm-exporter","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.5678113,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-dcgm-exporter","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.5798917,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-dcgm-exporter","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.5850806,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Service":"nvidia-dcgm-exporter","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.592655,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-dcgm-exporter","Namespace":"gpu-operator","name":"nvidia-dcgm-exporter"}
{"level":"info","ts":1692994607.5927503,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-dcgm-exporter"}
{"level":"info","ts":1692994607.5927699,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-dcgm-exporter","status":"notReady"}
{"level":"info","ts":1692994607.599265,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.6060112,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.6171765,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.628706,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.6393013,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.645476,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"gpu-feature-discovery","Namespace":"gpu-operator","name":"gpu-feature-discovery"}
{"level":"info","ts":1692994607.6455667,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"gpu-feature-discovery"}
{"level":"info","ts":1692994607.6455905,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"gpu-feature-discovery","status":"notReady"}
{"level":"info","ts":1692994607.6507008,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-mig-manager","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.6554327,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-mig-manager","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.663852,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-mig-manager","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.6720166,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-mig-manager","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.6806715,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-mig-manager","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.6905217,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"default-mig-parted-config","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.6994987,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"default-gpu-clients","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.7091198,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"nvidia-mig-manager-entrypoint","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.7135487,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-mig-manager","Namespace":"gpu-operator","name":"nvidia-mig-manager"}
{"level":"info","ts":1692994607.7136018,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-mig-manager","status":"ready"}
{"level":"info","ts":1692994607.731519,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-node-status-exporter","status":"disabled"}
{"level":"info","ts":1692994607.7502105,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vgpu-manager","status":"disabled"}
{"level":"info","ts":1692994607.766168,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vgpu-device-manager","status":"disabled"}
{"level":"info","ts":1692994607.7842896,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-validation","status":"disabled"}
{"level":"info","ts":1692994607.804841,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vfio-manager","status":"disabled"}
{"level":"info","ts":1692994607.822285,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-device-plugin","status":"disabled"}
{"level":"info","ts":1692994607.845774,"logger":"controllers.ClusterPolicy","msg":"Kata Manager disabled, deleting all Kata RuntimeClasses"}
{"level":"info","ts":1692994607.8458114,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-kata-manager","status":"disabled"}
{"level":"info","ts":1692994607.8648727,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-cc-manager","status":"disabled"}
{"level":"info","ts":1692994607.8649228,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy isn't ready","states not ready":["state-operator-validation","state-device-plugin","state-dcgm-exporter","gpu-feature-discovery"]}

full log
gpu-operator-9974dbcfc-6ftnf.txt (2.9 MB)
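(For anyone reproducing this: the controller log above can be tailed straight from the operator deployment; the deployment name is assumed from the pod name.)

kubectl -n gpu-operator logs deploy/gpu-operator --tail=200 -f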

Hi Ganindu,
For the issue when reinstalling the gpu-operator on a DGX Station A100, could you please create a topic in the DGX User Forum - NVIDIA Developer Forums, or use About the DGX User Forum / Note: this is not NVIDIA Enterprise Support to generate a ticket with enterprise support? Since the issue is related to the software stack on the DGX and is out of the scope of TAO, it is better to get help from the DGX team directly for better hints. You can also search or create topics on Issues · NVIDIA/gpu-operator · GitHub.

I leased a DGX-1 (with V100) and ran the uninstall and install successfully.

Steps:
$ helm ls -n nvidia-gpu-operator
$ helm uninstall --wait gpu-operator-1693131264 -n nvidia-gpu-operator
$ sudo vim /etc/containerd/config.toml (follow Getting Started — NVIDIA GPU Operator 23.9.0 documentation, section "Containerd:")
$ sudo systemctl restart containerd
$ helm install --wait --generate-name -n nvidia-gpu-operator --create-namespace nvidia/gpu-operator --set driver.enabled=false --set toolkit.enabled=false
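A quick way to confirm the reinstall settles is to watch the namespace until everything is Running or Completed:

$ watch kubectl get pods -n nvidia-gpu-operator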

That's great!! Are the DGX OS and the driver version up to the latest versions?

I leased a DGX-1 (with V100) machine and ran sudo apt update on it.
Below is the latest info.

local-morganh@nvc-dgx1-001:~$ cat /etc/dgx-release
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2022-06-06-13-55-03"
DGX_SWBUILD_VERSION="5.3.1"
DGX_COMMIT_ID=""
DGX_PLATFORM="DGX Server for DGX-1"
DGX_SERIAL_NUMBER="xxxxxxxxxx"
local-morganh@nvc-dgx1-001:~$ nvidia-smi
Sun Aug 27 18:51:22 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   36C    P0    44W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   34C    P0    42W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |
| N/A   33C    P0    43W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:86:00.0 Off |                    0 |
| N/A   36C    P0    44W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   39C    P0    44W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   36C    P0    44W / 300W |      0MiB / 32768MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

BTW, I fixed a "GPG error" before running sudo apt update. Then $ sudo apt install nvidia-driver-525. Also, I installed TAO 5.0 via $ bash setup.sh install to install k8s, etc.
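(For completeness: that GPG error on the NVIDIA apt repos is usually the rotated CUDA repository signing key. A commonly used fix looks like the below, but please check that the repo path matches your DGX OS / Ubuntu release before running it.)

$ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
$ sudo apt update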

Thanks for this! My values are:

OS version

g@dgx:~$ cat /etc/dgx-release
DGX_NAME="DGX Station A100"
DGX_PRETTY_NAME="NVIDIA DGX Station A100"
DGX_SWBUILD_DATE="2021-02-04-09-33-25"
DGX_SWBUILD_VERSION="5.0.2"
DGX_COMMIT_ID="bcf2581"
DGX_PLATFORM="DGX Station A100"
DGX_SERIAL_NUMBER="1560422011502"

DGX_OTA_VERSION="5.4.2"
DGX_OTA_DATE="Fri 27 Jan 2023 03:51:24 PM GMT"

DGX_OTA_VERSION="5.4.2"
DGX_OTA_DATE="Mon 06 Mar 2023 12:02:17 PM GMT"

DGX_OTA_VERSION="5.4.2"
DGX_OTA_DATE="Tue 14 Mar 2023 02:26:48 PM GMT"

DGX_OTA_VERSION="5.5.0"
DGX_OTA_DATE="Tue 04 Apr 2023 10:35:33 AM BST"

DGX_OTA_VERSION="6.0.11"
DGX_OTA_DATE="Tue Jun  6 04:32:24 PM BST 2023"

DGX_OTA_VERSION="6.1.0"
DGX_OTA_DATE="Tue 22 Aug 11:14:55 BST 2023"
g@dgx:~$ 

Driver version

g@dgx:~$ nvidia-smi
Tue Aug 29 09:09:34 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                    0 |
| N/A   33C    P0    51W / 275W |      4MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:47:00.0 Off |                    0 |
| N/A   34C    P0    52W / 275W |      4MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:81:00.0 Off |                    0 |
| N/A   33C    P0    53W / 275W |      4MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA DGX Display  On   | 00000000:C1:00.0 Off |                  N/A |
| 34%   34C    P8    N/A /  50W |      5MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:C2:00.0 Off |                    0 |
| N/A   33C    P0    50W / 275W |      4MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      6213      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      6213      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      6213      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A      6213      G   /usr/lib/xorg/Xorg                  4MiB |
|    4   N/A  N/A      6213      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

The only difference in the setup (apart from using a DGX Station instead of a server) is in the /etc/containerd/config.toml file. (Should I get rid of the extra lines?)

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
      ignore_rdt_not_enabled_errors = false
      no_pivot = false
      snapshotter = "overlayfs"

I recall that I only added lines.
20230828_dgx-1_yaml.txt (7.6 KB)

Attaching my yaml for reference.
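One quick way to spot the differences is to diff the attached file against the live config (file name as attached above):

diff -u 20230828_dgx-1_yaml.txt /etc/containerd/config.toml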

The differences I see are (yours on the left):

ordering

runc options

I also opened a case in the DGX support forum!

full file
containerd_yaml.txt (7.3 KB)

OK, I did not do it exactly that way.

After checking, what I did is as below. See the "added" lines.

    [plugins."io.containerd.grpc.v1.cri".containerd]
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
      ignore_rdt_not_enabled_errors = false
      no_pivot = false
      snapshotter = "overlayfs"
      default_runtime_name = "nvidia"    # added

      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
        base_runtime_spec = ""
        cni_conf_dir = ""
        cni_max_conf_num = 0
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_path = ""
        runtime_root = ""
        runtime_type = ""

        [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]    # added
          privileged_without_host_devices = false    # added
          runtime_engine = ""    # added
          runtime_root = ""    # added
          runtime_type = "io.containerd.runc.v2"    # added
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]    # added
            BinaryName = "/usr/bin/nvidia-container-runtime"    # added

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          base_runtime_spec = ""
          cni_conf_dir = ""
          cni_max_conf_num = 0
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_path = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
Maybe the changes you didn't make are because your DGX OS is 5.3.1.

I am at 6.1.0

Not sure it's the cause of the issue, but in my eyes it just contributes to the fact that things are different. I will try to get things as similar as possible (within reason) and do another purge-and-install run for the gpu-operator.