Yes, I think so. You can leverage the setup.sh inside TAO 5.0. In the log, you can also find some helm commands to install nvidia-gpu-operator.
Hi,
I can uninstall and reinstall nvidia-gpu-operator with the commands below.
I tested on an A40 successfully. Please try on your side.
Uninstall:
$ helm delete -n nvidia-gpu-operator $(helm list -n nvidia-gpu-operator | grep nvidia-gpu-operator | awk '{print $1}')
Install:
$ helm show --version=v23.3.2 values nvidia/gpu-operator > /tmp/values.yaml
$ helm install --version 23.3.2 --values /tmp/values.yaml --create-namespace --namespace nvidia-gpu-operator --devel nvidia/gpu-operator --set driver.enabled=False --set driver.repository='nvcr.io/nvidia',driver.imagePullSecrets[0]=registry-secret,driver.licensingConfig.configMapName=licensing-config,driver.version='525.85.12' --wait --generate-name
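For what it's worth, a quick way to confirm the release and watch the pods settle after the install (standard Helm/kubectl, nothing operator-specific):
$ helm list -n nvidia-gpu-operator
$ kubectl get pods -n nvidia-gpu-operator -w
All pods should end up Running or Completed within a few minutes.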
Just wondering: shouldn't this have --set toolkit.enabled=false too if we are using a DGX (because both the driver and the NVIDIA Container Toolkit are already installed on DGXs)?
Adding --set toolkit.enabled=false on the command line results in bad pods, so I did not use it in my steps.
Interesting! Was nvidia-operator-validator-*** the "bad" pod (with Init:CrashLoopBackOff/Error)?
Not the same. They are the pods listed above.
Please run watch kubectl get pods -n nvidia-gpu-operator and see if the nvidia-operator-validator-7wgwv pod goes into an error state (with --set toolkit.enabled=false when installing the chart).
I got the error again even after resetting the cluster, which is very strange (maybe I didn't reset deeply enough, i.e. I didn't delete everything; perhaps kubeadm reset alone wasn't enough?).
I confirm that the pod goes into an error state with --set toolkit.enabled=false.
If run without --set toolkit.enabled=false, there is no error.
Thanks a lot for this! I can also confirm that dropping that flag solves my problem too, but I'm wondering whether it creates problems I have yet to discover, because that is not what the documentation says, given that the DGX already has the container toolkit installed (however, I'm running this helm command on the k8s master node, which is a normal Fujitsu CPU-only server).
Here is the long version:
helm uninstall -n gpu-operator $(helm list -n gpu-operator | grep gpu-operator | awk '{print $1}')
release "gpu-operator-1692955459" uninstalled
Delete the ClusterPolicy CRD:
kubectl delete crd clusterpolicies.nvidia.com
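Before reinstalling, a quick sanity check that the CRD and the old pods are really gone (plain kubectl, nothing specific to my setup):
$ kubectl get crd | grep -i nvidia
$ kubectl get pods -n gpu-operator
Both should come back empty (or NotFound) before running the install again.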
Install commands
get values
helm show --version=v23.3.2 values nvidia/gpu-operator > gpu_operator_values.yaml
helm command (run on control plane terminal)
helm install --version 23.3.2 --values gpu_operator_values.yaml --create-namespace --namespace gpu-operator --devel nvidia/gpu-operator --set driver.enabled=False --set driver.repository='nvcr.io/nvidia',driver.imagePullSecrets[0]=registry-secret,driver.licensingConfig.configMapName=licensing-config,driver.version='525.85.12' --wait --generate-name
Then, voilà, it worked!
g@gsrv:~$ kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
calico-system calico-kube-controllers-658996b7c6-9dcjr 1/1 Running 0 19h
calico-system calico-node-4kt6f 1/1 Running 0 19h
calico-system calico-node-rh2pg 1/1 Running 0 70m
calico-system calico-typha-767789dd9-kf9r6 1/1 Running 0 19h
calico-system csi-node-driver-2sncd 2/2 Running 0 70m
calico-system csi-node-driver-9cfk5 2/2 Running 0 19h
gpu-operator gpu-feature-discovery-d6qm2 1/1 Running 0 5m11s
gpu-operator gpu-operator-1692958626-node-feature-discovery-master-5867w7mts 1/1 Running 0 5m32s
gpu-operator gpu-operator-1692958626-node-feature-discovery-worker-hxtb2 1/1 Running 0 5m32s
gpu-operator gpu-operator-79766c58c4-wwbg5 1/1 Running 0 5m32s
gpu-operator nvidia-container-toolkit-daemonset-d64zs 1/1 Running 0 5m11s
gpu-operator nvidia-cuda-validator-w5qzg 0/1 Completed 0 4m33s
gpu-operator nvidia-dcgm-exporter-khnnw 1/1 Running 0 5m11s
gpu-operator nvidia-device-plugin-daemonset-rhwrd 1/1 Running 0 5m11s
gpu-operator nvidia-device-plugin-validator-n2rg7 0/1 Completed 0 3m7s
gpu-operator nvidia-mig-manager-mm2pr 1/1 Running 0 2m17s
gpu-operator nvidia-operator-validator-g65q4 1/1 Running 0 5m11s
kube-system coredns-57575c5f89-fp9z8 1/1 Running 0 19h
kube-system coredns-57575c5f89-ln2d6 1/1 Running 0 19h
kube-system etcd-gsrv 1/1 Running 0 19h
kube-system kube-apiserver-gsrv 1/1 Running 0 19h
kube-system kube-controller-manager-gsrv 1/1 Running 0 19h
kube-system kube-proxy-n7rpf 1/1 Running 0 19h
kube-system kube-proxy-p5v5f 1/1 Running 0 70m
kube-system kube-scheduler-gsrv 1/1 Running 0 19h
Note: a new nvidia-container-toolkit-daemonset-*** pod is created by the change and took about five and a half minutes to settle down and reach the Running state; it took a few minutes to get into a working state before as well, so no real change in time, IMO.
Please find the logs where all pods in the namespace are described:
logs_from_all_pods_in_gpu_operator_namespace_when_working.txt (61.3 KB)
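As an extra end-to-end sanity check that GPU scheduling actually works (a minimal sketch; the CUDA image tag is my assumption, any CUDA base image should do), something like this can be used:
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # tag is an assumption
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
$ kubectl logs gpu-smoke-test
$ kubectl delete pod gpu-smoke-test
If the logs show the usual nvidia-smi table, the device plugin and runtime are wired up correctly.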
And now this raises a question.
Can you please confirm that
--set toolkit.enabled=false
is not needed in my case? I only used it because it was in the official guidance,
which stresses that we need to use that particular flag!
And I'm sure it worked before (until, for some reason, it stopped working mysteriously, maybe after an update of some sort?).
My current question is: is this fine with the DGX now?
EDIT:
I think on the DGX I need the toolkit for the changes to /etc/containerd/config.toml to take effect, because I seem unable to run GPU jobs even with the gpu-operator installed without --set toolkit.enabled=false.
So I guess we need to fix the nvidia-operator-validator-*** issue; even reinstalling does not seem to fix things now. I'm not sure whether this was caused by an update or whether I've missed some critical bit.
Can you please try with a DGX Station A100 (where the driver and toolkit are already installed) instead of A40s? At this point, all I'm doing is installing the cluster and Calico, adding the DGX node, and trying to install the gpu-operator (which I've done quite a few times without an issue).
Pretty much I'm following this (my notes): simplified k8 install instructions | DL Docs (should be straightforward to follow and replicate).
My daemon.json and config.toml files (these are slightly customised for the local Docker repo used for the serverless function images):
daemon.json (228 Bytes)
config.toml (7.3 KB)
In the gpu-operator logs I can see the following portion on repeat:
{"level":"info","ts":1692994607.3503451,"logger":"controllers.ClusterPolicy","msg":"Sandbox workloads","Enabled":false,"DefaultWorkload":"container"}
{"level":"info","ts":1692994607.3505776,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"gsrv","GpuWorkloadConfig":"container"}
{"level":"info","ts":1692994607.3506613,"logger":"controllers.ClusterPolicy","msg":"GPU workload configuration","NodeName":"dgx","GpuWorkloadConfig":"container"}
{"level":"info","ts":1692994607.3507133,"logger":"controllers.ClusterPolicy","msg":"Checking GPU state labels on the node","NodeName":"dgx"}
{"level":"info","ts":1692994607.3507454,"logger":"controllers.ClusterPolicy","msg":"Number of nodes with GPU label","NodeCount":1}
{"level":"info","ts":1692994607.3509986,"logger":"controllers.ClusterPolicy","msg":"Using container runtime: containerd"}
{"level":"info","ts":1692994607.351055,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RuntimeClass":"nvidia"}
{"level":"info","ts":1692994607.374309,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"pre-requisites","status":"ready"}
{"level":"info","ts":1692994607.374411,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Service":"gpu-operator","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.3809211,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-operator-metrics","status":"ready"}
{"level":"info","ts":1692994607.4035978,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-driver","status":"disabled"}
{"level":"info","ts":1692994607.4213452,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-container-toolkit","status":"disabled"}
{"level":"info","ts":1692994607.4287944,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.4356089,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.4469094,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.459245,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.4696825,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-operator-validator","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.4755728,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-operator-validator","Namespace":"gpu-operator","name":"nvidia-operator-validator"}
{"level":"info","ts":1692994607.4756615,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-operator-validator"}
{"level":"info","ts":1692994607.4756808,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-operator-validation","status":"notReady"}
{"level":"info","ts":1692994607.4813743,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-device-plugin","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.4873807,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-device-plugin","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.4979827,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-device-plugin","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.509261,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-device-plugin","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.5200124,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-device-plugin","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.5315511,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"nvidia-device-plugin-entrypoint","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.5382147,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-device-plugin-daemonset","Namespace":"gpu-operator","name":"nvidia-device-plugin-daemonset"}
{"level":"info","ts":1692994607.5383413,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-device-plugin-daemonset"}
{"level":"info","ts":1692994607.5383847,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-device-plugin","status":"notReady"}
{"level":"info","ts":1692994607.5546553,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-dcgm","status":"disabled"}
{"level":"info","ts":1692994607.560827,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-dcgm-exporter","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.5678113,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-dcgm-exporter","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.5798917,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-dcgm-exporter","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.5850806,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Service":"nvidia-dcgm-exporter","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.592655,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-dcgm-exporter","Namespace":"gpu-operator","name":"nvidia-dcgm-exporter"}
{"level":"info","ts":1692994607.5927503,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"nvidia-dcgm-exporter"}
{"level":"info","ts":1692994607.5927699,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-dcgm-exporter","status":"notReady"}
{"level":"info","ts":1692994607.599265,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.6060112,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.6171765,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.628706,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.6393013,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-gpu-feature-discovery","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.645476,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"gpu-feature-discovery","Namespace":"gpu-operator","name":"gpu-feature-discovery"}
{"level":"info","ts":1692994607.6455667,"logger":"controllers.ClusterPolicy","msg":"daemonset not ready","name":"gpu-feature-discovery"}
{"level":"info","ts":1692994607.6455905,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"gpu-feature-discovery","status":"notReady"}
{"level":"info","ts":1692994607.6507008,"logger":"controllers.ClusterPolicy","msg":"Found Resource, skipping update","ServiceAccount":"nvidia-mig-manager","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.6554327,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","Role":"nvidia-mig-manager","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.663852,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRole":"nvidia-mig-manager","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.6720166,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","RoleBinding":"nvidia-mig-manager","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.6806715,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ClusterRoleBinding":"nvidia-mig-manager","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.6905217,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"default-mig-parted-config","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.6994987,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"default-gpu-clients","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.7091198,"logger":"controllers.ClusterPolicy","msg":"Found Resource, updating...","ConfigMap":"nvidia-mig-manager-entrypoint","Namespace":"gpu-operator"}
{"level":"info","ts":1692994607.7135487,"logger":"controllers.ClusterPolicy","msg":"DaemonSet identical, skipping update","DaemonSet":"nvidia-mig-manager","Namespace":"gpu-operator","name":"nvidia-mig-manager"}
{"level":"info","ts":1692994607.7136018,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-mig-manager","status":"ready"}
{"level":"info","ts":1692994607.731519,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-node-status-exporter","status":"disabled"}
{"level":"info","ts":1692994607.7502105,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vgpu-manager","status":"disabled"}
{"level":"info","ts":1692994607.766168,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vgpu-device-manager","status":"disabled"}
{"level":"info","ts":1692994607.7842896,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-validation","status":"disabled"}
{"level":"info","ts":1692994607.804841,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-vfio-manager","status":"disabled"}
{"level":"info","ts":1692994607.822285,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-sandbox-device-plugin","status":"disabled"}
{"level":"info","ts":1692994607.845774,"logger":"controllers.ClusterPolicy","msg":"Kata Manager disabled, deleting all Kata RuntimeClasses"}
{"level":"info","ts":1692994607.8458114,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-kata-manager","status":"disabled"}
{"level":"info","ts":1692994607.8648727,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy step completed","state:":"state-cc-manager","status":"disabled"}
{"level":"info","ts":1692994607.8649228,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy isn't ready","states not ready":["state-operator-validation","state-device-plugin","state-dcgm-exporter","gpu-feature-discovery"]}
full log
gpu-operator-9974dbcfc-6ftnf.txt (2.9 MB)
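In case it helps with digging further: to see which validation step keeps the nvidia-operator-validator daemonset notReady, describing the pod and pulling the init-container logs is usually the quickest route (the app=nvidia-operator-validator label and the toolkit-validation container name are from memory, so adjust them if your chart version differs):
$ kubectl -n gpu-operator get pods -l app=nvidia-operator-validator
$ kubectl -n gpu-operator describe pod -l app=nvidia-operator-validator
$ kubectl -n gpu-operator logs -l app=nvidia-operator-validator -c toolkit-validation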
Hi Ganindu,
For the issue with reinstalling the gpu-operator on a DGX Station A100, could you please create a topic in the DGX User Forum - NVIDIA Developer Forums, or use About the DGX User Forum / Note: this is not NVIDIA Enterprise Support to generate a ticket for enterprise support? Since the issue is related to the software stack on the DGX and is out of the scope of TAO, it is better for you to get help from the DGX team directly for better hints. You can also search or create topics on Issues · NVIDIA/gpu-operator · GitHub.
I leased a DGX-1 (with V100) and ran the uninstall and install successfully.
Steps:
$ helm ls -n nvidia-gpu-operator
$ helm uninstall --wait gpu-operator-1693131264 -n nvidia-gpu-operator
$ sudo vim /etc/containerd/config.toml (follow Getting Started - NVIDIA GPU Operator 23.9.0 documentation, section "Containerd")
$ sudo systemctl restart containerd
$ helm install --wait --generate-name -n nvidia-gpu-operator --create-namespace nvidia/gpu-operator --set driver.enabled=false --set toolkit.enabled=false
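After that settles, a couple of quick checks (plain kubectl; the nvidia RuntimeClass is created by the operator, as the logs earlier in this thread also show):
$ kubectl get runtimeclass nvidia
$ kubectl get pods -n nvidia-gpu-operator
Everything should be Running or Completed.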
That's great!! Are the DGX OS and the driver version up to the latest versions?
I leased a DGX-1 (with V100) machine and ran sudo apt update on it.
Below is the latest info.
local-morganh@nvc-dgx1-001:~$ cat /etc/dgx-release
DGX_NAME="DGX Server"
DGX_PRETTY_NAME="NVIDIA DGX Server"
DGX_SWBUILD_DATE="2022-06-06-13-55-03"
DGX_SWBUILD_VERSION="5.3.1"
DGX_COMMIT_ID=""
DGX_PLATFORM="DGX Server for DGX-1"
DGX_SERIAL_NUMBER="xxxxxxxxxx"
local-morganh@nvc-dgx1-001:~$ nvidia-smi
Sun Aug 27 18:51:22 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:06:00.0 Off | 0 |
| N/A 33C P0 43W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:07:00.0 Off | 0 |
| N/A 36C P0 44W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:0A:00.0 Off | 0 |
| N/A 34C P0 42W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:0B:00.0 Off | 0 |
| N/A 33C P0 43W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... On | 00000000:85:00.0 Off | 0 |
| N/A 33C P0 43W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000000:86:00.0 Off | 0 |
| N/A 36C P0 44W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000000:89:00.0 Off | 0 |
| N/A 39C P0 44W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 00000000:8A:00.0 Off | 0 |
| N/A 36C P0 44W / 300W | 0MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
BTW, I fixed a "GPG error" before sudo apt update. Then: $ sudo apt install nvidia-driver-525. Also, I installed TAO 5.0 via $ bash setup.sh install to set up k8s, etc.
Thanks for this! My values are:
OS version
g@dgx:~$ cat /etc/dgx-release
DGX_NAME="DGX Station A100"
DGX_PRETTY_NAME="NVIDIA DGX Station A100"
DGX_SWBUILD_DATE="2021-02-04-09-33-25"
DGX_SWBUILD_VERSION="5.0.2"
DGX_COMMIT_ID="bcf2581"
DGX_PLATFORM="DGX Station A100"
DGX_SERIAL_NUMBER="1560422011502"
DGX_OTA_VERSION="5.4.2"
DGX_OTA_DATE="Fri 27 Jan 2023 03:51:24 PM GMT"
DGX_OTA_VERSION="5.4.2"
DGX_OTA_DATE="Mon 06 Mar 2023 12:02:17 PM GMT"
DGX_OTA_VERSION="5.4.2"
DGX_OTA_DATE="Tue 14 Mar 2023 02:26:48 PM GMT"
DGX_OTA_VERSION="5.5.0"
DGX_OTA_DATE="Tue 04 Apr 2023 10:35:33 AM BST"
DGX_OTA_VERSION="6.0.11"
DGX_OTA_DATE="Tue Jun 6 04:32:24 PM BST 2023"
DGX_OTA_VERSION="6.1.0"
DGX_OTA_DATE="Tue 22 Aug 11:14:55 BST 2023"
g@dgx:~$
Driver version
g@dgx:~$ nvidia-smi
Tue Aug 29 09:09:34 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:01:00.0 Off | 0 |
| N/A 33C P0 51W / 275W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:47:00.0 Off | 0 |
| N/A 34C P0 52W / 275W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:81:00.0 Off | 0 |
| N/A 33C P0 53W / 275W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA DGX Display On | 00000000:C1:00.0 Off | N/A |
| 34% 34C P8 N/A / 50W | 5MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:C2:00.0 Off | 0 |
| N/A 33C P0 50W / 275W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 6213 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 6213 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 6213 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 6213 G /usr/lib/xorg/Xorg 4MiB |
| 4 N/A N/A 6213 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
The only difference in the setup (apart from using a DGX Station instead of a server) is in the /etc/containerd/config.toml file (should I get rid of the extra lines?):
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
  disable_snapshot_annotations = true
  discard_unpacked_layers = false
  ignore_rdt_not_enabled_errors = false
  no_pivot = false
  snapshotter = "overlayfs"
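Not sure if it matters, but to see what containerd actually resolves after merging this (assuming the containerd version here has the config dump subcommand), I can check with something like:
$ sudo containerd config dump | grep -A 3 'default_runtime_name'
$ sudo containerd config dump | grep -A 8 'runtimes.nvidia'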
I recall that I only added lines.
20230828_dgx-1_yaml.txt (7.6 KB)
Attaching my YAML for reference.
The differences I see are (yours on the left):
ordering
runc options
I also made a case on the DGX support forum!
full file
containerd_yaml.txt (7.3 KB)
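For reference, the exact differences can be listed with a plain unified diff of the two attached files (filenames as attached in this thread):
$ diff -u 20230828_dgx-1_yaml.txt containerd_yaml.txt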
OK.
I did not do that.
After checking, what I did is as below; see the "added" lines.
[plugins."io.containerd.grpc.v1.cri".containerd]
  disable_snapshot_annotations = true
  discard_unpacked_layers = false
  ignore_rdt_not_enabled_errors = false
  no_pivot = false
  snapshotter = "overlayfs"
  default_runtime_name = "nvidia"   # added

  [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
    base_runtime_spec = ""
    cni_conf_dir = ""
    cni_max_conf_num = 0
    container_annotations = []
    pod_annotations = []
    privileged_without_host_devices = false
    runtime_engine = ""
    runtime_path = ""
    runtime_root = ""
    runtime_type = ""

  [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]   # added
      privileged_without_host_devices = false   # added
      runtime_engine = ""   # added
      runtime_root = ""   # added
      runtime_type = "io.containerd.runc.v2"   # added
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]   # added
        BinaryName = "/usr/bin/nvidia-container-runtime"   # added

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      base_runtime_spec = ""
      cni_conf_dir = ""
      cni_max_conf_num = 0
      container_annotations = []
      pod_annotations = []
      privileged_without_host_devices = false
      runtime_engine = ""
      runtime_path = ""
      runtime_root = ""
      runtime_type = "io.containerd.runc.v2"
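After making edits like these, restarting containerd and checking what the CRI side reports should confirm the nvidia runtime is registered (crictl being available is an assumption on my side, though it is usually present on kubeadm nodes):
$ sudo systemctl restart containerd
$ sudo crictl info | grep -i nvidia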
Maybe you didn't have to make those changes because your DGX OS is 5.3.1;
I am at 6.1.0.
I'm not sure it's the cause of the issue, but in my eyes it just adds to the fact that things are different. I will try to get things as similar as possible (within reason) and do another purge-and-install run of the gpu-operator.