This one: nvidia-driver-daemonset-gwmlj
Sorry for the delay
kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-gwmlj
Error from server (BadRequest): container "nvidia-driver-ctr" in pod "nvidia-driver-daemonset-gwmlj" is waiting to start: PodInitializing
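A couple of ways to dig into why that pod is stuck in PodInitializing (pod name and namespace taken from the log above; this is a sketch, not a prescribed fix):

```shell
# Show events and per-container state for the stuck driver pod
kubectl describe pod -n nvidia-gpu-operator nvidia-driver-daemonset-gwmlj

# Read logs from every container in the pod, including init containers,
# instead of only nvidia-driver-ctr which hasn't started yet
kubectl logs -n nvidia-gpu-operator nvidia-driver-daemonset-gwmlj \
  --all-containers --prefix
```

The Events section at the bottom of `describe` usually names the init container or image pull that is blocking startup.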
Should I run that in the other terminal while TASK [Waiting for the Cluster to become available] is still running, or send SIGINT to the playbook?
Yes, run it in the other terminal and keep the playbook as is. Do not Ctrl+C it.
kubectl delete crd clusterpolicies.nvidia.com
customresourcedefinition.apiextensions.k8s.io "clusterpolicies.nvidia.com" deleted
That seems to have changed something… hmm…
Then monitor the original terminal to check whether it is still stuck.
Yeah, it's doing stuff now!
TASK [Waiting for the Cluster to become available] ****************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Validate kubernetes cluster is up] **************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Operating System Version] *****************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Docker Version] ***************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Containerd Version] ***********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Kubernetes Version] ***********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Helm Version] *****************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia GPU Operator Toolkit versions] *****************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia K8s Device versions] ***************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check GPU Operator Nvidia Container Driver versions] ********************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check GPU Operator Nvidia DGCM Versions] ********************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check GPU Operator Node Feature Discovery Versions] *********************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check GPU Operator GPU Feature Discovery Versions] **********************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia GPU Operator versions] *************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia MIG Maanger versions] **************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia validator versions] ****************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check Nvidia DCGM Exporter versions] ************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check NVIDIA Driver Version] ********************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check NVIDIA Container ToolKit Version] *********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Mellanox Network Operator version] ********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Mellanox MOFED Driver Version] ************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check RDMA Shared Device Plugin Version] ********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check SRIOV Device Plugin Version] **************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Container Networking Plugins Version] *****************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Multus Version] ***************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Check Whereabouts Version] **********************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [check master node is up and running] ************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Check all pods are running for Kubernetes] ******************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [validate helm installed] ************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Collecting Number of GPU's] *********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Create NVIDIA-SMI yaml] *************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Operating System Version of RHEL/CentOS] *************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Operating System Version of Ubuntu] ******************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Ubuntu Operating System version 20.04.5 LTS (Focal Fossa)"
}
TASK [Report Docker Version] **************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Containerd Version] **********************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": " Containerd Version v1.6.2"
}
TASK [Report Kubernetes Version] **********************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Kubernetes Version v1.23.5"
}
TASK [Report Helm Version] ****************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Helm Version 3.8.1"
}
TASK [Report Nvidia GPU Operator version] *************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia GPU Operator versions v1.10.1"
}
TASK [Report Nvidia Container Driver Version] *********************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia Container Driver Version "
}
TASK [Report GPU Operator NV Toolkit Driver] **********************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "NV Container Toolkit Version "
}
TASK [Report Nvidia Container Driver Version] *********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report GPU Operator NV Toolkit Driver] **********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report K8sDevice Plugin Version] ****************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia K8s Device Plugin Version "
}
TASK [Report Data Center GPU Manager (DCGM) Version] **************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Data Center GPU Manager (DCGM) Version "
}
TASK [Report Node Feature Discovery Version] **********************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Node Feature Discovery Version v0.10.1"
}
TASK [Report GPU Feature Discovery Version] ***********************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "GPU Feature Discovery Version "
}
TASK [Report Nvidia validator version] ****************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia validator version "
}
TASK [Report Nvidia DCGM Exporter version] ************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia DCGM Exporter version "
}
TASK [Report Nvidia MIG Maanger version] **************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "Nvidia MIG Maanger version "
}
TASK [Componenets Matrix Versions Vs Installed Versions] **********************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Report Versions] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": [
"===========================================================================================",
" Components Matrix Version || Installed Version ",
"===========================================================================================",
"GPU Operator Version v1.10.1 || v1.10.1",
"Nvidia Container Driver Version 510.47.03 || ",
"GPU Operator NV Toolkit Driver v1.9.0 || ",
"K8sDevice Plugin Version v0.11.0 || ",
"Data Center GPU Manager(DCGM) Version 2.3.4-2.6.4 || ",
"Node Feature Discovery Version v0.10.1 || v0.10.1",
"GPU Feature Discovery Version v0.5.0 || ",
"Nvidia validator version v1.10.1 || ",
"Nvidia MIG Manager version 0.3.0 || ",
"",
"Note: NVIDIA Mig Manager is valid for only Amphere GPU's like A100, A30",
"",
"Please validate between Matrix Version and Installed Version listed above"
]
}
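The blank "Installed Version" entries above usually mean the corresponding operator pods never reached Running, so the version probe returned nothing. A quick way to see what was actually deployed (namespace taken from the log; commands are a sketch):

```shell
# List everything the GPU Operator deployed and its current state
kubectl get pods -n nvidia-gpu-operator -o wide

# The driver, toolkit, device-plugin, DCGM, etc. are daemonsets; check
# whether any of them have zero ready replicas
kubectl get daemonsets -n nvidia-gpu-operator
```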
TASK [Componenets Matrix Versions Vs Installed Versions] **********************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Versions] ********************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Componenets Matrix Versions Vs Installed Versions] **********************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Versions] ********************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Validate the GPU Operator pods State] ***********************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Report GPU Operator Pods] ***********************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": [
"nvidia-gpu-operator gpu-operator-1678981498-node-feature-discovery-master-79ddznwhj 1/1 Running 0 5m35s",
"nvidia-gpu-operator gpu-operator-1678981498-node-feature-discovery-worker-chxdx 1/1 Running 4 (5m17s ago) 43m",
"nvidia-gpu-operator gpu-operator-7bfc5f55-ghhvs 1/1 Running 0 5m35s"
]
}
TASK [Validate GPU Operator Version for Cloud Native Core 6.2 and 7.0] ********************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Validate GPU Operator Version for Cloud Native Core 6.1] ****************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [Validating the nvidia-smi on NVIDIA Cloud Native Core] ******************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Validating the nvidia-smi on NVIDIA Cloud Native Core] ******************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Validating the nvidia-smi on NVIDIA Cloud Native Core] ******************************************************************************************************************************************************************************************************
ASYNC FAILED on 172.16.3.2: jid=184764697792.434153
fatal: [172.16.3.2]: FAILED! => {"ansible_job_id": "184764697792.434153", "changed": true, "cmd": ["kubectl", "run", "gpu-test", "--rm", "-t", "-i", "--restart=Never", "--image=nvidia/cuda:11.6.0-base-ubuntu20.04", "--limits=nvidia.com/gpu=", "--", "nvidia-smi"], "delta": "0:00:11.738078", "end": "2023-03-16 16:29:29.685476", "finished": 1, "msg": "non-zero return code", "rc": 128, "results_file": "/home/g/.ansible_async/184764697792.434153", "start": "2023-03-16 16:29:17.947398", "started": 1, "stderr": "Flag --limits has been deprecated, has no effect and will be removed in 1.24.\npod default/gpu-test terminated (StartError)\nfailed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: \"nvidia-smi\": executable file not found in $PATH: unknown", "stderr_lines": ["Flag --limits has been deprecated, has no effect and will be removed in 1.24.", "pod default/gpu-test terminated (StartError)", "failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: \"nvidia-smi\": executable file not found in $PATH: unknown"], "stdout": "pod \"gpu-test\" deleted", "stdout_lines": ["pod \"gpu-test\" deleted"]}
...ignoring
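The stderr here is telling: `nvidia-smi: executable file not found in $PATH` means the NVIDIA container runtime never injected the driver into the pod, which is consistent with the driver daemonset being stuck in PodInitializing. Separately, the playbook's `--limits=nvidia.com/gpu=` passes an empty GPU count and the flag is deprecated. A hedged manual re-run that requests one GPU via `--overrides` instead (it will still fail the same way until the driver pod is healthy):

```shell
# Re-run the validation pod by hand, requesting the GPU through an explicit
# pod-spec override rather than the deprecated --limits flag
kubectl run gpu-test --rm -ti --restart=Never \
  --image=nvidia/cuda:11.6.0-base-ubuntu20.04 \
  --overrides='{"spec":{"containers":[{"name":"gpu-test","image":"nvidia/cuda:11.6.0-base-ubuntu20.04","args":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'
```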
TASK [Report Nvidia SMI Validation] *******************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": [
"pod \"gpu-test\" deleted"
]
}
TASK [Validating the CUDA with GPU] *******************************************************************************************************************************************************************************************************************************
ASYNC POLL on 172.16.3.2: jid=754237448857.435110 started=1 finished=0
ASYNC POLL on 172.16.3.2: jid=754237448857.435110 started=1 finished=0
ASYNC POLL on 172.16.3.2: jid=754237448857.435110 started=1 finished=0
ASYNC FAILED on 172.16.3.2: jid=754237448857.435110
fatal: [172.16.3.2]: FAILED! => {"ansible_job_id": "754237448857.435110", "changed": true, "cmd": "kubectl run cuda-vector-add --rm -t -i --restart=Never --image=k8s.gcr.io/cuda-vector-add:v0.1", "delta": "0:01:00.060847", "end": "2023-03-16 16:30:33.716430", "finished": 1, "msg": "non-zero return code", "rc": 1, "results_file": "/home/g/.ansible_async/754237448857.435110", "start": "2023-03-16 16:29:33.655583", "started": 1, "stderr": "error: timed out waiting for the condition", "stderr_lines": ["error: timed out waiting for the condition"], "stdout": "pod \"cuda-vector-add\" deleted", "stdout_lines": ["pod \"cuda-vector-add\" deleted"]}
...ignoring
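The vector-add pod timing out ("timed out waiting for the condition") fits the same picture: if no node advertises an allocatable `nvidia.com/gpu` resource, the pod can never be scheduled. A quick check (a sketch; the jsonpath just prints each node's allocatable GPU count, blank if none):

```shell
# Print node name and allocatable nvidia.com/gpu count per node;
# a blank second column means the device plugin never registered GPUs
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```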
TASK [Validating the nvidia-smi on NVIDIA Cloud Native Core 7.0] **************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Nvidia SMI Validation] *******************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Cuda Validation] *************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": [
"pod \"cuda-vector-add\" deleted"
]
}
TASK [Report Network Operator version] ****************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Mellanox MOFED Driver Version] ***********************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report RDMA Shared Device Plugin Version] *******************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report SRIOV Device Plugin Version] *************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Container Networking Plugin Version] *****************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Multus Version] **************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Report Whereabouts Version] *********************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [Status Check] ***********************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [debug] ******************************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2] => {
"msg": "All tasks should be changed or ok, if it's failed or ignoring means that validation task failed."
}
PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************
172.16.3.2 : ok=47 changed=27 unreachable=0 failed=0 skipped=30 rescued=0 ignored=2
PLAY [all] ********************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [install nfs-common] *****************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
PLAY [master[0]] **************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [install nginx ingress controller] ***************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx --force-update)
changed: [172.16.3.2] => (item=helm repo update)
changed: [172.16.3.2] => (item=helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx --set controller.service.type=NodePort --set controller.service.nodePorts.http=32080 --set controller.service.nodePorts.https=32443)
TASK [install nfs-kernel-server] **********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [create export directory] ************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [create export config] ***************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [export filesystem] ******************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [restart nfs-kernel-server] **********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [install storage provisioner] ********************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/ --force-update)
changed: [172.16.3.2] => (item=helm repo update)
changed: [172.16.3.2] => (item=helm upgrade --install --reset-values --cleanup-on-fail --create-namespace --atomic --wait nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner --set nfs.server=172.16.3.2 --set nfs.path=/mnt/nfs_share)
TASK [setup imagepullsecret] **************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=kubectl delete secret 'imagepullsecret' --ignore-not-found)
changed: [172.16.3.2] => (item=kubectl create secret docker-registry 'imagepullsecret' --docker-server='nvcr.io' --docker-username='$oauthtoken' --docker-password=a3NvZ2xvcnUxOWNpMjcxM201YzdnMjdtN3Y6ZTZiZDIzNWItNTc3Mi00OTY3LWI3YTQtMmFiYzIzMDNjMjEx --docker-email=ganindu@gmail.com --namespace='default')
TASK [capture node names] *****************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [setup nvidia-smi on each node] ******************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=dgx)
TASK [label nodes with accelerator] *******************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2] => (item=dgx)
TASK [copy tao-toolkit-api helm values to master] *****************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [install tao-toolkit-api] ************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
PLAY [all] ********************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [capture cluster ips] ****************************************************************************************************************************************************************************************************************************************
ok: [172.16.3.2 -> localhost]
TASK [store cluster ips] ******************************************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [copy ~/.kube/config to /tmp/kube-config] ********************************************************************************************************************************************************************************************************************
changed: [172.16.3.2]
TASK [ensure api endpoint uses publicly accessible ip] ************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [generify context name] **************************************************************************************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [replace generic cluster user and context name in kubeconfig with provided cluster name] *********************************************************************************************************************************************************************
skipping: [172.16.3.2]
TASK [fetch file /tmp/cluster_ips from remote to local] ***********************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
TASK [fetch /tmp/kube-config from remote to local] ****************************************************************************************************************************************************************************************************************
ok: [172.16.3.2]
PLAY [localhost] **************************************************************************************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [register cluster ips] ***************************************************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [add an kubernetes apt signing key for ubuntu] ***************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [adding kubernetes apt repository for ubuntu] ****************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [install kubectl] ********************************************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [check if helm is installed] *********************************************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [install helm on ubuntu 18.04] *******************************************************************************************************************************************************************************************************************************
skipping: [localhost] => (item=curl -O https://get.helm.sh/helm-v3.3.3-linux-amd64.tar.gz)
skipping: [localhost] => (item=tar -xvzf helm-v3.3.3-linux-amd64.tar.gz)
skipping: [localhost] => (item=cp linux-amd64/helm /usr/local/bin/)
skipping: [localhost] => (item=chmod 755 /usr/local/bin/helm)
skipping: [localhost] => (item=rm -rf helm-v3.3.3-linux-amd64.tar.gz linux-amd64)
TASK [install helm on ubuntu 20.04] *******************************************************************************************************************************************************************************************************************************
skipping: [localhost] => (item=curl -O https://get.helm.sh/helm-v3.8.1-linux-amd64.tar.gz)
skipping: [localhost] => (item=tar -xvzf helm-v3.8.1-linux-amd64.tar.gz)
skipping: [localhost] => (item=cp linux-amd64/helm /usr/local/bin/)
skipping: [localhost] => (item=chmod 755 /usr/local/bin/helm)
skipping: [localhost] => (item=rm -rf helm-v3.8.1-linux-amd64.tar.gz linux-amd64)
TASK [create kube directory] **************************************************************************************************************************************************************************************************************************************
ok: [localhost]
TASK [ensure kubeconfig file exists] ******************************************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [merge kubeconfig to existing] *******************************************************************************************************************************************************************************************************************************
changed: [localhost]
TASK [make merged-kubeconfig default] *****************************************************************************************************************************************************************************************************************************
changed: [localhost]
PLAY RECAP ********************************************************************************************************************************************************************************************************************************************************
172.16.3.2 : ok=22 changed=15 unreachable=0 failed=0 skipped=3 rescued=0 ignored=0
localhost : ok=10 changed=5 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0
(K8PY) g@dgx:~/Workspace/sandbox/TAO/getting_started_v4.0.0/setup/quickstart_api_bare_metal$
I think the block is gone now!! Great stuff (can I now register this with my kube master on the CPU server?)
Great. It is working now.
Then, for getting started, you can download the notebooks (see Remote Client - NVIDIA Docs) or refer to the blog https://developer.nvidia.com/blog/training-like-an-ai-pro-using-tao-automl/
Thanks a lot! I think the trick was to effectively delete the CRD thing.
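For anyone landing here later, here is a minimal sketch of that recovery step, assuming the playbook is stuck waiting on the cluster and the gpu-operator's ClusterPolicy CRD is stale. The `kubectl` calls need a live cluster, so by default the script only echoes each command; set `RUN=1` to actually execute them:

```shell
#!/bin/sh
# Sketch: remove the stale ClusterPolicy CRD blocking the gpu-operator.
# RUN=0 (default) only prints each command; RUN=1 executes on a real cluster.
RUN=${RUN:-0}
do_cmd() { if [ "$RUN" = 1 ]; then "$@"; else echo "+ $*"; fi; }

CRD=clusterpolicies.nvidia.com
do_cmd kubectl get crd "$CRD"     # confirm the stale CRD is present
do_cmd kubectl delete crd "$CRD"  # the operator recreates it afterwards
```

After deleting the CRD, watch the original terminal: the stuck `TASK [Waiting for the Cluster to become available]` should proceed.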
However, I still don’t have nvidia-smi, and nvsm show health still reports Unhealthy.
We want to be able to use the DGX as we used to as well (is that a deal breaker? If so, we might sadly have to revert all this).
Officially, the current method installs the TAO API with the GPU driver uninstalled.
That said, there are also ways to install the TAO API on non-bare-metal machines, i.e. using tao-api for users who do not want to uninstall the GPU driver and other software.
We suggest installing k8s first and then deploying the TAO API. It is a bit involved and not yet mentioned in the user guide. Users can first follow this guide to install a k8s cluster with GPU support: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html#option-2-installing-kubernetes-using-kubeadm . Then they can follow https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_api/api_deployment.html to install tao-toolkit-api.
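The two-step route above can be sketched roughly as follows. This is an assumption-laden outline, not the official procedure: the Calico manifest URL, the chart URL/version, and the `values.yaml` filename are all placeholders to adapt from the linked docs. The commands need a real Ubuntu host, so by default they are only echoed; set `RUN=1` to execute:

```shell
#!/bin/sh
# Sketch of the two-step route: kubeadm cluster first, then the TAO API chart.
# RUN=0 (default) only prints each command; RUN=1 executes for real.
RUN=${RUN:-0}
do_cmd() { if [ "$RUN" = 1 ]; then "$@"; else echo "+ $*"; fi; }

# 1. Bootstrap the control plane (the GPU driver stays installed).
do_cmd sudo kubeadm init --pod-network-cidr=192.168.0.0/16

# 2. Install a pod network, e.g. Calico (manifest URL is an assumption).
do_cmd kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml

# 3. Deploy the TAO Toolkit API chart (chart URL/version and values file
#    are assumptions; see the API deployment guide for the real ones).
do_cmd helm install tao-toolkit-api https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.0.tgz --namespace default -f values.yaml
```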
Thanks a lot! I will accept the solution and get back to you as soon as possible! Thanks a lot for the patience!
Thanks Vlad!! great stuff!!
Excuse me. After I ran the command, the script did run the steps above successfully.
However, I found that the nvidia-gpu-operator pods dropped to 3, and one pod shows the error message:
1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.ClusterPolicy: failed to list *v1.ClusterPolicy: the server could not find the requested resource (get clusterpolicies.nvidia.com)
What should I do to deal with this problem?
I’m not sure where you stand, but I think we deleted the cluster policy while in a stuck state, so it (the process that was watching) detected the change and fixed the issue?
Btw, I managed to get method 1 working while not sacrificing the GPU driver or nvsm health :D
I had to define/declare a persistent volume and manually patch the PV to the PVC (persistent volume claim). Here is a very abstract dump of information, but if anyone wants a cleaner guide, let me know.
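In case it helps anyone, the PV/PVC workaround described above can be sketched roughly like this. All names here are hypothetical (check the actual PVC name the chart creates with `kubectl get pvc`): a hostPath PersistentVolume pre-bound to the chart's claim via `claimRef`, so the claim binds to this exact volume instead of waiting for a dynamic provisioner.

```yaml
# Hypothetical names throughout; adjust to your cluster.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: tao-api-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /mnt/tao-api-data
  claimRef:                  # pre-bind to the PVC the chart creates
    namespace: default
    name: tao-toolkit-api-pvc
```

Apply it with `kubectl apply -f pv.yaml`; alternatively, patch the existing PVC to point at the PV: `kubectl patch pvc tao-toolkit-api-pvc -p '{"spec":{"volumeName":"tao-api-pv"}}'` (again, the PVC name is an assumption).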
Thanks a lot @Morganh and @vkhomyakov.
Cheers,
Ganindu.
Excuse me. Did you meet the problem that kubelet is always auto-restarting before running the command sudo kubeadm init --pod-network-cidr=192.168.0.0/16?
What should I do to get kubelet running? The picture below shows the contents of journalctl -xefu kubelet
And the picture below shows the contents of journalctl -u kubelet
Thank you for your help in advance.
Hey swaka,
I don’t recall this happening to me, sorry!
I think you might need to explain the context of the problem a little bit more (otherwise it is a bit confusing to even begin unravelling it). If your setup and topology are quite different from those in this question, I recommend opening a new thread (because this one is already classified as solved) and laying out the problem in a way that gives others a chance to replicate it outside your system (this process will actually solve a large proportion of problems, as explaining something to someone else forces you to explain it to yourself). Finally, if you clearly lay out the problem and still think the people involved in this thread may be helpful, please feel free to tag us (I don’t mind (: )
Cheers,
Ganindu.
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.