Hi Team,
Below is the error I hit while setting up K8s in BCM:
I 12-Apr-2023 13:55:03 | cmsetup.engine.cmlog | ## Progress: 38
I 12-Apr-2023 13:55:03 | cmsetup.engine.cmlog | #### stage: kubernetes: Check Local Path Provisioner Version
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess | Executing command:
'/cm/local/apps/kubernetes/current/bin/helm show chart /cm/shared/apps/kubernetes-local-path-provisioner/current/helm/cm-kubernetes-local-path-provisioner-*.tgz'
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess | apiVersion: v2
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess | appVersion: 0.0.23
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess | description: A Helm chart for local path provisioner
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess | name: cm-kubernetes-local-path-provisioner
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess | type: application
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess | version: 0.0.23
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess |
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess | Command '/cm/local/apps/kubernetes/current/bin/helm show chart /cm/shared/apps/kubernetes-local-path-provisioner/current/helm/cm-kubernetes-local-path-provisioner-*.tgz' exit code: EX_OK<0(0x00)>
D 12-Apr-2023 13:55:03 | cmsetup.engine.cmsetup | Disconnect from cluster.
I 12-Apr-2023 13:55:03 | pythoncm.entity_change | Stop event change watcher
I 12-Apr-2023 13:55:03 | cmsetup.engine.cmsetup |
I 12-Apr-2023 13:55:03 | cmsetup.engine.cmsetup | Took: 00:07 min.
I 12-Apr-2023 13:55:03 | cmsetup.engine.cmsetup | Progress: 38/100
I 12-Apr-2023 13:55:03 | cmsetup.engine.cmsetup | ################### Finished execution for 'Kubernetes Setup', status: Failed (aborted)
I 12-Apr-2023 13:55:03 | cmsetup.engine.cmsetup |
I 12-Apr-2023 13:55:03 | cmsetup.engine.cmsetup |
E 12-Apr-2023 13:55:03 | cmsetup.engine.main | Exception: 'Version of the Local Path Provosioner 0.0.23 is too new for cluster-tools. Latest supported is 0.0.20. Please update.'
E 12-Apr-2023 13:55:03 | cmsetup.engine.main | Version of the Local Path Provosioner 0.0.23 is too new for cluster-tools. Latest supported is 0.0.20. Please update.
I 12-Apr-2023 13:55:03 | cmsetup.engine.main |
I 12-Apr-2023 13:55:03 | cmsetup.engine.main | See the /var/log/cm-setup.log for details.
I 12-Apr-2023 13:55:03 | cmsetup.engine.main |
D 12-Apr-2023 13:55:03 | cmsetup.engine.cmlog |
D 12-Apr-2023 13:55:03 | cmsetup.engine.cmlog | ################################ END ###########################################
D 12-Apr-2023 13:55:03 | cmsetup.engine.cmlog |
I managed to fix the above, but the cmd service on my compute nodes is failing after reboot:
[root@node001 ~]# systemctl status cmd.service
● cmd.service - Bright Computing Cluster Manager daemon
Loaded: loaded (/usr/lib/systemd/system/cmd.service; disabled; vendor preset: disabled)
Active: inactive (dead)
[root@node001 ~]# systemctl restart cmd.service
Job for cmd.service failed because the control process exited with error code.
See "systemctl status cmd.service" and "journalctl -xe" for details.
Apr 12 14:56:40 node001 safe_cmd[9130]: + logger -s -p daemon.crit -t CMDaemon 'CMDaemon indicates it could not start (most likely due to missing library)'
Apr 12 14:56:40 node001 CMDaemon[9157]: CMDaemon indicates it could not start (most likely due to missing library)
Apr 12 14:56:40 node001 safe_cmd[9157]: <26>Apr 12 14:56:40 CMDaemon: CMDaemon indicates it could not start (most likely due to missing library)
Apr 12 14:56:40 node001 safe_cmd[9130]: + sendFailedToStartWarningMail
Apr 12 14:56:40 node001 safe_cmd[9158]: + cat
Apr 12 14:56:40 node001 safe_cmd[9160]: ++ hostname
Apr 12 14:56:40 node001 safe_cmd[9161]: ++ hostname
Apr 12 14:56:40 node001 safe_cmd[9159]: + mail -s 'CMDaemon was unable to start on node001!' root@master
Apr 12 14:56:40 node001 safe_cmd[9164]: ++ ls -hal '/var/spool/cmd/core.*'
Apr 12 14:56:40 node001 safe_cmd[9166]: ++ ls -hal '/var/lib/systemd/coredump/cmd.core.*'
Apr 12 14:56:40 node001 safe_cmd[9130]: + runHook failed_to_start_warning_hook
Apr 12 14:56:40 node001 safe_cmd[9130]: + filename=/cm/local/apps/cmd/sbin/failed_to_start_warning_hook
Apr 12 14:56:40 node001 safe_cmd[9130]: + '[' -x /cm/local/apps/cmd/sbin/failed_to_start_warning_hook ']'
Apr 12 14:56:40 node001 safe_cmd[9130]: + break
Apr 12 14:56:40 node001 safe_cmd[9130]: + '[' -s /var/spool/cmd/cmd.output.bYrz7xTIRv ']'
Apr 12 14:56:40 node001 safe_cmd[9169]: + echo 'CMDaemon stdout (note that timestamps could be incorrect):'
Apr 12 14:56:40 node001 safe_cmd[9170]: + logger -s -p daemon.crit -t CMDaemon
Apr 12 14:56:40 node001 safe_cmd[9169]: + cat /var/spool/cmd/cmd.output.bYrz7xTIRv
Apr 12 14:56:40 node001 CMDaemon[9170]: CMDaemon stdout (note that timestamps could be incorrect):
Apr 12 14:56:40 node001 safe_cmd[9170]: <26>Apr 12 14:56:40 CMDaemon: CMDaemon stdout (note that timestamps could be incorrect):
Apr 12 14:56:40 node001 safe_cmd[9170]: <26>Apr 12 14:56:40 CMDaemon: /cm/local/apps/cmd/sbin/cmd: error while loading shared libraries: libdcgm.so.2: cannot open shared object file: No such file or directory
Apr 12 14:56:40 node001 CMDaemon[9170]: /cm/local/apps/cmd/sbin/cmd: error while loading shared libraries: libdcgm.so.2: cannot open shared object file: No such file or directory
Apr 12 14:56:40 node001 safe_cmd[9130]: + rm -f /var/spool/cmd/cmd.output.bYrz7xTIRv
Apr 12 14:56:40 node001 safe_cmd[9130]: + exit 127
Apr 12 14:56:40 node001 systemd[1]: cmd.service: Main process exited, code=exited, status=127/n/a
Apr 12 14:56:41 node001 wait_cmd[9131]: + let retries=55-1
Apr 12 14:56:41 node001 wait_cmd[9131]: + '[' 54 -gt 0 ']'
Apr 12 14:56:41 node001 wait_cmd[9131]: + '[' -z 9130 -o '!' -d /proc/9130 ']'
Apr 12 14:56:41 node001 wait_cmd[9131]: + exit 1
Apr 12 14:56:41 node001 systemd[1]: cmd.service: Control process exited, code=exited status=1
Apr 12 14:56:41 node001 systemd[1]: cmd.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- The unit cmd.service has entered the 'failed' state with result 'exit-code'.
Apr 12 14:56:41 node001 systemd[1]: Failed to start Bright Computing Cluster Manager daemon.
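For reference, a generic way to confirm which shared libraries the cmd binary cannot resolve (binary path taken from the log above):

# Ask the dynamic linker for unresolved libraries; on this node it
# reports: libdcgm.so.2 => not found
ldd /cm/local/apps/cmd/sbin/cmd | grep 'not found'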
Here is the Kubernetes setup output from the next attempt, which fails further along:
## Progress: 59
#### stage: kubernetes: Apply Sysctl Config
## Progress: 60
#### stage: kubernetes: Collection Nodes Reboot
node001: reboot requested
node002: reboot requested
node003: reboot requested
Press ctrl+c to abort waiting and continue with deployment
Waiting for nodes to start reboot
Going to wait up to 15 minutes for the nodes to come back up.
Your nodes have already been rebooting for a significant amount of time (half of the total timeout time). This might be expected due to relatively slow network/hardware or due to large software images. Please make sure the nodes are continuing the provisioning process properly. If they are, ignore this message, sit back, and wait for the process to finish.
Not all of the nodes came back up after the reboot (node001, node002, node003). However, none of those were mandatory for the setup process . Waited for 15 minutes, not waiting any longer, setup process will proceed.
## Progress: 61
#### stage: kubernetes: Disable Swap On Compute Nodes
## Progress: 62
#### stage: kubernetes: Create Exclude List For Gpu Operator
## Progress: 63
#### stage: kubernetes: Create Configuration Overlay
Creating configuration overlay kube-default-etcd
Adding nodes
## Progress: 64
#### stage: kubernetes: Assign Etcd Role
Assigning EtcdHostRole role
## Progress: 65
#### stage: kubernetes: Collection Data Node Enable
## Progress: 68
#### stage: kubernetes: Assign Containerd Role
## Progress: 70
#### stage: kubernetes: Assign Api Server Role
Assigning KubernetesApiServerRole role
## Progress: 71
#### stage: kubernetes: Assign Api Server Proxy Role
Assigning KubernetesApiServerProxyRole role
## Progress: 72
#### stage: kubernetes: Assign Controller Role
Assigning KubernetesControllerRole role
## Progress: 73
#### stage: kubernetes: Assign Scheduler Role
Assigning KubernetesSchedulerRole role
## Progress: 74
#### stage: kubernetes: Assign Proxy Role
Assigning KubernetesProxyRole role
## Progress: 75
#### stage: kubernetes: Assign Kubelet Master Role
Assigning KubernetesNodeRole role
## Progress: 76
#### stage: kubernetes: Assign Kubelet Role
Assigning KubernetesNodeRole role
## Progress: 77
#### stage: kubernetes: Create Service Accounts CA
Sending request to recreate the CA to cmd on AE HPC Cluster
## Progress: 78
#### stage: kubernetes: Merge Kube Packages Deploy
Deploying new appGroup 'system' with applications:
bootstrap
root
## Progress: 79
#### stage: kubernetes: IP Ports Open
Open Kubernetes 'default' API Server proxy TCP port on node master88
## Progress: 82
#### stage: kubernetes: Wait For Root Service Account Ready
### ERROR FOUND ###
RetryError[<Future at 0x155483eac610 state=finished returned InitializationStates>]
Undo/Abort/Skip/Retry/Info/Debug/Remote debug: u/a/s/r/i/d/D:
It looks like you upgraded your cmdaemon package, but did not upgrade cuda-dcgm-libs. Bright switched to a newer version of DCGM with 9.2-9.
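In case it helps, a minimal check and fix along those lines (package name as mentioned here; assuming a yum-based setup with repository access, and that the same update is also applied to the software images):

# Bring the DCGM libraries in line with the upgraded cmdaemon,
# then restart the cluster manager daemon
yum update cuda-dcgm-libs
systemctl restart cmd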
Best regards,
Martijn
@mdevries1
I have reinstalled the master and the slaves again; it's a fresh OS now. The first thing I tried on it is a Kubernetes cluster, and below is the screenshot of the failure. How can I ignore the permissions manager, or where can I download the 0.1.1 version from?
The same thing happens with another package, the local path provisioner: the required version is 0.0.20 and the installed one is 0.0.23.
[root@master88 ~]# yum list cm-kubernetes-local*
Last metadata expiration check: 3:09:50 ago on Thu 13 Apr 2023 11:50:42 AM KST.
Installed Packages
cm-kubernetes-local-path-provisioner.x86_64 0.0.23-100133_cm9.2_d1e8abc550 @cm-rhel8-9.2-updates
Available Packages
cm-kubernetes-local-path-provisioner-images.x86_64 0.0.21-100131_cm9.2_9686bf4735 cm-rhel8-9.2-updates
[root@master88 ~]#
My CM ISO is Rocky Linux 8.6.
FYI
I upgraded cluster-tools and cmdaemon, but I am still getting the same error. How can I get a Kubernetes version higher than 1.21? I think that is the issue.
Hi,
The cm-setup package should also be updated. We have updated the error message to mention cm-setup instead of cluster-tools.
The Bright version on your cluster appears to be really old. I don't see any concrete version numbers, but the typo in the error message was fixed in 9.2-6, released in October, so the cluster must be in an even older state. It might be useful to update more than just cm-setup, cmdaemon, and cluster-tools, both on the headnode and in the software images.
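A minimal sketch of that update, assuming yum-based repositories and the default software image path (adjust the image name to yours):

# On the headnode itself
yum update cm-setup cmdaemon cluster-tools

# Inside the software image the compute nodes boot from
# (list your images with: ls /cm/images)
yum --installroot=/cm/images/default-image update cmdaemon cluster-tools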
Cheers,
Geert
Updating now, Geert… hope I can see k8s 1.24 :)
:( What am I supposed to do correctly? Please let me know the detailed steps.
Managed to roll forward …
Got it…hope it doesn’t mess up now :)
@gkloosterman @mdevries1
Is the license server down??
Creating a k8s cluster is like a war…
Now my nodes are not starting the kubelet service.
Do I need to create a different image for each worker node, or will the default image work?
The Kubelet error you see is because the cmdaemon package is too old. The older version you're running does not yet know about the kubelet flags that were removed with k8s v1.24.
Note that you need to update cmdaemon both on the headnode and in the software images. You also need to make sure the nodes get the updated cmdaemon package and that the cmd service is restarted on all nodes: either by rebooting, or by a combination of imageupdate (using cmsh or Bright View) to synchronize the updated package and something like pdsh -g computenode systemctl restart cmd to restart the service on all nodes.
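Spelled out as commands, that second option could look something like this (the node category and pdsh group names are assumptions; adjust them to your setup):

# Push the updated software image to the running nodes from the headnode
# (cmsh device mode; -w waits for the image update to complete)
cmsh -c "device; imageupdate -c default -w"

# Restart the cluster manager daemon on all compute nodes in one go
pdsh -g computenode systemctl restart cmd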
Updating now on the headnode and compute nodes…
[screenshot: headnode]
[screenshot: compute node]
…and I think it's working :) Let me check further.
Thanks guys for all the assistance @gkloosterman @mdevries1
I have two more queries, if you could give me a hand:
- How do I deploy add-ons like the NVIDIA GPU Operator, Prometheus, etc. to this running cluster? I didn't select those while deploying the cluster, as I wanted to get the basic Kubernetes setup finished without failures first.
- The next one is a bit more complicated… My lab does not have internet access, so it is very difficult for any user to deploy a K8s cluster without it. We have a local Docker registry; how can we use it to serve the K8s images? Where do I have to make changes so that I am not dependent on yum and public container images?
@gkloosterman @mdevries1, should I create a new post for the above queries?
Yes, it makes sense to create a new post for a new topic.