Creating a Kubernetes cluster on BCM, Exception: 'Version of the Local Path Provisioner 0.0.23 is too new'

Hi Team,

Below is the error when setting up K8s on BCM:

I 12-Apr-2023 13:55:03 | cmsetup.engine.cmlog | ## Progress: 38
I 12-Apr-2023 13:55:03 | cmsetup.engine.cmlog | #### stage: kubernetes: Check Local Path Provisioner Version
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess | Executing command:
'/cm/local/apps/kubernetes/current/bin/helm show chart /cm/shared/apps/kubernetes-local-path-provisioner/current/helm/cm-kubernetes-local-path-provisioner-*.tgz'

D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess | apiVersion: v2
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess | appVersion: 0.0.23
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess | description: A Helm chart for local path provisioner
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess | name: cm-kubernetes-local-path-provisioner
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess | type: application
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess | version: 0.0.23
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess |
D 12-Apr-2023 13:55:03 | exec_helpers.Subprocess | Command '/cm/local/apps/kubernetes/current/bin/helm show chart /cm/shared/apps/kubernetes-local-path-provisioner/current/helm/cm-kubernetes-local-path-provisioner-*.tgz' exit code: EX_OK<0(0x00)>
D 12-Apr-2023 13:55:03 | cmsetup.engine.cmsetup | Disconnect from cluster.
I 12-Apr-2023 13:55:03 | pythoncm.entity_change | Stop event change watcher
I 12-Apr-2023 13:55:03 | cmsetup.engine.cmsetup |
I 12-Apr-2023 13:55:03 | cmsetup.engine.cmsetup | Took:     00:07 min.
I 12-Apr-2023 13:55:03 | cmsetup.engine.cmsetup | Progress: 38/100
I 12-Apr-2023 13:55:03 | cmsetup.engine.cmsetup | ################### Finished execution for 'Kubernetes Setup', status: Failed (aborted)
I 12-Apr-2023 13:55:03 | cmsetup.engine.cmsetup |
I 12-Apr-2023 13:55:03 | cmsetup.engine.cmsetup |
E 12-Apr-2023 13:55:03 | cmsetup.engine.main | Exception: 'Version of the Local Path Provosioner 0.0.23 is too new for cluster-tools. Latest supported is 0.0.20. Please update.'
E 12-Apr-2023 13:55:03 | cmsetup.engine.main | Version of the Local Path Provosioner 0.0.23 is too new for cluster-tools. Latest supported is 0.0.20. Please update.
I 12-Apr-2023 13:55:03 | cmsetup.engine.main |
I 12-Apr-2023 13:55:03 | cmsetup.engine.main | See the /var/log/cm-setup.log for details.
I 12-Apr-2023 13:55:03 | cmsetup.engine.main |
D 12-Apr-2023 13:55:03 | cmsetup.engine.cmlog |
D 12-Apr-2023 13:55:03 | cmsetup.engine.cmlog | ################################  END  ###########################################
D 12-Apr-2023 13:55:03 | cmsetup.engine.cmlog |

I managed to fix the above, but the cmd service on my compute nodes is failing after reboot:

[root@node001 ~]# systemctl status cmd.service
● cmd.service - Bright Computing Cluster Manager daemon
   Loaded: loaded (/usr/lib/systemd/system/cmd.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
[root@node001 ~]# systemctl restart cmd.service
Job for cmd.service failed because the control process exited with error code.
See "systemctl status cmd.service" and "journalctl -xe" for details.

Apr 12 14:56:40 node001 safe_cmd[9130]: + logger -s -p daemon.crit -t CMDaemon 'CMDaemon indicates it could not start (most likely due to missing library)'
Apr 12 14:56:40 node001 CMDaemon[9157]: CMDaemon indicates it could not start (most likely due to missing library)
Apr 12 14:56:40 node001 safe_cmd[9157]: <26>Apr 12 14:56:40 CMDaemon: CMDaemon indicates it could not start (most likely due to missing library)
Apr 12 14:56:40 node001 safe_cmd[9130]: + sendFailedToStartWarningMail
Apr 12 14:56:40 node001 safe_cmd[9158]: + cat
Apr 12 14:56:40 node001 safe_cmd[9160]: ++ hostname
Apr 12 14:56:40 node001 safe_cmd[9161]: ++ hostname
Apr 12 14:56:40 node001 safe_cmd[9159]: + mail -s 'CMDaemon was unable to start on node001!' root@master
Apr 12 14:56:40 node001 safe_cmd[9164]: ++ ls -hal '/var/spool/cmd/core.*'
Apr 12 14:56:40 node001 safe_cmd[9166]: ++ ls -hal '/var/lib/systemd/coredump/cmd.core.*'
Apr 12 14:56:40 node001 safe_cmd[9130]: + runHook failed_to_start_warning_hook
Apr 12 14:56:40 node001 safe_cmd[9130]: + filename=/cm/local/apps/cmd/sbin/failed_to_start_warning_hook
Apr 12 14:56:40 node001 safe_cmd[9130]: + '[' -x /cm/local/apps/cmd/sbin/failed_to_start_warning_hook ']'
Apr 12 14:56:40 node001 safe_cmd[9130]: + break
Apr 12 14:56:40 node001 safe_cmd[9130]: + '[' -s /var/spool/cmd/cmd.output.bYrz7xTIRv ']'
Apr 12 14:56:40 node001 safe_cmd[9169]: + echo 'CMDaemon stdout (note that timestamps could be incorrect):'
Apr 12 14:56:40 node001 safe_cmd[9170]: + logger -s -p daemon.crit -t CMDaemon
Apr 12 14:56:40 node001 safe_cmd[9169]: + cat /var/spool/cmd/cmd.output.bYrz7xTIRv
Apr 12 14:56:40 node001 CMDaemon[9170]: CMDaemon stdout (note that timestamps could be incorrect):
Apr 12 14:56:40 node001 safe_cmd[9170]: <26>Apr 12 14:56:40 CMDaemon: CMDaemon stdout (note that timestamps could be incorrect):
Apr 12 14:56:40 node001 safe_cmd[9170]: <26>Apr 12 14:56:40 CMDaemon: /cm/local/apps/cmd/sbin/cmd: error while loading shared libraries: libdcgm.so.2: cannot open shared object file: No such file or directory
Apr 12 14:56:40 node001 CMDaemon[9170]: /cm/local/apps/cmd/sbin/cmd: error while loading shared libraries: libdcgm.so.2: cannot open shared object file: No such file or directory
Apr 12 14:56:40 node001 safe_cmd[9130]: + rm -f /var/spool/cmd/cmd.output.bYrz7xTIRv
Apr 12 14:56:40 node001 safe_cmd[9130]: + exit 127
Apr 12 14:56:40 node001 systemd[1]: cmd.service: Main process exited, code=exited, status=127/n/a
Apr 12 14:56:41 node001 wait_cmd[9131]: + let retries=55-1
Apr 12 14:56:41 node001 wait_cmd[9131]: + '[' 54 -gt 0 ']'
Apr 12 14:56:41 node001 wait_cmd[9131]: + '[' -z 9130 -o '!' -d /proc/9130 ']'
Apr 12 14:56:41 node001 wait_cmd[9131]: + exit 1
Apr 12 14:56:41 node001 systemd[1]: cmd.service: Control process exited, code=exited status=1
Apr 12 14:56:41 node001 systemd[1]: cmd.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- The unit cmd.service has entered the 'failed' state with result 'exit-code'.
Apr 12 14:56:41 node001 systemd[1]: Failed to start Bright Computing Cluster Manager daemon.
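The key line above is the missing libdcgm.so.2. To confirm exactly which shared libraries fail to resolve, plain ldd works (the cmd path is taken from the log above):

ldd /cm/local/apps/cmd/sbin/cmd | grep 'not found'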

## Progress: 59
#### stage: kubernetes: Apply Sysctl Config
## Progress: 60
#### stage: kubernetes: Collection Nodes Reboot
node001: reboot requested
node002: reboot requested
node003: reboot requested
Press ctrl+c to abort waiting and continue with deployment
Waiting for nodes to start reboot
Going to wait up to 15 minutes for the nodes to come back up.


Your nodes have already been rebooting for a significant amount of time (half of the total timeout time). This might be expected due to relatively slow network/hardware or due to large software images. Please make sure the nodes are continuing the provisioning process properly. If they are, ignore this message, sit back, and wait for the process to finish.


Not all of the nodes came back up after the reboot (node001, node002, node003). However, none of those were mandatory for the setup process . Waited for 15 minutes, not waiting any longer, setup process will proceed.
## Progress: 61
#### stage: kubernetes: Disable Swap On Compute Nodes
## Progress: 62
#### stage: kubernetes: Create Exclude List For Gpu Operator
## Progress: 63
#### stage: kubernetes: Create Configuration Overlay
Creating configuration overlay kube-default-etcd
Adding nodes
## Progress: 64
#### stage: kubernetes: Assign Etcd Role
Assigning EtcdHostRole role
## Progress: 65
#### stage: kubernetes: Collection Data Node Enable
## Progress: 68
#### stage: kubernetes: Assign Containerd Role
## Progress: 70
#### stage: kubernetes: Assign Api Server Role
Assigning KubernetesApiServerRole role
## Progress: 71
#### stage: kubernetes: Assign Api Server Proxy Role
Assigning KubernetesApiServerProxyRole role
## Progress: 72
#### stage: kubernetes: Assign Controller Role
Assigning KubernetesControllerRole role
## Progress: 73
#### stage: kubernetes: Assign Scheduler Role
Assigning KubernetesSchedulerRole role
## Progress: 74
#### stage: kubernetes: Assign Proxy Role
Assigning KubernetesProxyRole role
## Progress: 75
#### stage: kubernetes: Assign Kubelet Master Role
Assigning KubernetesNodeRole role
## Progress: 76
#### stage: kubernetes: Assign Kubelet Role
Assigning KubernetesNodeRole role
## Progress: 77
#### stage: kubernetes: Create Service Accounts CA
Sending request to recreate the CA to cmd on AE HPC Cluster
## Progress: 78
#### stage: kubernetes: Merge Kube Packages Deploy
Deploying new appGroup 'system' with applications:
        bootstrap
        root
## Progress: 79
#### stage: kubernetes: IP Ports Open
Open Kubernetes 'default' API Server proxy TCP port on node master88
## Progress: 82
#### stage: kubernetes: Wait For Root Service Account Ready
### ERROR FOUND ###
RetryError[<Future at 0x155483eac610 state=finished returned InitializationStates>]
Undo/Abort/Skip/Retry/Info/Debug/Remote debug: u/a/s/r/i/d/D:

It looks like you upgraded your cmdaemon package, but did not upgrade cuda-dcgm-libs. Bright switched to a newer version of DCGM with 9.2-9.
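A minimal sketch of the fix, assuming the default software image path /cm/images/default-image (adjust to your image name):

# On the head node:
yum install cuda-dcgm-libs

# Inside the software image used by the compute nodes:
yum --installroot=/cm/images/default-image install cuda-dcgm-libs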

Best regards,

Martijn

@mdevries1

I have reinstalled the master and slaves again; it is a fresh OS now. The first thing I tried on it is a Kubernetes cluster; below is the screenshot of the failure. How can I ignore the permissions-manager check, or where can I download version 0.1.1?

The same thing happens with another package, the local path provisioner: the required version is 0.0.20 and the installed one is 0.0.23.

[root@master88 ~]# yum list cm-kubernetes-local*
Last metadata expiration check: 3:09:50 ago on Thu 13 Apr 2023 11:50:42 AM KST.
Installed Packages
cm-kubernetes-local-path-provisioner.x86_64                                                            0.0.23-100133_cm9.2_d1e8abc550                                                      @cm-rhel8-9.2-updates
Available Packages
cm-kubernetes-local-path-provisioner-images.x86_64                                                     0.0.21-100131_cm9.2_9686bf4735                                                      cm-rhel8-9.2-updates
[root@master88 ~]#

My CM ISO is Rocky Linux 8.6.

FYI: I upgraded cluster-tools and cmdaemon, but I am still getting the same error. How can I get a Kubernetes version higher than 1.21? I think that is the issue.

Hi,

The cm-setup package should also be updated. We have updated the error message to mention cm-setup instead of cluster-tools.

The Bright version on your cluster appears to be really old. I don’t see any concrete version numbers, but the typo in the error message was fixed in 9.2-6, released in October, so the cluster must be in an even older state. It might be useful to update more than just cm-setup, cmdaemon, and cluster-tools, both on the headnode and in the software images.
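As a rough sketch of the update, assuming the default software image path /cm/images/default-image (adjust to your image name):

# On the head node:
yum update cm-setup cmdaemon cluster-tools

# In the software image:
yum --installroot=/cm/images/default-image update cmdaemon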

Cheers,
Geert

Updating now, Geert… hope I can see k8s 1.24 :)

:( What am I supposed to do correctly? Please let me know the detailed steps.

Managed to roll forward …

Got it…hope it doesn’t mess up now :)

@gkloosterman @mdevries1
Is the license server down?

Creating a k8s cluster is like a war…

Now my nodes are not starting the kubelet service.


Do I need to create different images for each worker node, or will the default image work?

The kubelet error you see is because the cmdaemon package is too old. The older version you’re running does not yet know about the kubelet flags that were removed in k8s v1.24.

Note that you need to update cmdaemon both on the headnode and in the software images. You also need to make sure the nodes actually get the updated cmdaemon package and that the cmd service is restarted on all nodes: either by rebooting, or by a combination of imageupdate (via cmsh or Bright View) to synchronize the updated package and something like pdsh -g computenode systemctl restart cmd to restart the service on all nodes.
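For example (the category name default and the node group computenode are assumptions; adjust to your setup):

# Synchronize the updated packages from the image to the running nodes:
cmsh -c 'device; imageupdate -c default -w'

# Restart cmd on all compute nodes:
pdsh -g computenode systemctl restart cmd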

Updating on the headnode and compute nodes…

Headnode

Compute Node

And I think it's working :) Let me check further.
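A quick sanity check with plain kubectl (assuming the kubeconfig from the deployment is loaded):

kubectl get nodes
kubectl get pods -A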



Thanks guys for all the assistance @gkloosterman @mdevries1

I have two more queries, if you could give me a hand:

  1. How do I deploy add-ons like the NVIDIA GPU Operator, Prometheus, etc. to this running cluster? I didn’t select those while deploying this cluster, as I wanted to finish the basic Kubernetes setup without failures.
  2. The next one is a bit more complicated… my lab does not have internet access, so it is very difficult for any user to deploy a K8s cluster without it.
    We have a local Docker repo; how can we use it to serve the K8s images?
    Where do I have to make changes so that I am not dependent on yum and public container images?

@gkloosterman @mdevries1, should I create a new post for the above queries?

Yes, it makes sense to create a new post for a new topic.

Created, please check.