Release Notes for Nvidia Bright Cluster Manager 9.0-19

kwoods · January 2, 2023, 1:50pm

Release notes for Bright 9.0-19

== General ==
=New Features=

Update cm-docker to v20.10.17

=Improvements=

Add cuda11.7 packages
Add Mellanox 5.6 OFED stack (mlnx-ofed56 packages)
Update cuda-dcgm to version 2.4.6.1
Update cuda-driver to version 510.47.03
Update cuda11.6 toolkit packages to 11.6 update 2
Update mlnx-ofed49 to version 4.9-5.1.0.0
Update mlnx-ofed56 to version 5.6-2.0.9.0
Update nvhpc to version 22.3

=Fixed Issues=

mlnx-ofed: Incorrect values for the LD* environment variables in the Ubuntu openmpi module file
An issue with installing individual Bright packages on RHEL8 / Rocky8 clusters with FIPS enabled due to the use of MD5 file digests rather than SHA256 file digests
cm-kubernetes: make the https://:30443/dashboard ingress redirect to /dashboard/ to resolve browser-side issues where the browser will show an empty page instead of the dashboard
mlnx-ofed56, mlnx-ofed55, mlnx-ofed54, mlnx-ofed49: added Mellanox OFED KMOD/KMP package build functionality for RPM based distributions
cuda-driver: Load the nvidia_drm kernel module from the cuda-driver script, which otherwise can result in missing EGL devices in /dev/dri
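
For readers unfamiliar with the FIPS item above: MD5 is not an approved algorithm in FIPS mode, so file digests have to be computed with SHA256 instead. The following is a minimal Python sketch of computing a SHA256 file digest. The helper name `file_sha256` is hypothetical, and this is only an illustration, not the packaging code used by Bright.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large package files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```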

== CMDaemon ==
=New Features=

Introduce new CMDaemon advanced configuration options for customizing global nginx.conf values
Introduce new CMDaemon advanced configuration flag that will allow the use of the head node hostname instead of the default master value for the AccountingStorageHost and ControlAddr parameters in the slurm.conf file
Introduce new CMDaemon advanced configuration flags that will allow specifying the From hostname in the email address for emails sent by the sendemail monitoring action
Allow setting a custom per network interface MTU or disabling setting the MTU in the network interface configuration file

=Improvements=

Reduce verbosity of the ‘result for obsolete tracker’ messages, so that they are no longer included by default in the CMDaemon log file
Improved logic when invalidating the nscd hosts cache on the compute nodes, to avoid cases where an outdated cache interferes with hostname lookups
CMDaemon certificates are now generated with a start date of 1 calendar day before the issue date, instead of the Unix epoch
An issue where cm-manipulate-advanced-config.py is missing a python import, resulting in a crash when executed
Added an endpoint prometheus/api/v1/status/buildinfo for the latest Grafana
Allow for monitoring triggers to set post-drain actions
Improved mysql health check no longer requires the mysql password to be included on the command line
Include the command line arguments in the information events generated by CMDaemon when a kubectl command times out
Modifying a network in CMDaemon that is used by Kubernetes will now request the relevant Kubernetes services to update their configuration and restart
An issue where monitoring data for completed jobs is not always removed, which in some cases leads to CMDaemon allocating too much memory
Add category labels to the devices in PromQL
New Program Runner tracing levels to make it less verbose by default, which can decrease the number of logged lines in the CMDaemon log file
Disable the software image /boot directory associations for cloud directors with a list of localimages and allimages set to “no”, which means that /boot of unrelated software images will no longer be synced to the cloud director
In some cases, the cmsh monitoringdataproducer command can crash CMDaemon if the command is executed while in the category mode
Added extra API endpoints to the Prometheus interface for Grafana for version 8.0
Ensure the head node(s) do not fall back to running in a compute node mode when mariadb is not in a good state while CMDaemon is starting
Decrease the timeouts for the CMDaemon service so that CMDaemon is stopped faster

=Fixed Issues=

An issue with CMDaemon event delivery to edge nodes, which can result in outdated information about committed entities
An issue with setting up Kubernetes when the passive head node is the active leader according to Etcd, which in some cases results in Kubernetes not being able to initialize properly
An issue where PBS queue options set in CMDaemon may not be set in the PBS server configuration
An issue with generating a valid Kubernetes kubeconfig for users with special characters in their login name
Performance improvements of the user manager
Rare crash in CMDaemon while cloning an image
An issue where CMDaemon can crash if the Bright View monitoring tree call does not pass a context
Add full support for multi-value http request parameters, which resolves an issue where the “CMDaemon ready” service is not able to handle a list of services by name
In some cases, terminating spot instances with CMDaemon may fail if the spot request has been cancelled outside of CMDaemon
An issue where CMDaemon may occasionally hang on SSL_read while stopping
An issue where the oomkiller health check may not detect the OOM killer has run on RHEL8 compute nodes
An issue where password crypt can generate duplicate edge site secret hashes
An issue where some older base distribution versions of openssl are unable to create FIPS compliant DH parameters during add-on installation
An issue with configuring the Postfix root alias in /etc/aliases on distros using Postfix 3.0 and higher, where emails to root on the compute nodes can no longer be delivered
Do not retry CMProc::rexecCommand when the ptracker is no longer defined, which otherwise can result in error messages in the CMDaemon log file
An issue with dumping the data for all entities and measurables when using the REST API
An issue with the pythoncm programrunnerstatus kill method not working in some cases
An issue where CMDaemon may attempt to start slurmdbd service before its configuration file has been updated after HA takeover
A typo in the CMDaemon cookie manager, which in some cases can result in users being unable to log in to the user portal
An issue deploying openpbs with the server role assigned to multiple compute nodes
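
As an illustration of the multi-value http request parameters item above, a parameter that is repeated in a query string should be read as a list rather than as a single value. A minimal Python sketch using the standard library follows. The parameter name `service` and the query string shape are assumptions made for the example and are not taken from the CMDaemon API.

```python
from urllib.parse import parse_qs


def service_names(query_string):
    """Return every value given for the (hypothetical) 'service' parameter."""
    # parse_qs maps each parameter name to a list, so repeated names are all kept.
    params = parse_qs(query_string)
    return params.get("service", [])
```

With this approach a request that names several services yields all of them, in the order given, instead of only the first or the last one.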

== Bright View ==
=Fixed Issues=

An issue with updating properties such as fsmount when a node or category also has static routes, resulting in an error message “The destination cannot be empty”
An issue with showing the Last Change date for users
An issue with clearing the BMC user-id setting in Bright View when the value is negative

== Node Installer ==
=Improvements=

New disableNodeInstallerNFSCertificateStore configuration setting in the node-installer.conf file to allow for disabling the certificates mount
Allow the node-installer to continue configuring IPMI after a failure to set username and password if the user already exists

=Fixed Issues=

An issue with the configure_ipmi.pl script not working when the user id is set to 0
An issue where disabled provisioning associations in the node-installer may still be rsynced

== Cluster Tools ==
=Fixed Issues=

An issue with cloning the mysql database when using cmha dbreclone when a configuration file /root/.my.cnf with other mysql credentials exists

== cmjob ==
=Fixed Issues=

An issue with transferring PBS job array outputs

== Machine Learning ==
=New Features=

Introduce packages cm-cudnn8.2-cuda11.4
Introduce packages cm-cudnn8.4-cuda11.4

=Improvements=

Introduce environment variable JUPYTER_KERNEL_TEMPLATES_DIR for cm-jupyter-kernel-creator templates
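
A minimal Python sketch of how a tool might honor such a variable is shown here. It assumes that JUPYTER_KERNEL_TEMPLATES_DIR names the directory in which templates are looked up. The fallback path and the function name are hypothetical and are not taken from cm-jupyter-kernel-creator.

```python
import os

# Hypothetical fallback used only for this example; the real default is not stated in these notes.
DEFAULT_TEMPLATES_DIR = "/tmp/example-kernel-templates"


def kernel_templates_dir(environ=None):
    """Return the templates directory, preferring JUPYTER_KERNEL_TEMPLATES_DIR when set."""
    env = os.environ if environ is None else environ
    return env.get("JUPYTER_KERNEL_TEMPLATES_DIR", DEFAULT_TEMPLATES_DIR)
```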

== cm-create-image ==
=Fixed Issues=

An issue where images created with cm-create-image do not preserve the xattrs of the base tar image
An issue where node-installer images created using the cm-create-image tool do not have an updated rsyslog.conf file
An issue where the sanity checks fail for archives created with a leading “./”
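
To illustrate the leading “./” item above: archives created from inside a directory (for example with `tar -cf image.tar .`) store member names such as `./etc/os-release` rather than `etc/os-release`, so a check that compares names literally can miss them. A minimal Python sketch of normalizing member names before comparison follows. The function name is hypothetical and this is not the check used by cm-create-image.

```python
import posixpath


def normalize_member_name(name):
    """Return an archive member name without a leading './' or trailing '/'."""
    # normpath also collapses duplicate slashes and '..' segments.
    normalized = posixpath.normpath(name)
    # The archive root itself ('.' or './') normalizes to '.'; report it as empty.
    return "" if normalized == "." else normalized
```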

== cm-kubernetes-setup ==
=Improvements=

Enable the selection of newer Kubernetes versions by default in the cm-kubernetes-setup screens, which until now was available only if a special command line option was used
An issue with Kubernetes on Edge deployments, where the stage “waiting for Root Service Account” is performed too early and may not complete successfully in some cases
In the Kubernetes module files, remove the MANPATH definitions which are no longer used
The ‘enabled’ fields under the ‘calico:’ and ‘flannel:’ blocks in the cm-kubernetes-setup configuration files are no longer used and have been removed
Use the --overwrite command line flag when running kubectl taint to avoid errors when the taint already exists
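
As a rough illustration of the kubectl taint item above, the following Python sketch builds such a command line. The node name and taint used in the usage note are hypothetical, and the helper is not part of cm-kubernetes-setup.

```python
import shlex


def taint_command(node, taint):
    """Build a 'kubectl taint' argument list that replaces an existing taint instead of failing."""
    return ["kubectl", "taint", "nodes", node, taint, "--overwrite"]


def taint_command_line(node, taint):
    """Return the same command as a shell-quoted string, for logging or display."""
    return shlex.join(taint_command(node, taint))
```

For example, `taint_command("node001", "example-key=example-value:NoSchedule")` gives an argument list that can be passed to `subprocess.run`.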

=Fixed Issues=

Allow shorewall traffic between calico (cali+) wildcard interfaces to be routed back to the same interface, to resolve an issue where some services are unable to connect and are reporting a timeout

== cm-scale ==
=Fixed Issues=

An issue where failures to create cloud node instances are not detected in some cases

== cm-uge ==
=Improvements=

Update the default settings in cm-uge to allow running OpenMPI jobs without involving ssh

== cm-wlm-setup ==
=Improvements=

Deployment of IBM Spectrum LSF Suite is no longer supported. The supported option remains the deployment of LSF Standard Edition
Automatically remove the WLM settings from the Auto Scaler configuration when the WLM is disabled

=Fixed Issues=

An issue with making the pbs.service file available on the compute nodes with offloaded PBSPro server role, which prevents the PBSPro server from starting during the setup

== cmsh ==
=Fixed Issues=

An issue where the XSD validation is not always loaded in cmsh when configuring a disk setup for the compute nodes
An issue where tab completions do not work in the cmsh role mode
cmsh color off command doesn’t turn off all colors
An issue where cloning users or groups in cmsh does not reset some of the settings to the correct default values
cmsh permissions on Ubuntu are 700 instead of 755

== openpbs22.05 ==
=Improvements=

Add OpenPBS 22.05 integration

== slurm ==
=Improvements=

Rebuild the Ubuntu Slurm packages with cm-pmix3

== slurm21.08 ==
=Improvements=

Upgrade to 21.08.6

=Fixed Issues=

An issue where srun, at the end of its execution, produces messages that it is unable to read files under /sys/fs/cgroup
