Release Notes for Nvidia Bright Cluster Manager 9.2-11

kwoods · May 11, 2023, 10:07pm

Release notes for Bright 9.2-11

== General ==
=Improvements=

Added CUDA 12.1 packages
Updated mlnx-ofed58 to 5.8-2.0.3.0
Updated cm-openssl to 3.0.8
Updated the calico binaries (such as calicoctl) bundled with cm-kubernetes124 to 3.24.5, as defined in the calico manifest in cm-kubernetes124
Allow the option to specify the number of CPU cores used for building mlnx-ofed via the environment variable RPM_BUILD_NCPUS
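
As a rough illustration of the RPM_BUILD_NCPUS item above, the sketch below prepares an environment with the variable set before a build is started. The helper name build_env is hypothetical, and the command that actually triggers the mlnx-ofed build is not given in these notes, so it is left as a comment.

```python
import os


def build_env(ncpus, base_env=None):
    """Return a copy of the environment with RPM_BUILD_NCPUS set.

    build_env is a hypothetical helper name used only for this sketch.
    """
    if ncpus < 1:
        raise ValueError("ncpus must be a positive integer")
    env = dict(os.environ if base_env is None else base_env)
    env["RPM_BUILD_NCPUS"] = str(ncpus)
    return env


# Use half of the available cores, but always at least one.
cores = max(1, (os.cpu_count() or 1) // 2)
env = build_env(cores)

# The prepared environment would then be handed to whatever command
# starts the build, for example:
#   subprocess.run([...], env=env, check=True)
```

Copying the environment instead of modifying os.environ in place keeps the setting scoped to the build process only.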

=Fixed Issues=

Use a stricter regex in the cm-kubernetes package for the default ingress redirect for /dashboard to avoid redirects of unrelated URLs
An issue where installing the mlnx-ofed49 package may remove the rdma-core package
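
To show why a stricter expression matters for the /dashboard redirect, here is a minimal sketch that compares a loose and an anchored pattern. Both patterns are assumptions made for illustration; the actual expressions used in the cm-kubernetes package are not shown in these notes.

```python
import re

# Hypothetical patterns, not the ones shipped in cm-kubernetes.
LOOSE = re.compile(r"/dashboard")         # matches anywhere in the path
STRICT = re.compile(r"^/dashboard/?$")    # matches only the dashboard root


def should_redirect(path, pattern):
    """Return True if the given pattern would trigger a redirect for path."""
    return pattern.search(path) is not None


paths = ["/dashboard", "/dashboard/", "/api/dashboard-metrics", "/dashboards/team"]
for p in paths:
    print(p, should_redirect(p, LOOSE), should_redirect(p, STRICT))
```

The loose pattern also matches unrelated URLs that merely contain the string, which is the kind of unintended redirect the fix describes.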

=Changes=

The head node will no longer create a python → python3 symlink on SLES15, which was previously required for some older python scripts
Changed the architecture of the Lmod package from independent (noarch/all) to architecture dependent, which resolves the “module ‘bit32’ not found” issue on Ubuntu

== CMDaemon ==
=New Features=

Export the user-email setting from CMDaemon as the CMD_USER_EMAIL environment variable when running the custom UserAddScript
Added REST endpoint to allow the option to POST warning events to CMDaemon from external scripts
Improve the cloud compute nodes status message to include the reason a spot request in AWS remains open when it is not fulfilled due to a lack of capacity in an AWS availability zone
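
The CMD_USER_EMAIL item above could be consumed by a custom UserAddScript along these lines. This is only a sketch: the variable name comes from the notes, while the function names and the notification step are placeholders, and any other arguments or variables CMDaemon passes to the script are not covered here.

```python
import os
import sys


def user_email(environ):
    """Return the e-mail exported by CMDaemon, or None if unset or empty."""
    email = environ.get("CMD_USER_EMAIL", "").strip()
    return email or None


def main(environ=os.environ):
    email = user_email(environ)
    if email is None:
        # Do not fail user creation just because no address was configured.
        print("no e-mail set for the new user; skipping notification",
              file=sys.stderr)
        return 0
    # Placeholder action: a real script might send a welcome message here.
    print(f"would notify {email}")
    return 0


status = main()
```

Treating a missing or empty variable as a non-error keeps the script usable for users created without an e-mail address.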

=Improvements=

Ensure malformed strings in the GPU information do not corrupt its JSON serialization in CMDaemon
Added average metrics for nvswitch’s temperature, rx, and tx
Ensure the JobSampler is configured for OOB (out-of-band sampling) when upgrading to 9.2
Allow the option for one NTP server to be preferred over another; in the case of head node HA, one of the head nodes is selected as the preferred NTP server

=Fixed Issues=

An issue where the mounts health check does not take into account the “noauto” setting for the fsmounts, resulting in an incorrect failure of the health check when noauto fsmounts definitions are present
An issue where in some cases CMDaemon may fail to update the pbs prolog and epilog hooks with the message “could not import cm_prolog hook configuration”
An issue where CMDaemon may modify the drain reason message for a Slurm node if the node is already drained when the node is being stopped
An issue where on RHEL 9 pam_acct_mgmt may give an error 6, which can cause CMDaemon to prevent access to the user portal on the head node
An issue where, in some cases, CMDaemon may crash when stopping in the status UP/DOWN subsystem
An issue where CMDaemon may crash when parsing the Slurm TRES when the number of nodes for a job is 0
An issue where the gpu_mem_utilization value is in the range of 0 to 1% (instead of the expected 0 to 100%)
Possible deadlock in the labeled entity manager
Ensure the GPU monitoring resources get set even when the system information retrieval is delayed due to slow hardware detection
An issue where the slurmctld process of the Slurm WLM may crash on “reconfigure” command by CMDaemon when the node count changes
An issue where, in some cases, cm-diagnose may not collect the required information from the primary/passive head node when the secondary head node is the active head node
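
The mounts health check fix listed for CMDaemon in this release concerns entries marked “noauto”. The sketch below illustrates the general idea using fstab-style text: entries with the noauto option are not expected to be mounted, so a check should skip them. This is not the CMDaemon implementation, and the host name and mount points in the sample are made up.

```python
def expected_mounts(fstab_text):
    """Return mount points that should be mounted, skipping 'noauto' entries."""
    mounts = []
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed entry; ignored in this sketch
        mountpoint, options = fields[1], fields[3].split(",")
        if "noauto" in options:
            continue  # not mounted automatically, so absence is not a failure
        mounts.append(mountpoint)
    return mounts


# Sample data with hypothetical host and paths.
SAMPLE = """\
# device        mountpoint  type  options        dump pass
master:/home    /home       nfs   defaults       0 0
master:/scratch /scratch    nfs   noauto,rw      0 0
"""
print(expected_mounts(SAMPLE))
```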

== Node Installer ==
=Fixed Issues=

Add the RDMA settings to the corresponding entries in the /etc/fstab file when using NFS over RDMA
Improved InfiniBand network interface name detection, which resolves an issue where the node installer does not recognize certain udev persistent device names as InfiniBand devices
In some cases, the (re-)generation of the ramdisk for a software image may fail due to an issue in the internal logic of the script when checking if /var/tmp is a soft-link
An issue where the disks script can fail to assemble an NVMe-member RAID correctly when the node is using SKIP/NOSYNC install modes
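
For the NFS over RDMA item in this section, the sketch below shows what adding RDMA settings to an /etc/fstab entry can look like. It assumes the standard Linux NFS client options proto=rdma and port=20049; the exact options the node installer writes are not stated in these notes, and the helper name add_rdma_options is hypothetical.

```python
def add_rdma_options(fstab_line, port=20049):
    """Append NFS-over-RDMA mount options to a single fstab entry.

    Assumes the common Linux NFS client options; non-NFS or malformed
    entries are returned unchanged.
    """
    fields = fstab_line.split()
    if len(fields) < 4 or not fields[2].startswith("nfs"):
        return fstab_line
    options = fields[3].split(",")
    if not any(o in ("rdma", "proto=rdma") for o in options):
        options.append("proto=rdma")
    if not any(o.startswith("port=") for o in options):
        options.append(f"port={port}")
    fields[3] = ",".join(options)
    return " ".join(fields)


# Hypothetical entry for illustration.
print(add_rdma_options("master:/home /home nfs defaults 0 0"))
```

Checking for existing options first makes the helper safe to run repeatedly on the same entry.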

== Cluster Tools ==
=Fixed Issues=

An issue where cmha dbreclone may fail if the mysqldump contains special characters

== Head Node Installer ==
=New Features=

Allow IB-only clusters to be configured from the head node installer. Supports defining an extra network, such as IB, as a management network

=Fixed Issues=

An issue where the head node installer may report an error trying to determine the size of a non-existent block device such as sda
An issue where the head node installer does not validate that a domain name does not begin with “.”
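
The domain name check mentioned above can be sketched as follows. This is a generic validator written for illustration, not the head node installer's code; the label rules follow the usual DNS hostname conventions, and the function name is hypothetical.

```python
import re

# One DNS label: letters, digits and hyphens, 1-63 characters,
# with no leading or trailing hyphen.
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def is_valid_domain(name):
    """Return True if name looks like a valid domain name."""
    if not name or len(name) > 253:
        return False
    if name.startswith("."):
        return False  # the case described in the fix above
    # A single trailing dot (fully qualified form) is allowed.
    if name.endswith("."):
        name = name[:-1]
    return all(_LABEL.match(label) for label in name.split("."))


for candidate in ["cm.cluster", ".cm.cluster", "example.com.", "a..b"]:
    print(candidate, is_valid_domain(candidate))
```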

== Machine Learning ==
=New Features=

== cm-kubernetes-setup ==
=Fixed Issues=

An issue where cm-kubernetes-setup can crash with “AttributeError: driver” when deploying Kubernetes with enabled device plugin

== cm-scale ==
=New Features=

Improvements in the backfill algorithm of the Auto Scaler when jobs requesting too many resources are already in the queue

=Fixed Issues=

An issue with re-purposing nodes from a workload manager to Kubernetes when the software image for the nodes is also changed

== cm-wlm-setup ==
=Improvements=

== cmsh ==
=Improvements=

Allow the option to set “before” and “after” time limits when filtering or getting statistics for WLM jobs

=Fixed Issues=

An issue with setting the start time of a WLM chargeback report (using the ‘-s’ option)
An issue where the cmsh dumpmonitoringdata --time* (such as timesum, timeaverage) operations do not produce the expected result

== slurm23.02 ==
=Improvements=
