Release Notes for Nvidia Bright Cluster Manager 9.2-11

== General ==
=Improvements=

  • Added CUDA 12.1 packages
  • Updated mlnx-ofed58 to 5.8-2.0.3.0
  • Updated cm-openssl to 3.0.8
  • Updated the calico binaries (such as calicoctl) bundled with cm-kubernetes124 to 3.24.5, as defined in the calico manifest in cm-kubernetes124
  • Allow the option to specify the number of CPU cores used for building mlnx-ofed via the environment variable RPM_BUILD_NCPUS
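
As an illustration of the RPM_BUILD_NCPUS item above, the following minimal sketch prepares an environment with the variable set and passes it to a build command. Only the variable name comes from these notes; the helper names build_env and run_build, and whatever command the caller supplies, are hypothetical.

```python
import os
import subprocess


def build_env(ncpus, base_env=None):
    """Return a copy of the environment with RPM_BUILD_NCPUS set."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    # Environment values must be strings.
    env["RPM_BUILD_NCPUS"] = str(ncpus)
    return env


def run_build(command, ncpus):
    """Run a caller-supplied build command (hypothetical) with the core limit applied."""
    return subprocess.run(command, env=build_env(ncpus), check=True)
```

Copying the environment instead of modifying os.environ keeps the limit scoped to the one build process.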

=Fixed Issues=

  • Use a stricter regex in the cm-kubernetes package for the default ingress redirect for /dashboard to avoid redirects of unrelated URLs
  • An issue where installing the mlnx-ofed49 package may remove the rdma-core package
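
To show why a stricter regex matters for the /dashboard redirect, the sketch below compares an unanchored pattern with an anchored one. Both patterns and all sample paths are assumptions chosen for illustration; they are not the expressions shipped in the cm-kubernetes package.

```python
import re

# Hypothetical patterns for illustration only.
LOOSE = re.compile(r"/dashboard")        # matches anywhere in the path
STRICT = re.compile(r"^/dashboard/?$")   # matches only the dashboard root


def should_redirect(pattern, path):
    """Return True if the path would be redirected under the given pattern."""
    return pattern.search(path) is not None


if __name__ == "__main__":
    for path in ["/dashboard", "/dashboard/", "/grafana/dashboard/1"]:
        print(path, should_redirect(LOOSE, path), should_redirect(STRICT, path))
```

With the unanchored pattern, a path such as /grafana/dashboard/1 would also be redirected, which is the kind of unrelated URL the fix is meant to exclude.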

=Changes=

  • The head node will no longer create a python → python3 symlink on SLES15, which was previously required for some older python scripts
  • Changed the architecture of the Lmod package from independent (noarch/all) to architecture dependent, which resolves the “module ‘bit32’ not found” issue on Ubuntu

== CMDaemon ==
=New Features=

  • Export the user-email setting from CMDaemon as the CMD_USER_EMAIL environment variable when running the custom UserAddScript
  • Added REST endpoint to allow the option to POST warning events to CMDaemon from external scripts
  • Improve the cloud compute nodes status message to include the reason a spot request in AWS remains open when it is not fulfilled due to a lack of capacity in an AWS availability zone
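
A custom UserAddScript could read the exported variable roughly as follows. This is a sketch that assumes the script is written in Python; only the CMD_USER_EMAIL name comes from these notes, and the function name and printed messages are hypothetical.

```python
import os


def user_email_from_env(env=None):
    """Return the email exported by CMDaemon, or None if it is unset or empty."""
    env = os.environ if env is None else env
    email = env.get("CMD_USER_EMAIL", "").strip()
    return email or None


if __name__ == "__main__":
    email = user_email_from_env()
    if email is None:
        print("no email set for the new user")
    else:
        print(f"new user email: {email}")
```

Treating an empty value the same as a missing one avoids acting on users that have no email configured.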

=Improvements=

  • Ensure malformed strings in the GPU information do not corrupt its JSON serialization in CMDaemon
  • Added average metrics for nvswitch’s temperature, rx, and tx
  • Ensure the JobSampler is configured for OOB (out-of-band sampling) when upgrading to 9.2
  • Allow the option for one NTP server to be preferred over another, where in the case of head node HA one of the head nodes is selected as the preferred NTP server
  • The mounts health check now takes the “noauto” setting of fsmounts into account, so it no longer fails incorrectly when noauto fsmounts definitions are present
  • Resolved an issue where in some cases CMDaemon may fail to update the PBS prolog and epilog hooks, with the message “could not import cm_prolog hook configuration”
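
The idea behind honoring “noauto” in a mounts check can be sketched as below. This is not CMDaemon's implementation: fstab-style lines stand in for the fsmounts definitions, and the function names and sample entries are assumptions.

```python
SAMPLE_FSTAB = """
# device          mountpoint  type  options    dump pass
master:/home      /home       nfs   defaults   0 0
master:/archive   /archive    nfs   noauto,ro  0 0
"""


def expected_mounts(fstab_text):
    """Return mount points that should be mounted, skipping 'noauto' entries."""
    mounts = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore malformed lines
        mountpoint, options = fields[1], fields[3].split(",")
        if "noauto" in options:
            continue  # not mounted automatically, so absence is not a failure
        mounts.append(mountpoint)
    return mounts


def missing_mounts(fstab_text, mounted):
    """Return expected mount points that are not in the list of active mounts."""
    active = set(mounted)
    return [m for m in expected_mounts(fstab_text) if m not in active]
```

Here missing_mounts(SAMPLE_FSTAB, ["/home"]) reports nothing missing even though /archive is not mounted, because that entry is marked noauto.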

=Fixed Issues=

  • An issue where CMDaemon may modify the drain reason message for a Slurm node if the node is already drained when the node is being stopped
  • An issue where on RHEL 9 pam_acct_mgmt may give an error 6, which can cause CMDaemon to prevent access to the user portal on the head node
  • A CMDaemon crash that may occur in some cases in the status UP/DOWN subsystem when CMDaemon is stopping
  • A CMDaemon crash when parsing the Slurm TRES when the number of nodes for a job is 0
  • An issue where the gpu_mem_utilization value is in the range of 0 to 1% (instead of the expected 0 to 100%)
  • Possible deadlock in the labeled entity manager
  • Ensure the GPU monitoring resources get set even when the system information retrieval is delayed due to slow hardware detection
  • An issue where the slurmctld process of the Slurm WLM may crash on “reconfigure” command by CMDaemon when the node count changes
  • An issue where, in some cases, cm-diagnose may not collect the required information from the primary/passive head node when the secondary head node is the active head node

== Node Installer ==
=Fixed Issues=

  • Add the RDMA settings to the corresponding entries in the /etc/fstab file when using NFS over RDMA
  • Improved InfiniBand network interface name detection, which resolves an issue where the node installer does not recognize certain udev persistent device names as InfiniBand devices
  • In some cases, the (re-)generation of the ramdisk for a software image may fail due to an issue in the internal logic of the script when checking if /var/tmp is a soft-link
  • An issue where the disks script can fail to assemble an NVMe-member RAID correctly when the node is using SKIP/NOSYNC install modes

== Cluster Tools ==
=Fixed Issues=

  • An issue where cmha dbreclone may fail if the mysqldump contains special characters

== Head Node Installer ==
=New Features=

  • Allow IB-only clusters to be configured from the head node installer. Supports defining an extra network, such as IB, as a management network

=Fixed Issues=

  • An issue where the head node installer may report an error trying to determine the size of a non-existent block device such as sda
  • An issue where the head node installer does not validate that a domain name does not begin with “.”
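
A check of the kind described in the domain name item might look like the sketch below. It is an assumption-based illustration, not the installer's code: the function name is hypothetical, and the per-label rules follow common hostname conventions (letters, digits and hyphens, no leading or trailing hyphen, at most 63 characters).

```python
import re

# One DNS label: alphanumeric at both ends, hyphens allowed inside, max 63 chars.
LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_domain(name):
    """Return True if name looks like a valid domain that does not start with '.'."""
    if not name or len(name) > 253:
        return False
    if name.startswith("."):
        return False  # the case called out in the release note
    return all(LABEL.fullmatch(label) for label in name.split("."))
```

The explicit leading-dot test is redundant with the per-label check, since a leading dot produces an empty first label, but it keeps the intent visible.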

== Machine Learning ==
=New Features=

  • Introduced ML package cm-cudnn8.5-cuda11.8

== cm-kubernetes-setup ==
=Fixed Issues=

  • An issue where cm-kubernetes-setup can crash with “AttributeError: driver” when deploying Kubernetes with enabled device plugin

== cm-scale ==
=New Features=

  • Improvements in the backfill algorithm of the Auto Scaler when jobs requesting too many resources are already in the queue

=Fixed Issues=

  • An issue with re-purposing nodes from a workload manager to Kubernetes when the software image for the nodes is also changed

== cm-wlm-setup ==
=Improvements=

  • cm-wlm-setup now installs enroot 3.4.1 with Slurm

== cmsh ==
=Improvements=

  • Allow the option to set “before” and “after” time limits when filtering or getting statistics for WLM jobs

=Fixed Issues=

  • An issue with setting the start time of a WLM chargeback report (using the ‘-s’ option)
  • An issue where the cmsh dumpmonitoringdata --time* (such as timesum, timeaverage) operations do not produce the expected result

== slurm23.02 ==
=Improvements=

  • Updated to 23.02.1