Release Notes for Nvidia Bright Cluster Manager 9.1-17

== General ==
= Major Updates to Packages =

  • Added CUDA 12.1 packages
  • Updated mlnx-ofed23.04 to 23.04-
  • Updated mlnx-ofed58 to 5.8-
  • Updated openssl to 1.1.1t

= Improvements =

  • The head node no longer creates a python → python3 symlink on SLES15; the symlink was previously required for some older Python scripts
  • Allow the option to specify the number of CPU cores used for building mlnx-ofed via the environment variable RPM_BUILD_NCPUS
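
As a rough illustration of the RPM_BUILD_NCPUS option above, the following Python sketch prepares an environment that caps the number of cores for a build. Only the variable name RPM_BUILD_NCPUS comes from these notes; the helper name ofed_build_env and the install command placeholder are hypothetical.

```python
import os


def ofed_build_env(ncpus, base_env=None):
    """Return a copy of the environment with RPM_BUILD_NCPUS set.

    ncpus is the number of CPU cores the mlnx-ofed build may use.
    The helper name is hypothetical; only the variable name is from the notes.
    """
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["RPM_BUILD_NCPUS"] = str(ncpus)
    return env


if __name__ == "__main__":
    # Use at most 8 cores, or fewer if the machine has fewer.
    env = ofed_build_env(min(8, os.cpu_count() or 1))
    # Placeholder: substitute the real mlnx-ofed install command here, e.g.
    # subprocess.run(["<mlnx-ofed install script>"], env=env, check=True)
    print("RPM_BUILD_NCPUS =", env["RPM_BUILD_NCPUS"])
```

Passing the variable through a copied environment keeps the setting local to the build process instead of changing it for the whole session.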

= Fixed Issues =

  • An issue where the CM Lmod package may be replaced by an Lmod package from EPEL
  • Changed the architecture of the Lmod package from independent (noarch/all) to architecture dependent, which resolves the “module ‘bit32’ not found” issue on Ubuntu
  • Use a stricter regex in the cm-kubernetes package for the default ingress redirect for /dashboard to avoid redirects of unrelated URLs
  • An issue where the 90-cm-sysctl.conf file is not marked as a configuration file on Ubuntu-based distributions
  • An issue where installing the mlnx-ofed49 package may remove the rdma-core package
  • An issue where the mlnx-ofed install script does not add libibverbs packages to the dnf exclude list, which can break OFED compatibility when “dnf update” is later run to update the packages
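
The stricter /dashboard redirect match mentioned in the list above can be illustrated with a small Python sketch. The two patterns below are assumptions chosen for illustration, not the expressions shipped in the cm-kubernetes package; they only show why an anchored pattern avoids redirecting unrelated URLs.

```python
import re

# Loose pattern (illustrative): matches any path that merely contains "/dashboard".
LOOSE = re.compile(r"/dashboard")

# Stricter pattern (illustrative): matches only "/dashboard", optionally
# followed by a single trailing slash.
STRICT = re.compile(r"^/dashboard/?$")


def should_redirect(path, pattern=STRICT):
    """Return True if the request path matches the redirect pattern."""
    return pattern.search(path) is not None


if __name__ == "__main__":
    for path in ("/dashboard", "/dashboard/", "/grafana/dashboards"):
        print(path, should_redirect(path, LOOSE), should_redirect(path, STRICT))
```

With the loose pattern, a path such as /grafana/dashboards would also be redirected, because it contains the substring /dashboard; the anchored pattern rejects it.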

== CMDaemon ==
= Improvements =

  • Ensure malformed strings in the GPU information do not corrupt the JSON serialization in CMDaemon
  • Allow one NTP server to be preferred over another; with head node HA, one of the head nodes is selected as the preferred NTP server
  • The mounts health check now takes the “noauto” setting of fsmounts into account, so it no longer fails incorrectly when noauto fsmounts definitions are present
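
To make the “noauto” handling in the mounts health check concrete, here is a minimal Python sketch, assuming a simplified data layout. The function names and the (mountpoint, options) layout are hypothetical and do not reflect CMDaemon's internal implementation.

```python
def expected_mounts(fsmounts):
    """Return mountpoints that should be mounted, skipping "noauto" entries.

    fsmounts is a list of (mountpoint, options) pairs, where options is a
    comma-separated string as used in /etc/fstab (hypothetical layout).
    """
    expected = []
    for mountpoint, options in fsmounts:
        opts = [o.strip() for o in options.split(",")]
        if "noauto" in opts:
            continue  # not mounted automatically, so absence is not a failure
        expected.append(mountpoint)
    return expected


def check_mounts(fsmounts, mounted):
    """Return the sorted list of expected mountpoints that are not mounted."""
    return sorted(set(expected_mounts(fsmounts)) - set(mounted))


if __name__ == "__main__":
    fsmounts = [("/home", "defaults"), ("/scratch", "rw,noauto"), ("/data", "rw")]
    # Prints ['/data']: /scratch is ignored because it is marked noauto.
    print(check_mounts(fsmounts, ["/home"]))
```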

= Fixed Issues =

  • An issue where, in some cases, CMDaemon crashes when stopping
  • An issue where the interfaces health check can report failure on compute nodes with a ConnectX IB card in UEFI mode as the BOOTIF interface
  • An issue where CMDaemon crashes when parsing the Slurm TRES if the number of nodes for a job is 0
  • An issue where CMDaemon adds to /etc/chrony.conf, which is not valid for chrony
  • An issue where the slurmctld process of the Slurm WLM may crash on “reconfigure” command by CMDaemon when the node count changes
  • An issue with monitoring data plots consisting of consolidated and raw data sources

== Node Installer ==
= Fixed Issues =

  • An issue where the RDMA settings are not added to the corresponding entries in the /etc/fstab file when using NFS over RDMA
  • Improved InfiniBand network interface name detection, which resolves an issue where the node installer does not recognize certain udev persistent device names as InfiniBand devices
  • An issue where, in some cases, the bootif_detect script is unable to detect the correct InfiniBand (IB) device when there are multiple IB interfaces
  • An issue where the disks script can fail to assemble an NVMe-member RAID correctly when the node is using SKIP/NOSYNC install modes
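
As an illustration of the NFS over RDMA fix in the list above, the following Python sketch appends RDMA mount options to NFS entries in fstab format. It assumes the commonly used rdma and port=20049 mount options; the exact options written by the node installer are not stated in these notes, and the function name is hypothetical.

```python
def add_rdma_options(fstab_line, port=20049):
    """Add NFS-over-RDMA mount options to an NFS fstab entry.

    Non-NFS entries, comments, and short or blank lines are returned unchanged.
    The option set (rdma, port=20049) is an assumption based on common usage.
    """
    fields = fstab_line.split()
    if len(fields) < 4 or fields[0].startswith("#"):
        return fstab_line
    if fields[2] not in ("nfs", "nfs4"):
        return fstab_line
    options = fields[3].split(",")
    if "rdma" not in options and "proto=rdma" not in options:
        options.append("rdma")
    if not any(o.startswith("port=") for o in options):
        options.append(f"port={port}")
    fields[3] = ",".join(options)
    return " ".join(fields)


if __name__ == "__main__":
    print(add_rdma_options("master:/home /home nfs rw,hard 0 0"))
```

The function is idempotent: applying it to an entry that already carries the RDMA options leaves the entry unchanged.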

== Cluster Tools ==
= Improvements =

  • Added cm-cmd-ports utility for modifying the CMDaemon HTTPS port

= Fixed Issues =

  • An issue where cmha dbreclone may fail if the mysql DB dump contains special characters

== Machine Learning ==
= New Features =

  • Introduced ML package cm-cudnn8.5-cuda11.8

== cm-scale ==
= New Features =

  • Improvements in the backfill algorithm of the Auto Scaler when jobs requesting too many resources are already in the queue

== cm-wlm-setup ==
= Improvements =

  • cm-wlm-setup now installs enroot 3.4.1 with Slurm

== cmsh ==
= Improvements =

  • Allow the option to set “before” and “after” time limits when filtering or getting statistics for WLM jobs
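
The effect of “before” and “after” time limits on a job listing can be sketched in Python as follows. This is not cmsh syntax; the job record layout, the choice of the start time as the filtered field, and the inclusive bounds are all assumptions made for illustration.

```python
from datetime import datetime


def filter_jobs(jobs, after=None, before=None):
    """Keep jobs whose start time lies within the optional time window.

    jobs is a list of dicts with a "start" datetime (hypothetical layout).
    after and before are optional datetime bounds; None means unbounded.
    Bounds are treated as inclusive in this sketch.
    """
    selected = []
    for job in jobs:
        start = job["start"]
        if after is not None and start < after:
            continue
        if before is not None and start > before:
            continue
        selected.append(job)
    return selected


if __name__ == "__main__":
    jobs = [
        {"id": 1, "start": datetime(2023, 5, 1, 9, 0)},
        {"id": 2, "start": datetime(2023, 5, 2, 9, 0)},
        {"id": 3, "start": datetime(2023, 5, 3, 9, 0)},
    ]
    window = filter_jobs(jobs, after=datetime(2023, 5, 2), before=datetime(2023, 5, 3))
    print([job["id"] for job in window])
```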

== jupyter ==
= Improvements =

  • Allow the option to sort the Slurm jobs table in the Jupyter web interface

== slurm23.02 ==
= Improvements =

  • Updated Slurm 23.02 to 23.02.2