Release Notes for NVIDIA Bright Cluster Manager 9.0-20

== General ==
=New Features=

  • Added cm-gcc12 package
  • Added CUDA 11.8 packages
  • Added CUDA 12.0 packages
  • Added mlnx-ofed57 packages
  • Updated cm-nvhpc to 23.1
  • Updated DCGM to 3.1
  • Updated lua to 5.4.4 (CVE-2022-28805)
  • Updated mlnx-ofed49 to 4.9-6.0.6.0
  • Updated mlnx-ofed54 to 5.4-3.5.8.0
  • Updated openssl to 1.1.1t
  • Account for loading the nvidia-peermem kernel module in the cuda-driver installation and service scripts
  • Allow the number of CPU cores used for building mlnx-ofed to be specified via the environment variable RPM_BUILD_NCPUS

=Fixed Issues=

  • An issue where the CM Lmod package may be replaced by an Lmod package from EPEL
  • An issue where installing an mlnx-ofedXY package may remove the rdma-core package
  • An issue where the mlnx-ofed installation scripts do not add the libibverbs packages to the exclude list for the operating system package manager, which can break OFED compatibility when “dnf update” is later run to update the packages
  • An issue where the mlnx-ofed installation scripts do not add the kernel-core packages to the exclude lists for the operating system package manager, which prevents the OFED kernel modules from loading when the kernel is accidentally upgraded
  • An issue where cm-chroot-sw-img is unable to execute a shell in the software image when the user has defined a $SHELL environment variable for a shell that is not present in the software image
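
The $SHELL issue for cm-chroot-sw-img above comes down to checking a shell path against the software image rather than against the host. The following is a minimal sketch of that idea, not the actual cm-chroot-sw-img code; the helper name pick_shell, its parameters, and the fallback list are assumptions made for illustration.

```python
import os


def pick_shell(image_root, env=None, fallbacks=("/bin/bash", "/bin/sh")):
    """Return a shell path that exists inside the software image, or None.

    Hypothetical helper: prefers $SHELL, but only if that path exists
    under image_root; otherwise tries the fallbacks in order.
    """
    env = os.environ if env is None else env
    candidates = []
    if env.get("SHELL"):
        candidates.append(env["SHELL"])
    candidates.extend(fallbacks)
    for shell in candidates:
        # Resolve the shell path relative to the image root, not the host root.
        inside_image = os.path.join(image_root, shell.lstrip("/"))
        if os.path.isfile(inside_image) and os.access(inside_image, os.X_OK):
            return shell
    return None
```

With a user whose $SHELL points to a shell that is not installed in the image, a check of this kind falls back to a shell that the image does provide instead of failing.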

== CMDaemon ==
=New Features=

  • Support for Slurm 22.05
  • Support for PBS Pro 2022
  • Added cm-package-release-info tool, which can be used to determine the Bright 9.X-Y version of the installed packages
  • Allow one NTP server to be preferred over another; with head node HA, one of the head nodes is selected as the preferred NTP server
  • Exclude all /snap/.* mount points from the “procmounts” sampler, which otherwise create unnecessary metrics in CMDaemon
  • Copy the cluster.csr.new file to all head nodes when installing a license with install-license

=Fixed Issues=

  • An issue where the CMDaemon mounts health check does not take into account the “noauto” setting for the fsmounts, resulting in an incorrect failure of the health check when noauto fsmounts definitions are present
  • An issue with sampling Kubernetes network metrics which are linked to network interfaces
  • A CMDaemon memory leak when the Slurm placeholders maxnodes value is less than the number of nodes in the queue
  • An issue where the CMDaemon gpfs monitoring script can throw a TypeError exception when writing monitoring extra information messages for CMDaemon
  • Allow the default namespace for Kubernetes users to be chosen via an AdvancedConfig setting. The default now prefers the $user-restricted namespace.
  • An issue where slurmdbd is not automatically started when the Slurm configuration files are frozen in cmd.conf
  • An issue with removing OSDs from a Ceph cluster if the corresponding OSD nodes are down
  • In an HA setup, use the shared head node IP as the gateway in the dhcpd.conf file for the compute nodes
  • An issue where the version config file timestamps (versionconfigfiles=yes) are always set to the Unix epoch (1970)
  • An issue where the CMDaemon log file may contain a large number of “MysqlEngine::save, stopped before adding” information messages after an HA takeover
  • An issue where the Slurm slurmctld service may crash on a “reconfigure” command issued by CMDaemon when the node count changes
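
The “noauto” handling for the CMDaemon mounts health check mentioned in this section can be pictured with a short sketch. This is an illustration under assumptions, not CMDaemon's implementation: the function name missing_mounts and the (mountpoint, options) input format are hypothetical, with options written as a comma-separated string in the style of fstab.

```python
def missing_mounts(expected, mounted):
    """Return mount points that should be mounted but are not.

    expected: list of (mountpoint, options) tuples, options as in fstab
    mounted:  set of currently mounted mount points
    """
    missing = []
    for mountpoint, options in expected:
        # Entries marked noauto are not mounted automatically,
        # so their absence is not treated as a failure.
        if "noauto" in options.split(","):
            continue
        if mountpoint not in mounted:
            missing.append(mountpoint)
    return missing
```

A health check built this way reports a failure only for definitions that are expected to be mounted automatically.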

== Cluster Tools ==
=Improvements=

  • Automatically detect environmental proxies in cm-diagnose

== Machine Learning ==
=New Features=

  • Introduced ML package cm-cub-cuda11.7
  • Introduced ML package cm-cudnn8.5-cuda11.7
  • Introduced ML package cm-cutensor-cuda11.7
  • Introduced ML package cm-fastai2--cuda11.7-
  • Introduced ML package cm-gpytorch--cuda11.7-
  • Introduced ML package cm-ml-distdeps-cuda11.7
  • Introduced ML package cm-ml-pythondeps--cuda11.7-
  • Introduced ML package cm-nccl2-cuda11.7-gcc9
  • Introduced ML package cm-onnx-pytorch--cuda11.7-
  • Introduced ML package cm-opencv4--cuda11.7-
  • Introduced ML package cm-pytorch-cuda11.7
  • Introduced ML package cm-pytorch-extra--cuda11.7-
  • Introduced ML package cm-tensorflow2--cuda11.7-
  • Introduced ML package cm-xgboost--cuda11.7-
  • Updated cm-fastai2-* to 2.7.0
  • Updated cm-gcc9-* to 9.5.0
  • Updated cm-gpytorch-* to 1.9.0
  • Updated cm-openmpi4--cuda- to v4.1.4
  • Updated cm-pytorch-* to 1.13.0
  • Updated cm-tensorflow2-* to 2.10.0
  • Updated cm-tensorflow2-* to 2.11.0
  • Updated cm-xgboost-* to 1.6.2
  • Deprecated cm-openmpi4-cuda11.2-ofed47-gcc9 and cm-openmpi4-cuda11.2-ofed51-gcc9 packages
  • Deprecated cm-chainer-py39-cuda11.2-gcc9
  • Deprecated ML packages for CUDA 11.2

== cm-clone-install ==
=New Features=

  • Do not include loop device mounts (if present) when generating the disk setup XML for the head node for cloning when using cm-clone-install

=Fixed Issues=

  • Added wekafs, lustre, gpfs, and other parallel file systems to the list of excluded file systems for cm-clone-install
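
The two cm-clone-install changes above both amount to filtering the mount list before the head node is cloned. The sketch below shows one way such a filter could look, assuming input in the /proc/mounts format; it is not the cm-clone-install code. The function name clonable_mounts is hypothetical, and EXCLUDED_FS_TYPES holds only the three file systems named in these notes, not the full list used by the tool.

```python
# Illustrative subset: only the file system types named in these notes.
EXCLUDED_FS_TYPES = {"wekafs", "lustre", "gpfs"}


def clonable_mounts(proc_mounts_text, excluded=EXCLUDED_FS_TYPES):
    """Return mount points that remain after the exclusion rules are applied."""
    keep = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype = fields[:3]
        if fstype in excluded:
            continue  # parallel file systems are not cloned
        if device.startswith("/dev/loop"):
            continue  # loop device mounts are left out of the disk setup
        keep.append(mountpoint)
    return keep
```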

== cm-kubernetes-setup ==
=Fixed Issues=

  • An issue with the default Kubernetes user role binding template which may allow incorrect use of generated ClusterRoleBindings by Subjects

== cm-scale ==
=Fixed Issues=

  • An issue where cm-scale tries to match Kubernetes pod or job labels to node labels. This matching is now disabled by default.

== cm-setup ==
=Fixed Issues=

  • Make cm-*-setup configuration file permissions more restrictive
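
As a general illustration of restrictive configuration file permissions, the sketch below creates a file that only its owner can read and write. The mode 0o600 and the helper name write_private_config are assumptions; these notes do not state which permissions the cm-*-setup tools actually apply.

```python
import os


def write_private_config(path, text):
    """Write text to path so that only the owner can read or write it."""
    # Create the file with mode 0o600 from the start, so it is never
    # briefly readable by other users.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    # Tighten the mode explicitly in case the file already existed
    # with wider permissions.
    os.chmod(path, 0o600)
```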

== cmsh ==
=New Features=

  • Allow the --start and --end arguments in the rangequery command to be specified as date/time stamps

== jupyter ==
=Fixed Issues=

  • An issue, in some cases, with writing the cookie files in the users’ home directories with specific ACLs, which can result in a JupyterHub login 500 error

== openpbs20 ==
=Fixed Issues=

  • An issue with updating the pbspro/openpbs hooks when a new pbspro/openpbs package is installed

== openpbs22.05 ==
=Fixed Issues=

  • An issue with updating the pbspro/openpbs hooks when a new pbspro/openpbs package is installed

== pbspro2020 ==
=Fixed Issues=

  • An issue with updating the pbspro/openpbs hooks when a new pbspro/openpbs package is installed

== pbspro2021 ==
=Fixed Issues=

  • An issue with updating the pbspro/openpbs hooks when a new pbspro/openpbs package is installed

== pbspro2022 ==
=Fixed Issues=

  • An issue with updating the pbspro/openpbs hooks when a new pbspro/openpbs package is installed

== slurm22.05 ==
=Improvements=

  • Updated Slurm to 22.05.8