Release Notes for NVIDIA Bright Cluster Manager 9.1-16

== General ==
=New Features=

  • Added cm-hpcx-mlnx-ofed5-cuda11 packages
  • Added CUDA 12.0 packages

=Improvements=

  • Updated python2-pyasn1 to 0.4.2-3.2.1
  • Updated cm-nvhpc to 23.1
  • Updated mlnx-ofed49 to 4.9-
  • Updated cm-nvhpc to 22.11
  • Account for loading the nvidia-peermem kernel module in the cuda-driver installation and service scripts

=Fixed Issues=

  • An issue where the OFED installation scripts did not add the kernel-core packages to the exclude lists of the operating system package manager, which could prevent the OFED kernel modules from loading after an accidental kernel upgrade
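
To illustrate why the exclude list entry matters, the following minimal sketch checks package names against shell-style exclude patterns, similar in spirit to the exclude entries of a package manager. The pattern values and the function name `is_excluded` are assumptions made for illustration; they are not the entries written by the OFED installation scripts.

```python
from fnmatch import fnmatch


def is_excluded(package, exclude_patterns):
    """Return True if the package name matches any exclude pattern."""
    return any(fnmatch(package, pattern) for pattern in exclude_patterns)


# Hypothetical exclude lists before and after the fix.
before_fix = ["kernel", "kernel-devel*"]
after_fix = before_fix + ["kernel-core*"]

for label, patterns in (("before", before_fix), ("after", after_fix)):
    print(label, "kernel-core excluded:", is_excluded("kernel-core", patterns))
```

A package that is not covered by any pattern remains eligible for upgrade, which is how a kernel update could slip through and leave the OFED modules built for an older kernel.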

== CMDaemon ==
=New Features=

  • Added cm-package-release-info tool, which can be used to determine the Bright 9.X-Y version of the installed packages

=Improvements=

  • Allow the option to specify “inet manual” for network interfaces on the Ubuntu base distribution (set revision inet=manual)
  • Allow the option “ResolveToExternalName=yes” to be configured per node or category
  • Allow a special OID for the PDU load to be specified via the revision property
  • In an HA setup, use the shared head node IP as the gateway in the dhcpd.conf file for the compute nodes
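
As a rough illustration of the HA gateway item above, the sketch below renders a minimal ISC dhcpd subnet declaration whose `option routers` entry points at a shared address. The addresses and the function name `render_subnet` are hypothetical, and the output is not the dhcpd.conf that CMDaemon generates.

```python
def render_subnet(subnet, netmask, gateway):
    """Render a minimal dhcpd subnet declaration with a default gateway."""
    return (
        f"subnet {subnet} netmask {netmask} {{\n"
        f"  option routers {gateway};\n"
        f"}}\n"
    )


# Hypothetical value standing in for the shared head node IP.
SHARED_HEAD_IP = "10.141.255.254"

print(render_subnet("10.141.0.0", "255.255.0.0", SHARED_HEAD_IP))
```

Pointing the gateway at the shared address rather than at one head node's own address presumably keeps the compute nodes' default route valid after a failover.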

=Fixed Issues=

  • An issue with sampling Kubernetes network metrics which are linked to network interfaces
  • An issue with the automatic switch of the monitoring node when the passive head node goes down for a prolonged period of time
  • A CMDaemon memory leak when the Slurm placeholders maxnodes value is less than the number of nodes in the queue
  • A rare crash in the monitoring aggregate sampler task
  • An issue where CMDaemon may restart the LDAP service when CMDaemon is restarted
  • Ensure that CMDaemon applies the GPU settings after a reboot, even when the GPU takes longer to initialize
  • In 9.1-14, a change in the implementation of CMDaemon introduced a different behavior when generating the Kubernetes config files for users with rolebindings, where the current namespace for the user became the first namespace that the user has a binding for. The behavior of CMDaemon is now updated to match the pre-9.1-14 implementation, where the current namespace is the user’s “restricted” namespace
  • An issue where the monitoringdrop command may drop the data only for the head node
  • An issue with creating the LSF configuration when a node is converted from a compute host to a submit-only host
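
The namespace selection change described above for users with rolebindings can be modeled with a small sketch. The function name, the `use_first_binding` flag, and the namespace names are all hypothetical; the code only contrasts the two behaviors and does not reflect how CMDaemon is implemented.

```python
def pick_current_namespace(bound_namespaces, restricted_namespace,
                           use_first_binding=False):
    """Pick the namespace recorded as current in a generated user config.

    use_first_binding=True models the behavior described for 9.1-14;
    the default models the restored pre-9.1-14 behavior.
    """
    if use_first_binding and bound_namespaces:
        return bound_namespaces[0]
    return restricted_namespace


bindings = ["team-a", "team-b"]      # hypothetical rolebinding namespaces
restricted = "alice-restricted"      # hypothetical restricted namespace

print(pick_current_namespace(bindings, restricted, use_first_binding=True))
print(pick_current_namespace(bindings, restricted))
```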

== Node Installer ==
=Improvements=

  • Expose the bond options from the node-installer to the finalize script via environment variables

=Fixed Issues=

  • An issue with generating the initramfs with the mkinitrd_cm.sles script on SLES 15 SP3 or SP4 when SLES HPC products are used

== Cluster Tools ==
=Improvements=

  • Automatically detect environmental proxies in cm-diagnose
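
One plausible way to detect proxy settings from the environment is sketched below. The variable list and the function name `detect_proxies` are assumptions, and this is not the logic used by cm-diagnose; it only shows the general idea of reading the conventional proxy variables in either letter case.

```python
import os

# Conventional proxy variables; tools differ in which case they honor.
PROXY_VARS = ("http_proxy", "https_proxy", "no_proxy")


def detect_proxies(environ=None):
    """Collect proxy settings from environment variables, in either case."""
    environ = os.environ if environ is None else environ
    found = {}
    for name in PROXY_VARS:
        value = environ.get(name) or environ.get(name.upper())
        if value:
            found[name] = value
    return found


print(detect_proxies())
```

Passing a dictionary instead of relying on `os.environ` makes the function easy to exercise without changing the real environment.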

=Fixed Issues=

  • An issue with setting up cluster extension to Azure Government (US) cloud

== Machine Learning ==
=New Features=

  • Introduced ML package cm-cub-cuda11.7
  • Introduced ML package cm-fastai2--cuda11.7-
  • Introduced ML package cm-gpytorch--cuda11.7-
  • Introduced ML package cm-ml-pythondeps--cuda11.7-
  • Introduced ML package cm-onnx-pytorch--cuda11.7-
  • Introduced ML package cm-opencv4--cuda11.7-
  • Introduced ML package cm-pytorch-cuda11.7
  • Introduced ML package cm-pytorch-extra--cuda11.7-
  • Introduced ML package cm-tensorflow2--cuda11.7-
  • Introduced ML package cm-xgboost--cuda11.7-
  • Updated cm-fastai2-* to 2.7.0
  • Updated cm-gcc9-* to 9.5.0
  • Updated cm-gpytorch-* to 1.9.0
  • Updated cm-pytorch-* to 1.13.0
  • Updated cm-tensorflow2-* to 2.11.0
  • Updated cm-xgboost-* to 1.6.2

=Changes and Deprecated Features=

  • Deprecated ML packages for CUDA 11.2 and introduced new variants for CUDA 11.7
  • Deprecated cm-openmpi4-cuda11.2-ofed47-gcc9 and cm-openmpi4-cuda11.2-ofed51-gcc9 packages

== cm-clone-install ==
=Fixed Issues=

  • Add wekafs, lustre, gpfs, and other parallel file systems to the list of excluded file systems for cm-clone-install
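
The effect of a file system exclusion list can be sketched as a filter over mount entries. The set below contains only the three types named above, the sample mount lines are invented, and the function name `local_mounts` is hypothetical; cm-clone-install may apply its exclusions differently.

```python
# Only the types named in the note; the real list includes others.
EXCLUDED_FS_TYPES = {"wekafs", "lustre", "gpfs"}


def local_mounts(mount_lines, excluded=EXCLUDED_FS_TYPES):
    """Yield mount points from /proc/mounts-style lines, skipping excluded types."""
    for line in mount_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore malformed lines
        _device, mount_point, fs_type = fields[:3]
        if fs_type not in excluded:
            yield mount_point


# Invented sample in /proc/mounts format: device, mount point, type, options.
sample = [
    "/dev/sda1 / ext4 rw 0 0",
    "10.0.0.5@tcp:/scratch /scratch lustre rw 0 0",
    "gpfs0 /gpfs gpfs rw 0 0",
]

print(list(local_mounts(sample)))
```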

== cm-scale ==
=Fixed Issues=

  • An issue where cm-scale tries to match the labels of Kubernetes pods or jobs to the node’s labels. This matching is now disabled by default
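
Label matching of this kind is commonly a subset check: every label the workload requires must be present on the node with the same value. The sketch below assumes that interpretation; the `match_labels` flag and both function names are hypothetical and do not correspond to a documented cm-scale setting.

```python
def labels_match(required_labels, node_labels):
    """True if every required key/value pair is present in the node's labels."""
    return all(node_labels.get(key) == value
               for key, value in required_labels.items())


def node_is_candidate(required_labels, node_labels, match_labels=False):
    """With matching disabled (the default here), every node is a candidate."""
    if not match_labels:
        return True
    return labels_match(required_labels, node_labels)


pod_labels = {"gpu": "a100"}          # hypothetical workload labels
node_labels = {"zone": "rack1"}       # hypothetical node labels

print(node_is_candidate(pod_labels, node_labels))
print(node_is_candidate(pod_labels, node_labels, match_labels=True))
```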

== cm-wlm-setup ==
=Fixed Issues=

  • An issue where cm-wlm-setup does not set the correct configuration overlay priorities when configuring GPU nodes, resulting in the GPU WLM roles not being assigned to the nodes

== cmsh ==
=Fixed Issues=

  • An issue where the rshell cmsh command reports an error “Failed to connect” when it is used for a software image

== jupyter ==
=Improvements=

  • Allow the option to sort the Slurm jobs table in the Jupyter web interface

=Fixed Issues=

  • An issue, in some cases, with writing the cookie files in the users’ home directories with specific ACLs, which can result in a JupyterHub login 500 error

== licensing ==
=Fixed Issues=

  • Remove the license expiration warning on the secondary head node after installing a new license

== slurm22.05 ==
=Improvements=

  • Updated Slurm to 22.05.8