Release Notes for NVIDIA Bright Cluster Manager 9.2-9

== General ==
=New Features=

  • Added CUDA 12.0
  • Added cm-hpcx-mlnx-ofed5-cuda11 packages, version 2.13.1
  • Updated Kubernetes NVIDIA GPU operator to 22.9.2
  • Updated mlnx-ofed49 to 4.9-6.0.6.0
  • Updated cm-containerd to 1.6.16
  • Updated cm-nvhpc to 22.11
  • Account for loading the nvidia-peermem kernel module in the cuda-driver installation and service scripts
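The nvidia-peermem item follows the usual load-if-absent idiom for kernel modules. A minimal sketch of that check, assuming bash; the helper name is hypothetical and this is not the actual cuda-driver script:

```shell
# Hypothetical helper: decide whether a kernel module still needs to be
# loaded, given the module name and the output of `lsmod`.  Kernel module
# names use underscores, so dashes are normalized first.
needs_load() {
    local module="${1//-/_}"
    local lsmod_out="$2"
    if printf '%s\n' "$lsmod_out" | grep -q "^${module} "; then
        echo "skip"    # already loaded
    else
        echo "load"    # e.g. follow up with: modprobe nvidia-peermem
    fi
}
```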

=Fixed Issues=

  • An issue with the cm-cluster-info script when running on non-COD clusters
  • In some cases, an issue with the Ansible head node installer failing to create the default image due to certificate errors
  • An issue with the OFED installation scripts not adding the kernel-core packages to the exclude lists of the operating system package manager, which prevents the OFED kernel modules from loading when the kernel is inadvertently upgraded
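The exclude list referred to above is the standard package-manager mechanism for pinning the kernel; on RHEL-family systems the result looks roughly like this (an illustrative fragment, not the exact lines the OFED scripts write):

```ini
# /etc/yum.conf or /etc/dnf/dnf.conf -- illustrative fragment
[main]
exclude=kernel kernel-core kernel-modules
```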

== CMDaemon ==
=New Features=

  • Added the option to specify the metric for static network routes via the revision property of the staticroutes entities in cmsh
  • Added the option to reduce the value of the auto-detected memory that CMDaemon writes to slurm.conf
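For context, the memory value in question is the RealMemory parameter that CMDaemon writes per node in slurm.conf. An illustrative fragment with made-up node name and values:

```
# slurm.conf fragment (illustrative values).  RealMemory may now be set
# below the auto-detected total, e.g. to reserve memory for the
# operating system and system daemons.
NodeName=node001 CPUs=64 RealMemory=250000 Gres=gpu:4
```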

=Improvements=

  • Added the gpu_temperature:average metric
  • Added a REST endpoint that accepts POST updates to the user status message of devices
  • Added the ability to aggregate (sum, max, min, or avg) multiple metrics over a given period of time in cmsh
  • Added the option "ResolveToExternalName=yes", which can be configured per node or category
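The status-message endpoint mentioned above accepts a plain JSON POST. A sketch of building the request body for use with curl; the JSON field name and URL path below are hypothetical placeholders, not the documented CMDaemon API:

```shell
# Build the JSON body for a device status update.  The field name
# "userStatusMessage" is a hypothetical placeholder.
status_payload() {
    printf '{"userStatusMessage": "%s"}' "$1"
}

# Hypothetical usage (URL path is illustrative, not the documented API):
#   curl -X POST -H "Content-Type: application/json" \
#        -d "$(status_payload 'draining for maintenance')" \
#        https://head:8081/rest/v1/devices/node001/status
```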

=Fixed Issues=

  • An issue with automatic MIG profile configuration for Slurm where the MIG profiles may not be added to GresTypes and AccountingStorageTRES configuration settings
  • An issue with sampling Kubernetes network metrics which are linked to network interfaces
  • An issue where nodes that are in a provisioning role group are not provisioned exclusively by the provisioners for this group
  • A rare deadlock in ec2settings validate
  • In some cases, an issue with creating a backup in /var/spool/cmd/saved-config-files for configuration files modified outside of CMDaemon
  • A rare crash in the monitoring aggregate sampler task
  • A possible CMDaemon deadlock when cloning multiple cloud nodes with several consecutive cmsh sessions
  • An issue where CMDaemon may restart the LDAP service when CMDaemon is restarted
  • An issue where the bond primary=name directive is not written for the underlying physical network interface on Ubuntu
  • Ensure that the GPU settings are applied by CMDaemon after a reboot, even when the GPU takes longer than normal to initialize

== Bright View ==
=Fixed Issues=

  • An issue with binding a node to multiple GigaIO composable infrastructure I/O boxes when saving the settings with Bright View

== Node Installer ==
=Improvements=

  • Expose the bond options from the node-installer to the finalize script via environment variables
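A finalize script can then pick those values up like any other environment variable. A bash sketch; the BOND_OPTIONS variable name is a hypothetical placeholder, not the documented name exported by the node-installer:

```shell
#!/bin/bash
# Finalize-script fragment (sketch).  BOND_OPTIONS stands in for one of
# the bond variables exported by the node-installer.
render_bond_line() {
    # Turn an options string into an ifcfg-style BONDING_OPTS line.
    local opts="${1:-}"
    [ -n "$opts" ] && printf 'BONDING_OPTS="%s"\n' "$opts"
}

# Example: render_bond_line "$BOND_OPTIONS" >> "$ifcfg_file"
```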

=Fixed Issues=

  • An issue with setting the type in the ifcfg-ibX configuration files, which in some cases can prevent the InfiniBand interfaces from being brought up
  • An issue with generating the initramfs with the mkinitrd_cm.sles script on SLES 15 SP3 or SP4 when SLES HPC products are used

== Cluster Tools ==
=Fixed Issues=

  • An issue with setting up a cluster extension to Azure in the US Gov cloud

== Head Node Installer ==
=Fixed Issues=

  • In some cases, the loopback interface may be included in the list of available network interfaces in the head node installer

== cm-kubernetes-setup ==
=Improvements=

  • Increased the helm timeout from 5 to 10 minutes, and added an additional retry to reduce failures which can occur in exceptionally slow environments
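The retry part of this change can be illustrated with a generic helper (a sketch, not the tool's actual code); with helm, the longer timeout itself is passed as `--timeout 10m`:

```shell
# Run a command up to $1 times, stopping at the first success.
retry() {
    local attempts="$1"; shift
    local i
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
    done
    return 1
}

# Example: retry 2 helm install myrelease mychart --timeout 10m
```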

=Fixed Issues=

  • An issue with installing Kubernetes on DGX systems, which now uses the pre-existing NVIDIA driver packages
  • An issue where cm-kubernetes-setup may not configure containerd with the expected bin directory where the CNI plugins can be found

== cm-scale ==
=Fixed Issues=

  • An issue with starting several compute nodes for Slurm array tasks with only one cm-scale operation, resulting in the compute nodes being started one by one
  • In some cases, an issue with accounting for the available resources for future compute nodes that will be created by cm-scale

== cm-wlm-setup ==
=Fixed Issues=

  • An issue with configuring GPU nodes with cm-wlm-setup, which does not set the correct configuration overlay priorities, resulting in the GPU WLM roles not being assigned to the nodes
  • An issue with the NVIDIA GPU configuration in Slurm when the --nvidia-gpus command line option is passed to cm-wlm-setup

== cmsh ==
=Fixed Issues=

  • A rare crash when cmsh exits after being invoked with the '-c' command line option

== pythoncm ==
=Improvements=

  • Use monotonic clock in pythoncm when waiting with a timeout, to avoid issues that can arise from updates in the operating system time
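The pattern in question can be sketched in plain Python: compute the deadline from time.monotonic(), which is unaffected by system-clock adjustments, instead of time.time(). The helper below is illustrative, not pythoncm's actual code:

```python
import time

def wait_until(predicate, timeout, poll=0.01):
    """Poll predicate() until it returns True or the timeout expires.

    The deadline is based on time.monotonic(), so NTP corrections or
    manual clock changes cannot shorten or extend the wait.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll)
    return predicate()
```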

== slurm ==
=Fixed Issues=

  • An issue with upgrading Slurm when pyxis is configured which can overwrite the enroot.conf file and prevent Slurm jobs from running

== slurm22.05 ==
=Improvements=

  • Upgraded Slurm to 22.05.7