Release notes for Bright 9.2-9
== General ==
=New Features=
- Added CUDA 12.0
- Added cm-hpcx-mlnx-ofed5-cuda11 packages version 2.13.1
- Updated Kubernetes NVIDIA GPU operator to 22.9.2
- Updated mlnx-ofed49 to 4.9-6.0.6.0
- Updated cm-containerd to 1.6.16
- Updated cm-nvhpc to 22.11
- Account for loading the nvidia-peermem kernel module in the cuda-driver installation and service scripts
=Fixed Issues=
- An issue with the cm-cluster-info script when running on non-COD clusters
- In some cases, an issue with the Ansible head node installer failing to create the default image due to certificate errors
- An issue with the OFED installation scripts not adding the kernel-core packages to the exclude lists for the operating system package manager, which prevents the OFED kernel modules from loading when the kernel is accidentally upgraded
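As a rough illustration of the exclude-list fix above, the following Python sketch checks whether a package name is matched by the `exclude` setting of a dnf/yum-style configuration. It is not taken from the OFED installation scripts: the function name `is_excluded` and the sample configuration text are hypothetical, and only an RPM-based distribution with an INI-style `[main]` section is assumed.

```python
import configparser
import fnmatch


def is_excluded(conf_text, package):
    """Return True if `package` matches a glob in the [main] exclude setting."""
    # Interpolation is disabled so that '%' characters in the file are kept literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    raw = parser.get("main", "exclude", fallback="")
    # Entries may be separated by spaces or commas.
    patterns = raw.replace(",", " ").split()
    return any(fnmatch.fnmatchcase(package, pattern) for pattern in patterns)


# Hypothetical configuration text, for illustration only.
SAMPLE_CONF = """\
[main]
gpgcheck=1
exclude=kernel kernel-core* kernel-devel
"""

print(is_excluded(SAMPLE_CONF, "kernel-core"))
print(is_excluded(SAMPLE_CONF, "bash"))
```

A check of this kind could be run after an OFED installation to confirm that an accidental kernel upgrade is blocked by the package manager.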
== CMDaemon ==
=New Features=
- Allow the option to specify the metric for static network routes in the revision property of the staticroutes entities in cmsh
- Allow the option to reduce the value of the auto-detected memory which is written by CMDaemon in slurm.conf
=Improvements=
- Added gpu_temperature:average metric
- Added a REST endpoint that accepts POST updates to the user status message of devices
- Added the ability to compute the sum, max, min, or avg of multiple metrics over a given period of time in cmsh
- Allow the option “ResolveToExternalName=yes” to be configured per node or category
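The sum, max, min, and avg item in the list above can be pictured with a small Python sketch. This is not the cmsh implementation: it assumes that samples from several metrics are pooled within a time window and then reduced to a single value, and the names `aggregate` and `REDUCERS`, as well as the sample data layout, are hypothetical.

```python
from statistics import mean

# Map each reduction name to the function that implements it.
REDUCERS = {"sum": sum, "max": max, "min": min, "avg": mean}


def aggregate(series_by_metric, start, end, how):
    """Reduce every sample with start <= timestamp <= end, across all metrics."""
    values = [
        value
        for samples in series_by_metric.values()
        for timestamp, value in samples
        if start <= timestamp <= end
    ]
    if not values:
        return None  # no samples fall inside the window
    return REDUCERS[how](values)


# Hypothetical samples as (timestamp, value) pairs per metric.
data = {
    "metric_a": [(0, 1.0), (10, 3.0)],
    "metric_b": [(5, 2.0), (20, 10.0)],
}
print(aggregate(data, 0, 10, "avg"))
```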
=Fixed Issues=
- An issue with automatic MIG profile configuration for Slurm where the MIG profiles may not be added to GresTypes and AccountingStorageTRES configuration settings
- An issue with sampling Kubernetes network metrics which are linked to network interfaces
- An issue where nodes that are in a provisioning role group are not provisioned exclusively by the provisioners for this group
- A rare deadlock in ec2settings validate
- In some cases, an issue with creating a backup in /var/spool/cmd/saved-config-files for configuration files modified outside of CMDaemon
- A rare crash in the monitoring aggregate sampler task
- A possible CMDaemon deadlock when cloning multiple cloud nodes with several consecutive cmsh sessions
- An issue where CMDaemon may restart the LDAP service when CMDaemon is restarted
- An issue where the bond primary=name directive is not written for the underlying physical network interface on Ubuntu
- Ensure the GPU settings are applied by CMDaemon after a reboot, even when the GPU takes longer than normal to initialize
== Bright View ==
=Fixed Issues=
- An issue with binding a node to multiple GigaIO composable infrastructure I/O boxes when saving the settings with Bright View
== Node Installer ==
=Improvements=
- Expose the bond options from the node-installer to the finalize script via environment variables
=Fixed Issues=
- An issue with setting the type in the ifcfg-ibX configuration files, which in some cases can prevent the InfiniBand interfaces from being brought up
- An issue with generating the initramfs with the mkinitrd_cm.sles script on SLES15 SP3 or SP4 when SLES HPC products are used
== Cluster Tools ==
=Fixed Issues=
- An issue with setting up cluster extension to Azure in the US Gov cloud
== Head Node Installer ==
=Fixed Issues=
- In some cases, the loopback interface may be included in the list of available network interfaces in the head node installer
== cm-kubernetes-setup ==
=Improvements=
- Increased the Helm timeout from 5 to 10 minutes, and added an additional retry to reduce failures that can occur in exceptionally slow environments
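The timeout-plus-retry pattern described above can be sketched in Python as follows. This is an illustration only, not the cm-kubernetes-setup code: the function `run_with_retry`, the constant name, and the default of one retry are assumptions, and a Python one-liner stands in for the real command.

```python
import subprocess
import sys

TIMEOUT_SECONDS = 10 * 60  # 10 minutes, matching the new timeout in the note above


def run_with_retry(command, timeout, retries=1):
    """Run `command`; on a timeout or non-zero exit, retry up to `retries` more times."""
    last_error = None
    for _attempt in range(1 + retries):
        try:
            return subprocess.run(
                command, timeout=timeout, check=True,
                capture_output=True, text=True,
            )
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as exc:
            last_error = exc  # remember the failure and try again
    raise last_error


if __name__ == "__main__":
    # Stand-in command; a real caller would pass its own command line here.
    result = run_with_retry([sys.executable, "-c", "print('ok')"], TIMEOUT_SECONDS)
    print(result.stdout.strip())
```

A longer timeout covers operations that are slow but still progressing, while the retry covers transient failures, which is why the two are combined.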
=Fixed Issues=
- An issue with installing Kubernetes on DGX, which now uses pre-existing NVIDIA driver packages
- An issue where cm-kubernetes-setup may not configure containerd with the expected bin directory where the CNI binaries can be found
== cm-scale ==
=Fixed Issues=
- An issue with starting several compute nodes for Slurm array tasks with only one cm-scale operation, resulting in the compute nodes being started one-by-one
- In some cases, an issue with accounting for the available resources for future compute nodes that will be created by cm-scale
== cm-wlm-setup ==
=Fixed Issues=
- An issue with configuring GPU nodes with cm-wlm-setup, which does not set the correct configuration overlay priorities, resulting in the GPU WLM roles not being assigned to the nodes
- An issue with NVIDIA GPU configuration in Slurm when the --nvidia-gpus command line option is passed to cm-wlm-setup
== cmsh ==
=Fixed Issues=
- A rare crash when cmsh is stopping after being invoked with the ‘-c’ command line option
== pythoncm ==
=Improvements=
- Use monotonic clock in pythoncm when waiting with a timeout, to avoid issues that can arise from updates in the operating system time
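The reason a monotonic clock matters for timeouts can be shown with a short sketch. This is not the pythoncm source: the helper `wait_until` and its parameters are hypothetical. It relies on `time.monotonic()`, which never goes backwards and is unaffected by changes to the system time, so the deadline cannot be shortened or extended by an NTP step or a manual clock change.

```python
import time


def wait_until(predicate, timeout, poll_interval=0.01):
    """Poll `predicate` until it returns True or `timeout` seconds have elapsed."""
    # The deadline is computed on the monotonic clock, not on wall-clock time.
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False  # timed out
        time.sleep(min(poll_interval, remaining))


# Usage: wait up to 50 ms for a condition that never becomes true.
print(wait_until(lambda: False, timeout=0.05))
```

Had the deadline been based on `time.time()`, a backwards adjustment of the system clock during the wait could make the call block far longer than the requested timeout.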
== slurm ==
=Fixed Issues=
- An issue with upgrading Slurm when pyxis is configured, which can overwrite the enroot.conf file and prevent Slurm jobs from running
== slurm22.05 ==
=Improvements=
- Upgraded Slurm to 22.05.7