Release Notes for NVIDIA Bright Cluster Manager 9.2-9

== General ==
=New Features=

  • Added CUDA 12.0
  • Added cm-hpcx-mlnx-ofed5-cuda11 packages, version 2.13.1
  • Updated Kubernetes NVIDIA GPU operator to 22.9.2
  • Updated mlnx-ofed49 to 4.9-6.0.6.0
  • Updated cm-containerd to 1.6.16
  • Updated cm-nvhpc to 22.11
  • Account for loading the nvidia-peermem kernel module in the cuda-driver installation and service scripts
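The nvidia-peermem item follows the usual load-if-absent idiom for kernel modules. A minimal sketch of that check, assuming bash; the helper name is hypothetical and this is not the actual cuda-driver script:

```shell
# Hypothetical helper: decide whether a kernel module still needs to be
# loaded, given the module name and the output of `lsmod`.  Kernel module
# names use underscores, so dashes are normalized first.
needs_load() {
    local module="${1//-/_}"
    local lsmod_out="$2"
    if printf '%s\n' "$lsmod_out" | grep -q "^${module} "; then
        echo "skip"    # already loaded
    else
        echo "load"    # e.g. follow up with: modprobe nvidia-peermem
    fi
}
```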

=Fixed Issues=

  • An issue with the cm-cluster-info script when running on non-COD clusters
  • In some cases, an issue with the Ansible head node installer failing to create the default image due to certificate errors
  • An issue with the OFED installation scripts not adding the kernel-core packages to the exclude lists of the operating system package manager, which prevents the OFED kernel modules from loading when the kernel is inadvertently upgraded
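The exclude list referred to above is the standard package-manager mechanism for pinning the kernel; on RHEL-family systems the result looks roughly like this (an illustrative fragment, not the exact lines the OFED scripts write):

```ini
# /etc/yum.conf or /etc/dnf/dnf.conf -- illustrative fragment
[main]
exclude=kernel kernel-core kernel-modules
```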

== CMDaemon ==
=New Features=

  • Added the option to specify the metric for static network routes via the revision property of the staticroutes entities in cmsh
  • Added the option to reduce the value of the auto-detected memory that CMDaemon writes to slurm.conf
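For context, the memory value in question is the RealMemory parameter that CMDaemon writes per node in slurm.conf. An illustrative fragment with made-up node name and values:

```
# slurm.conf fragment (illustrative values).  RealMemory may now be set
# below the auto-detected total, e.g. to reserve memory for the
# operating system and system daemons.
NodeName=node001 CPUs=64 RealMemory=250000 Gres=gpu:4
```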

=Improvements=

  • Added the gpu_temperature:average metric
  • Added a REST endpoint that accepts POST updates to the user status message of devices
  • Added the ability to aggregate (sum, max, min, or avg) multiple metrics over a given period of time in cmsh
  • Added the option "ResolveToExternalName=yes", which can be configured per node or category
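The status-message endpoint mentioned above accepts a plain JSON POST. A sketch of building the request body for use with curl; the JSON field name and URL path below are hypothetical placeholders, not the documented CMDaemon API:

```shell
# Build the JSON body for a device status update.  The field name
# "userStatusMessage" is a hypothetical placeholder.
status_payload() {
    printf '{"userStatusMessage": "%s"}' "$1"
}

# Hypothetical usage (URL path is illustrative, not the documented API):
#   curl -X POST -H "Content-Type: application/json" \
#        -d "$(status_payload 'draining for maintenance')" \
#        https://head:8081/rest/v1/devices/node001/status
```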

=Fixed Issues=

  • An issue with automatic MIG profile configuration for Slurm where the MIG profiles may not be added to GresTypes and AccountingStorageTRES configuration settings
  • An issue with sampling Kubernetes network metrics which are linked to network interfaces
  • An issue where nodes that are in a provisioning role group are not provisioned exclusively by the provisioners for this group
  • A rare deadlock in ec2settings validate
  • In some cases, an issue with creating a backup in /var/spool/cmd/saved-config-files for configuration files modified outside of CMDaemon
  • A rare crash in the monitoring aggregate sampler task
  • A possible CMDaemon deadlock when cloning multiple cloud nodes with several consecutive cmsh sessions
  • An issue where CMDaemon may restart the LDAP service when CMDaemon is restarted
  • An issue where the bond primary=name directive is not written for the underlying physical network interface on Ubuntu
  • Ensure that the GPU settings are applied by CMDaemon after a reboot, even when the GPU takes longer than normal to initialize

== Bright View ==
=Fixed Issues=

  • An issue with binding a node to multiple GigaIO composable infrastructure I/O boxes when saving the settings with Bright View

== Node Installer ==
=Improvements=

  • Expose the bond options from the node-installer to the finalize script via environment variables
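A finalize script can then pick those values up like any other environment variable. A bash sketch; the BOND_OPTIONS variable name is a hypothetical placeholder, not the documented name exported by the node-installer:

```shell
#!/bin/bash
# Finalize-script fragment (sketch).  BOND_OPTIONS stands in for one of
# the bond variables exported by the node-installer.
render_bond_line() {
    # Turn an options string into an ifcfg-style BONDING_OPTS line.
    local opts="${1:-}"
    [ -n "$opts" ] && printf 'BONDING_OPTS="%s"\n' "$opts"
}

# Example: render_bond_line "$BOND_OPTIONS" >> "$ifcfg_file"
```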

=Fixed Issues=

  • An issue with setting the type in the ifcfg-ibX configuration files, which in some cases can prevent the InfiniBand interfaces from being brought up
  • An issue with generating the initramfs with the mkinitrd_cm.sles script on SLES 15 SP3 or SP4 when SLES HPC products are used

== Cluster Tools ==
=Fixed Issues=

  • An issue with setting up a cluster extension to Azure in the US Gov cloud

== Head Node Installer ==
=Fixed Issues=

  • In some cases, the loopback interface may be included in the list of available network interfaces in the head node installer

== cm-kubernetes-setup ==
=Improvements=

  • Increased the helm timeout from 5 to 10 minutes, and added an additional retry to reduce failures which can occur in exceptionally slow environments
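The retry part of this change can be illustrated with a generic helper (a sketch, not the tool's actual code); with helm, the longer timeout itself is passed as `--timeout 10m`:

```shell
# Run a command up to $1 times, stopping at the first success.
retry() {
    local attempts="$1"; shift
    local i
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
    done
    return 1
}

# Example: retry 2 helm install myrelease mychart --timeout 10m
```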

=Fixed Issues=

  • An issue with installing Kubernetes on DGX systems, which now uses the pre-existing NVIDIA driver packages
  • An issue where cm-kubernetes-setup may not configure containerd with the expected bin directory where the CNI plugins can be found

== cm-scale ==
=Fixed Issues=

  • An issue with starting several compute nodes for Slurm array tasks with only one cm-scale operation, resulting in the compute nodes being started one by one
  • In some cases, an issue with accounting for the available resources for future compute nodes that will be created by cm-scale

== cm-wlm-setup ==
=Fixed Issues=

  • An issue with configuring GPU nodes with cm-wlm-setup, which does not set the correct configuration overlay priorities, resulting in the GPU WLM roles not being assigned to the nodes
  • An issue with the NVIDIA GPU configuration in Slurm when the --nvidia-gpus command line option is passed to cm-wlm-setup

== cmsh ==
=Fixed Issues=

  • A rare crash when cmsh exits after being invoked with the '-c' command line option

== pythoncm ==
=Improvements=

  • Use monotonic clock in pythoncm when waiting with a timeout, to avoid issues that can arise from updates in the operating system time
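The pattern in question can be sketched in plain Python: compute the deadline from time.monotonic(), which is unaffected by system-clock adjustments, instead of time.time(). The helper below is illustrative, not pythoncm's actual code:

```python
import time

def wait_until(predicate, timeout, poll=0.01):
    """Poll predicate() until it returns True or the timeout expires.

    The deadline is based on time.monotonic(), so NTP corrections or
    manual clock changes cannot shorten or extend the wait.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll)
    return predicate()
```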

== slurm ==
=Fixed Issues=

  • An issue with upgrading Slurm when pyxis is configured which can overwrite the enroot.conf file and prevent Slurm jobs from running

== slurm22.05 ==
=Improvements=

  • Upgraded Slurm to 22.05.7