Release Notes for NVIDIA Bright Cluster Manager 9.2-7

== General ==
=New Features=

  • Support for Rocky 9 / RedHat Enterprise Linux 9
  • Support for Ubuntu 22.04
  • Support for Kubernetes version 1.22
  • Add CUDA 11.8 packages
  • Update NVIDIA device plugin to v0.12.3 for Kubernetes 1.24
  • Upgrade pyxis to 0.14.0
  • Update mlnx-ofed54 to version 5.4-

=Known issues=

  • On Rocky 9 / RHEL 9, if there are network connectivity issues during the first boot of the head node, or if the head node uses DHCP to configure the external network interface, the bind DNS server on the head node may fail to resolve Internet addresses, returning SERVFAIL and logging “broken trust chain” errors.
  • To work around the issue, the administrator needs to destroy the managed DNSSEC keys with “rndc managed-keys destroy; rndc reconfig” so that trust with the Internet servers can be re-established.
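
The workaround can be run as root on the head node; a minimal sketch using the commands from the note above:

```shell
# Run as root on the head node. Discards bind's stale managed DNSSEC
# trust anchors, then reloads the configuration so trust with the
# Internet servers can be re-established.
rndc managed-keys destroy
rndc reconfig
```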

=Deprecated features=

  • OpenShift integration

== CMDaemon ==

=New Features=

  • Introduce new CMDaemon advanced configuration flags to disable automatic exports of /home and /cm/shared
  • Add options to exclude DCGM metric fields such as DCGM_FI_DEV_FB_USED_PERCENT for old drivers
  • Improve the start up time of CMDaemon on clusters with a slow user database or with a large number of users
  • Set the Slurm node state to DOWN when CMDaemon detects the node is shutting down
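
The new export-related flags belong in CMDaemon's AdvancedConfig list in cmd.conf. A hedged sketch, with hypothetical flag names (the exact names are not given in these notes; check the 9.2 administrator manual):

```
# /cm/local/apps/cmd/etc/cmd.conf on the head node.
# DisableHomeExport / DisableCmSharedExport are placeholder names here,
# not confirmed flag names.
AdvancedConfig = { "DisableHomeExport=1", "DisableCmSharedExport=1" }
```

CMDaemon must be restarted (systemctl restart cmd) for cmd.conf changes to take effect.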

=Fixed Issues=

  • An issue where the cloudjob WLM prolog script can fail and drain the nodes when the cloud compute nodes are booted
  • An issue where cloning a node does not preserve the onnetworkpriority settings for network interfaces
  • An issue with unregistering Kubernetes nodes during head node HA failover when the hostnames contain uppercase characters
  • Ensure the existence of the CUDA DCGM fields before attempting to monitor them, which resolves an issue with CMDaemon not collecting any GPU metrics when older drivers that do not provide all fields are used
  • An issue with removing job queues when using the JobQueue remove pythoncm call
  • An issue where the ib health check script reports “UNKNOWN” if there are VLAN and IB interfaces
  • An issue with updating the DefMemPerCPU in the slurm.conf file when the property is updated in CMDaemon
  • An issue with CMDaemon overwriting the jupyterhub certificates on the head node after a reboot if the path is set to /etc/pki
  • An issue where the power script execution environment does not include the CMD_NODE_INSTALLER_PATH variable, preventing custom power scripts from being able to perform power operations

== Bright View ==
=Fixed Issues=

  • An issue with the WLM wizard requiring a value in the CPU cores input field when configuring GPU settings, which may prevent the user from proceeding to the next configuration page
  • An issue with showing the monitoring data when selecting the “All Health Checks” page in the “Monitoring” section

== Machine Learning ==
=New Features=

  • Update cm-cub-* packages to v1.17.2

== cm-clone-install ==

=New Features=

  • Do not include loop device mounts (if present) when generating the disk setup XML for the head node for cloning when using cm-clone-install

=Fixed Issues=

  • Add wekafs and other parallel file systems in the list of excluded file systems for cm-clone-install

== Kubernetes ==

=Fixed Issues=

  • An issue with deploying the container registry on the passive head node
  • Increased the timeout for Helm operations to avoid installation failures due to slow network connections

== Slurm ==

=New Features=

  • Allow setting up Pyxis on top of an already existing Slurm setup

=Fixed Issues=

  • An issue with starting nodes for multi-node jobs that request more memory per node than the available memory divided by the number of requested nodes
  • An issue where the compute nodes are not updated after the software images are updated when setting up Pyxis/Enroot


== cmsh ==

=New Features=

  • Include the epoch timestamp when using show --verbose
  • Add --update-containers support to the cmsh device foreach command

== Packages ==

  • Upgrade Slurm to 22.05.6
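
The new --update-containers option might be combined with foreach like this; the flag placement is an assumption based on the note above, not a confirmed synopsis:

```shell
# Hypothetical invocation -- assumes --update-containers is accepted by
# "device foreach" as the release note suggests; run from the head node.
cmsh -c "device; foreach --update-containers -n node001..node003 (imageupdate -w)"
```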