Release Notes for NVIDIA Bright Cluster Manager 9.1-15

== General ==
=Improvements=

  • Add CUDA 11.8 packages
  • Add mlnx-ofed57 packages for the Mellanox 5.7 OFED stack
  • Update Ubuntu 20.04 to 20.04.5
  • Upgrade pyxis to 0.14.0
  • Update mlnx-ofed54 to version 5.4-3.5.8.0

=Fixed Issues=

  • Automatically start slurmdbd when Slurm configuration is frozen in cmd.conf
  • An issue where the power script execution environment does not include the CMD_NODE_INSTALLER_PATH variable, which prevents custom power scripts (such as ilo_power.pl) from performing power operations

=Deprecated Features=

  • OpenShift integration

== CMDaemon ==
=Improvements=

  • Exclude all /snap/.* mount points from the “procmounts” sampler, which otherwise creates unnecessary metrics in CMDaemon
  • Copy the file cluster.csr.new to all head nodes during install-license
  • Increase the default for Kubernetes kubelet’s --max-pods from 50 to 110 for new installations
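
The /snap exclusion in the first item above can be pictured as a pattern filter over mount points. The following is a minimal sketch only, not CMDaemon's implementation: the function name `sampled_mount_points` and the idea of parsing /proc/mounts-style text are assumptions made for illustration, and the pattern is taken directly from the wording of the release note.

```python
import re

# Assumed pattern, copied from the release note text; the sampler's real
# configuration is not shown in these notes.
SNAP_MOUNT = re.compile(r"/snap/.*")


def sampled_mount_points(proc_mounts_text):
    """Return mount points from /proc/mounts-style text, skipping /snap/.* entries.

    Hypothetical helper for illustration only.
    """
    kept = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        mount_point = fields[1]  # second field of /proc/mounts is the mount point
        if SNAP_MOUNT.fullmatch(mount_point):
            continue  # snap mounts would only add unnecessary metrics
        kept.append(mount_point)
    return kept
```

Note that a mount point such as /snapshots is kept, because the pattern requires the /snap/ prefix including the trailing slash.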

=Fixed Issues=

  • An issue with removing job queues when using the JobQueue remove pythoncm call
  • An issue with updating the Slurm configuration when the secondary head node is the active head node
  • An issue with the JSON whoami API call returning a username instead of a profile
  • An issue with removing OSDs from a Ceph cluster if the corresponding OSD nodes are down
  • An issue where the version config file timestamps (versionconfigfiles=yes) are always set to the Unix epoch (1970)
  • An issue where a cloud director power off may hang for up to a minute if the node is already off
  • An issue with merging CMDaemon monitoring execution multiplexers into one, which results in only the last multiplexer being taken into account

== Bright View ==
=Fixed Issues=

  • An issue where the main menu is not shown for logged-in users with a read-only profile

== Head Node Installer ==
=Fixed Issues=

  • An issue with head node installations with Lmod where the DefaultModules.lua module file is not created by default, resulting in messages about an empty LMOD_SYSTEM_DEFAULT_MODULES environment variable

== Machine Learning ==
=New Features=

  • Updated cm-cub-* packages to v1.17.2
  • Deprecated ML package cm-chainer-py39-cuda11.2-gcc9
  • Introduced ML package cm-cutensor-cuda11.7
  • Introduced ML package cm-ml-distdeps-cuda11.7
  • Introduced ML package cm-nccl2-cuda11.7-gcc9
  • Introduced ML package cm-cudnn8.5-cuda11.7

=Improvements=

  • Update cm-openmpi4-*-cuda-* packages to v4.1.4

== cm-clone-install ==
=New Features=

  • Do not include loop device mounts (if present) when generating the disk setup XML for the head node for cloning when using cm-clone-install
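
The effect of skipping loop device mounts can be illustrated with a small filter. This is a sketch under stated assumptions, not the cm-clone-install code: the helper name `non_loop_mounts` is hypothetical, and it assumes loop devices are recognized by the usual Linux /dev/loop prefix in /proc/mounts-style input.

```python
def non_loop_mounts(proc_mounts_text):
    """Return (device, mount_point) pairs whose device is not a loop device.

    Hypothetical helper for illustration only; it assumes loop devices
    appear with the conventional /dev/loop prefix.
    """
    pairs = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        device, mount_point = fields[0], fields[1]
        if device.startswith("/dev/loop"):
            continue  # loop mounts are left out of the generated disk setup
        pairs.append((device, mount_point))
    return pairs
```

Only the remaining pairs would then be considered when describing the head node's disk layout.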

== cm-scale ==
=Fixed Issues=

  • An issue with starting nodes for multi-node jobs requesting more memory per node than the available memory divided by the number of requested nodes

== cmsh ==
=Improvements=

  • Add --update-containers support to the cmsh device foreach command

== pbspro2022 ==
=Improvements=

  • Add support for PBS Pro 2022

== slurm21.08 ==
=Fixed Issues=

  • Incorrect path to the failedprejob and allprejob directories, causing the prolog-prejob script to fail

== slurm22.05 ==
=Improvements=

  • Upgrade Slurm to 22.05.6