Release notes for NVIDIA Bright Cluster Manager 9.2-14

== General ==
=New Features=

  • Add cm-list-image-conf-files.py script to list all special files in /cm/conf/
  • Add cuda12.2 packages
  • Add cuda-driver-legacy-470 package to support older datacenter/Tesla GPUs requiring NVIDIA CUDA driver version 470

=Improvements=

  • Preserve files in /cm/images//cm/conf/{node,category}/ while updating images with rsync
  • Remove field for the CPU frequency scaling governor
  • Update cm-openssl package to 3.0.10
  • Update mlnx-ofed58 package to 5.8-3.0.7.0
  • Update mlnx-ofed54 package to 5.4-3.7.5.0
  • Update mlnx-ofed49 package to 4.9-7.1.0.0

=Fixed Issues=

  • Delete duplicate entries in /etc/nginx/nginx.conf

== CMDaemon ==
=Improvements=

  • Allow cm-mig-manage to support GPUs that do not have index = minorID
  • Improve the daily cron script that creates monthly backup files for openldap-servers so that it also includes backups older than 1 year
  • Do not populate status for each node in the environment to avoid multiple slow RPCs
  • Redirect all stdout/stderr from a cmburn test script to a log file
  • Add the --certificate and --key options to the cmsh help
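
The index/minor ID item above can be illustrated with a small sketch. It is not the cm-mig-manage implementation; it only shows why a device node should be looked up through the GPU's minor number rather than derived from its index. The GpuInfo type, the device_paths_by_index helper, and the sample values are hypothetical.

```python
from typing import Dict, List, NamedTuple


class GpuInfo(NamedTuple):
    index: int  # enumeration index of the GPU
    minor: int  # minor number of the /dev/nvidia<minor> device node


def device_paths_by_index(gpus: List[GpuInfo]) -> Dict[int, str]:
    """Map each GPU index to its device node using the minor number."""
    return {gpu.index: f"/dev/nvidia{gpu.minor}" for gpu in gpus}


# Hypothetical host where index and minor number differ (assumed values).
gpus = [
    GpuInfo(index=0, minor=2),
    GpuInfo(index=1, minor=0),
    GpuInfo(index=2, minor=1),
]
paths = device_paths_by_index(gpus)
print(paths[0])
```

Code that assumes /dev/nvidia0 belongs to the GPU with index 0 would address the wrong device on such a host; the explicit mapping avoids that assumption.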

=Fixed Issues=

  • Fix killing jobs on a node when CMDaemon is restarted on that node
  • Update node environment cache when automatically changing FS exports
  • Make image updates on provisioning nodes wait for provisioning operations on other nodes to complete before proceeding
  • Detect xvd* disk in sysinfo
  • Fix help of cmsh cert removerequest command
  • Ensure named is reloaded when network changes are made
  • Fix doPrint call in mounts health check
  • Fix false negative open --failbeforedown when a status value is unchanged
  • Fix typo guage → gauge

== Node Installer ==
=Fixed Issues=

  • Fix booting of compute nodes with separate /usr filesystem

== Cloud ==
=Fixed Issues=

  • Fix various issues with Azure locations caused by Azure API errors
  • Improve support for AWS spot instances

== Kubernetes ==
=Improvements=

  • Update GPU Operator to 23.3.2
  • Update Kyverno to 3.0.4 (due to incompatibility with Kubernetes 1.27.x)

=Fixed Issues=

  • Fix NVIDIA GPU Operator deployment always resulting in NVIDIA packages being installed
  • Update exclude lists for Kubernetes to avoid failures on “grabimage”

== Workload Management ==
=New Features=

  • cm-wlm-setup now installs enroot on login nodes if pyxis is set up

=Improvements=

  • Update slurm23.02 package to 23.02.2
  • Update PMIx to 4.1.3

== Machine Learning ==
=New Features=

  • Add ML package cm-cudnn8.8-cuda*

== Container Registries ==
=Fixed Issues=

  • Generate containerd certificates when a registry mirror is not configured

== Monitoring ==
=New Features=

  • Add support for Grafana 10

=Improvements=

  • Reduce memory usage spike when using PromQL over short timespans
  • Multiply metric value by 100 when displaying % in pythoncm
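
The percentage display change above can be sketched as follows. This is not the pythoncm code; it assumes that a metric with unit % is stored as a fraction between 0 and 1, and the format_metric helper is a hypothetical name used only for illustration.

```python
def format_metric(value: float, unit: str) -> str:
    """Render a metric value; values with unit '%' are scaled by 100."""
    if unit == "%":
        # Assumption: percentages are stored as fractions (0.25 means 25%).
        return f"{value * 100:.1f}%"
    return f"{value:g} {unit}".strip()


print(format_metric(0.25, "%"))
print(format_metric(512, "MiB"))
```

Scaling only at display time leaves the stored value unchanged, so existing data and queries are not affected by the presentation change.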

=Fixed Issues=

  • Fix the Slurm job_gpu_utilization and job_gpu_wasted metric calculations when running GPU processes within sbatch scripts
  • Fix samplenow CPUUsage metric
  • Ensure first data sample of a Prometheus sampler is stored to the database
  • Fix metrics sampling when temperatures are not provided by the Redfish API
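
The Redfish item above concerns BMCs that return no temperature readings. A minimal sketch of tolerant parsing is shown below. It is not the CMDaemon sampler; it assumes a payload shaped like the Redfish Thermal resource (a Temperatures array whose entries carry Name and ReadingCelsius), and the extract_temperatures helper and sample payload are hypothetical.

```python
from typing import Any, Dict


def extract_temperatures(thermal: Dict[str, Any]) -> Dict[str, float]:
    """Return sensor name -> Celsius reading, skipping absent readings."""
    readings: Dict[str, float] = {}
    # A missing or null "Temperatures" member is treated as an empty list.
    for sensor in thermal.get("Temperatures") or []:
        value = sensor.get("ReadingCelsius")
        if value is None:
            continue  # the sensor is listed but reports no reading
        readings[sensor.get("Name", "unknown")] = float(value)
    return readings


# Hypothetical payloads: one sensor without a reading, and an empty resource.
payload = {
    "Temperatures": [
        {"Name": "CPU1 Temp", "ReadingCelsius": 41},
        {"Name": "Inlet Temp", "ReadingCelsius": None},
    ]
}
print(extract_temperatures(payload))
print(extract_temperatures({}))
```

Skipping the absent readings lets the remaining metrics from the same sample be stored instead of failing the whole sampling run.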