Release notes for Bright 9.2-14
== General ==
=New Features=
- Add cm-list-image-conf-files.py script to list all special files in /cm/conf/
- Add cuda12.2 packages
- Add cuda-driver-legacy-470 package to support older datacenter/Tesla GPUs requiring NVIDIA CUDA driver version 470
=Improvements=
- Preserve files in /cm/images//cm/conf/{node,category}/ while updating images with rsync
- Remove field for the CPU frequency scaling governor
- Update cm-openssl package to 3.0.10
- Update mlnx-ofed58 package to 5.8-3.0.7.0
- Update mlnx-ofed54 package to 5.4-3.7.5.0
- Update mlnx-ofed49 package to 4.9-7.1.0.0
=Fixed Issues=
- Delete duplicate entries in /etc/nginx/nginx.conf
== CMDaemon ==
=Improvements=
- Allow cm-mig-manage to support GPUs that do not have index = minorID
- Improve the daily cron script that creates monthly backup files for openldap-servers so that it also includes backups older than 1 year
- Do not populate status for each node in the environment to avoid multiple slow RPCs
- Redirect all stdout/stderr from a cmburn test script to a log file
- Add the --certificate and --key options to cmsh help
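
The cm-mig-manage item above concerns systems where a GPU's enumeration index differs from the minor number of its /dev/nvidiaN device node. The sketch below only illustrates the general idea of looking the minor ID up explicitly instead of assuming it equals the index. The `Gpu` record and the `device_path` helper are hypothetical names chosen for illustration and are not part of cm-mig-manage or CMDaemon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical record for one GPU (not a CMDaemon type)."""
    index: int     # enumeration index as reported by the tooling
    minor_id: int  # minor number of the /dev/nvidiaN device node


def device_path(gpus, index):
    """Return the device node for the GPU with the given index.

    The minor ID is looked up explicitly rather than assumed to be
    equal to the index.
    """
    by_index = {gpu.index: gpu for gpu in gpus}
    return f"/dev/nvidia{by_index[index].minor_id}"


# Example inventory where index and minor ID are swapped (made-up data).
gpus = [Gpu(index=0, minor_id=1), Gpu(index=1, minor_id=0)]
print(device_path(gpus, 0))
```

With the made-up inventory above, index 0 resolves to the device node with minor number 1, which is the case that a plain "index = minorID" assumption would get wrong.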
=Fixed Issues=
- Fix killing jobs on a node when CMDaemon is restarted on that node
- Update node environment cache when automatically changing FS exports
- Make image updates on provisioning nodes wait for provisioning operations on other nodes to complete before proceeding
- Detect xvd* disk in sysinfo
- Fix help of cmsh cert removerequest command
- Ensure named is reloaded when network changes are made
- Fix doPrint call in mounts health check
- Fix false negative open --failbeforedown when a status value is unchanged
- Fix typo guage → gauge
== Node Installer ==
=Fixed Issues=
- Fix booting of compute nodes with separate /usr filesystem
== Cloud ==
=Fixed Issues=
- Fix various issues with Azure locations caused by Azure API errors
- Improve support for AWS spot instances
== Kubernetes ==
=Improvements=
- Update GPU operator to 23.3.2
- Update Kyverno to 3.0.4 (due to incompatibility with Kubernetes 1.27.x)
=Fixed Issues=
- Fix NVIDIA GPU Operator deployment always resulting in NVIDIA packages being installed
- Update exclude lists for Kubernetes to avoid failures on “grabimage”
== Workload Management ==
=New Features=
- cm-wlm-setup now installs enroot on login nodes if pyxis is set up
=Improvements=
- Update slurm23.02 package to 23.02.2
- Update PMIX to 4.1.3
== Machine Learning ==
=New Features=
- Add ML package cm-cudnn8.8-cuda*
== Container Registries ==
=Fixed Issues=
- Generate containerd certificates when a registry mirror is not configured
== Monitoring ==
=New Features=
- Add support for Grafana 10
=Improvements=
- Reduce memory usage spike when using PromQL over short timespans
- Multiply metric value by 100 when displaying % in pythoncm
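
The percentage item above amounts to simple arithmetic: a value stored as a fraction is multiplied by 100 before the % sign is attached. The sketch below illustrates that conversion only. `format_percent` is a hypothetical helper and not a pythoncm function, and it assumes the stored value is a fraction in the range 0 to 1.

```python
def format_percent(value, digits=1):
    """Format a fractional metric value (assumed 0..1) as a percentage string.

    Hypothetical helper for illustration; not part of pythoncm.
    """
    return f"{value * 100:.{digits}f}%"


# A stored value of 0.25 is shown as a percentage rather than as 0.25%.
print(format_percent(0.25))
```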
=Fixed Issues=
- Fix the Slurm job_gpu_utilization and job_gpu_wasted metric calculations when running GPU processes within sbatch scripts
- Fix samplenow CPUUsage metric
- Ensure first data sample of a Prometheus sampler is stored to the database
- Fix metrics sampling when temperatures are not provided by the Redfish API