Release notes for Nvidia Bright Cluster Manager 9.2-14

kwoods · October 3, 2023, 6:23pm

Release notes for Bright 9.2-14

== General ==
=New Features=

Add cm-list-image-conf-files.py script to list all special files in /cm/conf/
Add cuda12.2 packages
Add cuda-driver-legacy-470 package to support older datacenter/Tesla GPUs requiring NVIDIA CUDA driver version 470

=Improvements=

Preserve files in /cm/images//cm/conf/{node,category}/ while updating images with rsync
Remove field for the CPU frequency scaling governor
Update cm-openssl package to 3.0.10
Update mlnx-ofed58 package to 5.8-3.0.7.0
Update mlnx-ofed54 package to 5.4-3.7.5.0
Update mlnx-ofed49 package to 4.9-7.1.0.0
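
As a rough illustration of the image-update improvement above: with rsync, files on the receiving side that match an --exclude pattern are not deleted by --delete (unless --delete-excluded is also given). The sketch below only builds such a command line. It is not taken from the product, and the function name, the source and destination paths, and the exact exclude patterns are assumptions made for illustration.

```python
def build_image_rsync_cmd(src, dest,
                          preserved=("/cm/conf/node/", "/cm/conf/category/")):
    """Return an rsync argv list that mirrors src into dest.

    Hypothetical helper: the preserved patterns are anchored at the
    transfer root, so matching directories under dest are neither
    overwritten nor removed by --delete.
    """
    cmd = ["rsync", "-a", "--delete"]
    for pattern in preserved:
        cmd.append(f"--exclude={pattern}")
    # Trailing slashes make rsync copy directory contents, not the directory itself.
    cmd += [src.rstrip("/") + "/", dest.rstrip("/") + "/"]
    return cmd


# Example with made-up paths; the command is printed, not executed.
print(" ".join(build_image_rsync_cmd("/tmp/new-image", "/cm/images/example-image")))
```

Building the argument list separately from running it makes it easy to inspect the command before pointing it at a real image directory.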

=Fixed Issues=

== CMDaemon ==
=Improvements=

Allow cm-mig-manage to support GPUs that do not have index = minorID
Improve the daily cron script that creates monthly backup files for openldap-servers so that backups older than 1 year are also included
Do not populate status for each node in the environment to avoid multiple slow RPCs
Redirect all stdout/stderr from a cmburn test script to a log file
Add --certificate and --key options to cmsh help

=Fixed Issues=

Fix killing jobs on a node when CMDaemon is restarted on that node
Update node environment cache when automatically changing FS exports
Make image updates on provisioning nodes wait for provisioning operations on other nodes to complete before proceeding
Detect xvd* disk in sysinfo
Fix help of cmsh cert removerequest command
Ensure named gets reloaded when network changes are made
Fix doPrint call in mounts health check
Fix false negative open --failbeforedown when a status value is unchanged
Fix typo guage → gauge

== Node Installer ==
=Fixed Issues=

== Cloud ==
=Fixed Issues=

== Kubernetes ==
=Improvements=

=Fixed Issues=

NVIDIA GPU Operator deployment always results in NVIDIA packages being installed
Update exclude lists for Kubernetes to avoid failures on “grabimage”

== Workload Management ==
=New Features=

=Improvements=

== Machine Learning ==
=New Features=

== Container Registries ==
=Fixed Issues=

== Monitoring ==
=New Features=

=Improvements=

=Fixed Issues=

Fix the Slurm job_gpu_utilization and job_gpu_wasted metric calculations when running GPU processes within sbatch scripts
Fix samplenow CPUUsage metric
Ensure first data sample of a Prometheus sampler is stored to the database
Fix metrics sampling when temperatures are not provided by the Redfish API
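
To illustrate the kind of situation the Redfish fix above addresses: a Redfish Thermal resource normally carries a "Temperatures" array whose entries have a "ReadingCelsius" value, but that array can be absent and individual readings can be null. The following is a minimal sketch of a tolerant parser, not the CMDaemon implementation; the function name and the sample payload are hypothetical.

```python
def extract_temperatures(thermal):
    """Map sensor name to reading in Celsius, skipping missing data.

    `thermal` is a dict decoded from a Redfish Thermal JSON document.
    Sensors without a usable reading are ignored instead of raising.
    """
    readings = {}
    for sensor in thermal.get("Temperatures") or []:
        value = sensor.get("ReadingCelsius")
        if value is None:
            continue  # sensor present but no reading reported
        name = sensor.get("Name") or sensor.get("MemberId") or "unknown"
        readings[name] = float(value)
    return readings


# Made-up payload: one valid sensor, one without a reading.
sample = {
    "Temperatures": [
        {"Name": "CPU1 Temp", "ReadingCelsius": 41},
        {"Name": "Inlet Temp", "ReadingCelsius": None},
    ]
}
print(extract_temperatures(sample))
```

Returning an empty mapping when no temperatures are available lets the remaining metrics from the same endpoint still be sampled.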
