Release Notes for NVIDIA Base Command Manager 10.23.11

kwoods · November 17, 2023, 4:42pm

== General ==
==New Features==

==Improvements==

Changed NVIDIA Container Toolkit default values for accept-nvidia-visible-devices-as-volume-mounts (false → true) and accept-nvidia-visible-devices-envvar-when-unprivileged (true → false)
Updated cuda-driver package to 535.129.03
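
To see whether an existing configuration is affected by the changed NVIDIA Container Toolkit defaults, the two options can be compared against the new values. The following is a minimal sketch: the option names and values come from the note above, while the helper `diff_from_new_defaults` and the sample settings dictionary are hypothetical and are not part of any NVIDIA tool.

```python
# New default values as stated in this release note.
NEW_DEFAULTS = {
    "accept-nvidia-visible-devices-as-volume-mounts": True,
    "accept-nvidia-visible-devices-envvar-when-unprivileged": False,
}


def diff_from_new_defaults(settings):
    """Return {option: (current, new_default)} for options that differ.

    Hypothetical helper for illustration. Options missing from `settings`
    are reported with a current value of None.
    """
    return {
        name: (settings.get(name), default)
        for name, default in NEW_DEFAULTS.items()
        if settings.get(name) != default
    }


# Hypothetical settings that still carry the previous defaults.
old_settings = {
    "accept-nvidia-visible-devices-as-volume-mounts": False,
    "accept-nvidia-visible-devices-envvar-when-unprivileged": True,
}

changed = diff_from_new_defaults(old_settings)
for option, (current, new) in changed.items():
    print(f"{option}: {current} -> {new}")
```

An empty result means the settings already match the new defaults.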

== CMDaemon ==
==New Features==

Added a cmsh command (wlm grid) to create a timelapse view of the jobs that have run
Added a special default gateway value (255.255.255.255) to use the one provided by dhcpd
Added cmsh command to show dhcpd leases
Added Border Gateway Protocol (BGP) overview for Cumulus switches
Added Link Layer Discovery Protocol (LLDP) overview for Cumulus switches
Added bootstrap.pem and signature checks in cm-check-certificates and switched from MD5 to SHA1
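
The special default gateway value (255.255.255.255) listed above can be read as a sentinel meaning "use whatever gateway dhcpd hands out". The sketch below illustrates that idea only. The function `effective_gateway`, its parameters, and the example addresses are assumptions for illustration and do not reflect how CMDaemon implements the feature.

```python
import ipaddress

# Sentinel value named in the release note.
USE_DHCP_GATEWAY = ipaddress.IPv4Address("255.255.255.255")


def effective_gateway(configured, dhcp_provided=None):
    """Pick the gateway to use (hypothetical helper).

    configured:    the default gateway set in the configuration
    dhcp_provided: the gateway offered by dhcpd, or None if there is none
    """
    configured_ip = ipaddress.IPv4Address(configured)
    if configured_ip == USE_DHCP_GATEWAY:
        if dhcp_provided is None:
            raise ValueError("sentinel gateway set, but DHCP provided none")
        return str(ipaddress.IPv4Address(dhcp_provided))
    return str(configured_ip)


# Example addresses are made up for illustration.
print(effective_gateway("255.255.255.255", "10.141.255.254"))
print(effective_gateway("10.141.0.1", "10.141.255.254"))
```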

==Improvements==

Allow nodes to be automatically powered off or reset upon installer failure
Allow devices to be identified by serial in DHCP
Relaxed SSL checks when registering a new Cumulus switch via ZTP
Improved CMDaemon startup speed in HA mode
Prevent multiple identical failover group statuses
Added a flag to allow changing a user home directory to an existing directory
Added a flag that lets pythoncm.cluster perform entity.commit without suffering from update race conditions
Write chrony.conf instead of ntp.conf in node-installer on RHEL9
Allow role exclude list entries for provisioning to be removed using exclude list snippets starting with ‘+’

==Fixed Issues==

Fixed counting of nodes and accelerators towards the license limit
Fixed service status in cmsh of a lite-node
Fixed crash in ArchOSInfo::is_arch_os when cm-config-os-arch is not installed on the head node
Store services added to a lite-node in the database
Fixed cmsh imageupdate --pattern

== Workload Management ==
==New Features==

Automatically configure non-MIG GPUs in Slurm when detected
Updated slurm23.02 packages to version 23.02.6 (CVE-2023-41914)
Added new package pyxis-sources to allow building pyxis in air-gapped environments

==Improvements==

Allow the management of jobs even if one of the nodes has an incorrect configuration in slurm.conf

==Fixed Issues==

Fixed configuring AutoDetect in slurm.conf if GRES is set with addtogresconf=no in the slurm client role
Cleaned up database node entries of Slurm jobs that were requeued
Fixed pyxis epilog failure when unpacked images are shared and user does not specify a container name
Install enroot dependencies on Ubuntu 20.04

== Container Engines ==
==Improvements==

Stopped using deprecated upstream Kubernetes repositories (versions 1.23 and older are no longer available)
Introduced support for RAPIDS Accelerator for Apache Spark in the Jupyter kernel templates

== Monitoring ==
==New Features==

Collect new DCGM metrics: DCGM_FI_DEV_POWER_VIOLATION and DCGM_FI_DEV_THERMAL_VIOLATION
Added ManagedServicesOk health check to lite devices

==Improvements==

Increased the variability and frequency of the ssh2node health check to reduce load on the head nodes
Optimized startup of compute nodes in clusters with a large number of nodes and many monitored jobs
Do not use linear interpolation for health check data, but rather the last known value
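
The difference between linear interpolation and last-known-value lookup matters for health checks because their results are discrete states, so a value halfway between two states has no meaning. The sketch below contrasts the two lookups on a made-up series. The numeric encoding (0 for PASS, 1 for FAIL), the sample timestamps, and both function names are assumptions for illustration, not the monitoring implementation.

```python
from bisect import bisect_right


def last_known_value(samples, t):
    """Value of the most recent sample at or before time t, else None.

    samples: list of (timestamp, value) pairs sorted by timestamp.
    """
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    if i == 0:
        return None  # no sample yet at time t
    return samples[i - 1][1]


def linear_interpolation(samples, t):
    """Linearly interpolated value at time t, else None before the first sample."""
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    if i == 0:
        return None
    if i == len(samples):
        return samples[-1][1]  # past the last sample: hold the final value
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical health check series: PASS (0) at t=0, FAIL (1) at t=60.
samples = [(0, 0), (60, 1)]

print(linear_interpolation(samples, 30))  # 0.5, not a valid health state
print(last_known_value(samples, 30))      # 0, still PASS until the next sample
```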

==Fixed Issues==

Fixed a monitoring bug which prevented new device metrics from being saved to the database if CMDaemon on the head node was restarted right after they were created
Fixed job-metrics in the base-view monitoring tree
