== General ==
== New Features ==
- Added support for SLES15 SP5
== Improvements ==
- Changed NVIDIA Container Toolkit default values for accept-nvidia-visible-devices-as-volume-mounts (false → true) and accept-nvidia-visible-devices-envvar-when-unprivileged (true → false)
- Updated cuda-driver package to 535.129.03
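
Sites that carry a customized toolkit configuration may want to check whether their explicit settings still match the new defaults. The sketch below is illustrative only. It assumes the two options appear as top-level boolean keys in the toolkit's TOML configuration file (commonly /etc/nvidia-container-runtime/config.toml, which is an assumption and not stated in these notes). The function names are hypothetical and are not part of any shipped tool.

```python
# New defaults as stated in the release note above.
NEW_DEFAULTS = {
    "accept-nvidia-visible-devices-as-volume-mounts": True,
    "accept-nvidia-visible-devices-envvar-when-unprivileged": False,
}


def parse_top_level_bools(text):
    """Collect top-level `key = true|false` lines from TOML-like text.

    Deliberately minimal: it stops at the first [table] header and
    ignores comments, so it is not a general TOML parser.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line.startswith("["):
            break  # the keys of interest are assumed to be top-level
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if value in ("true", "false"):
            settings[key] = value == "true"
    return settings


def differs_from_new_defaults(text):
    """Return explicitly set options whose value differs from the new default."""
    found = parse_top_level_bools(text)
    return {
        key: found[key]
        for key, default in NEW_DEFAULTS.items()
        if key in found and found[key] != default
    }


# Hypothetical configuration text, used here instead of reading a real file.
sample = """
accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = false
[nvidia-container-cli]
ldconfig = "@/sbin/ldconfig"
"""

print(differs_from_new_defaults(sample))
```

In the sample, only the volume-mounts option is reported, because its explicit value (false) differs from the new default (true).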
== CMDaemon ==
== New Features ==
- Added a cmsh command (wlm grid) to create a timelapse view of the jobs that have run
- Added a special default gateway value (255.255.255.255) to use the one provided by dhcpd
- Added a cmsh command to show dhcpd leases
- Added Border Gateway Protocol (BGP) overview for Cumulus switches
- Added Link Layer Discovery Protocol (LLDP) overview for Cumulus switches
- Added bootstrap.pem and signature checks in cm-check-certificates and switched from MD5 to SHA1
== Improvements ==
- Allow nodes to be automatically powered off or reset upon installer failure
- Allow devices to be identified by serial number in DHCP
- Relaxed SSL checks when registering a new Cumulus switch via ZTP
- Improved CMDaemon startup speed in HA mode
- Prevent multiple identical failover group statuses
- Added a flag to allow changing a user's home directory to an existing directory
- Added a pythoncm.cluster flag that lets entity.commit proceed without suffering from update race conditions
- Write chrony.conf instead of ntp.conf in node-installer on RHEL9
- Allow role exclude list entries for provisioning to be removed using exclude list snippets starting with ‘+’
== Fixed Issues ==
- Fixed counting of nodes and accelerators towards the license limit
- Fixed the service status shown in cmsh for a lite-node
- Fixed crash in ArchOSInfo::is_arch_os when cm-config-os-arch is not installed on the head node
- Store services added to a lite-node in the database
- Fixed cmsh imageupdate --pattern
== Workload Management ==
== New Features ==
- Automatically configure non-MIG GPUs in Slurm when detected
- Updated slurm23.02 packages to version 23.02.6 (CVE-2023-41914)
- Added new package pyxis-sources to allow building pyxis in air-gapped environments
== Improvements ==
- Allow the management of jobs even if one of the nodes has an incorrect configuration in slurm.conf
== Fixed Issues ==
- Fixed configuring AutoDetect in slurm.conf if GRES is set with addtogresconf=no in the slurm client role
- Cleaned up database node entries of Slurm jobs that were requeued
- Fixed pyxis epilog failure when unpacked images are shared and the user does not specify a container name
- Install enroot dependencies on Ubuntu 20.04
== Container Engines ==
== Improvements ==
- Stopped using deprecated upstream Kubernetes repositories (versions 1.23 and older are no longer available)
- Introduced support for RAPIDS Accelerator for Apache Spark in the Jupyter kernel templates
== Monitoring ==
== New Features ==
- Collect new DCGM metrics: DCGM_FI_DEV_POWER_VIOLATION and DCGM_FI_DEV_THERMAL_VIOLATION
- Added ManagedServicesOk health check to lite devices
== Improvements ==
- Increased the variability and frequency of the ssh2node health check to reduce load on the head nodes
- Optimized startup of compute nodes in clusters with a large number of nodes and many monitored jobs
- Do not use linear interpolation for health check data, but rather the last known value
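
The difference between the two sampling strategies in the last item can be sketched as follows. This is a minimal illustration, not the CMDaemon implementation. The numeric encoding of health check results (0.0 for PASS, 1.0 for FAIL) and both function names are assumptions made for the example.

```python
from bisect import bisect_right

# Hypothetical numeric encoding of health check results.
PASS, FAIL = 0.0, 1.0


def linear_interpolate(samples, t):
    """Value at time t, interpolated linearly between neighbouring samples.

    `samples` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    if i == 0:
        return None  # no sample at or before t
    if i == len(samples):
        return samples[-1][1]  # past the final sample: hold its value
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def last_known_value(samples, t):
    """Value of the most recent sample taken at or before time t."""
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    return samples[i - 1][1] if i > 0 else None


# A check that passes at t=0, fails at t=120 and passes again at t=240.
samples = [(0, PASS), (120, FAIL), (240, PASS)]

print(linear_interpolate(samples, 60))  # 0.5, neither PASS nor FAIL
print(last_known_value(samples, 60))    # 0.0, the PASS recorded at t=0
```

Interpolating between discrete states yields values such as 0.5 that correspond to no real result, which is why holding the last known value suits health check data better.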
== Fixed Issues ==
- Fixed a monitoring bug that prevented new device metrics from being saved to the database if CMDaemon on the head node was restarted right after they were created
- Fixed job-metrics in the base-view monitoring tree