Release Notes for NVIDIA Base Command Manager 10.23.11

== General ==
==New Features==

  • Added support for SLES15 SP5

==Improvements==

  • Changed NVIDIA Container Toolkit default values for accept-nvidia-visible-devices-as-volume-mounts (false → true) and accept-nvidia-visible-devices-envvar-when-unprivileged (true → false)
  • Updated cuda-driver package to 535.129.03
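
As an illustration of the changed NVIDIA Container Toolkit defaults, the following minimal sketch compares a toolkit configuration against the new values listed above. It assumes the two options appear as top-level keys in the toolkit's TOML configuration (commonly /etc/nvidia-container-runtime/config.toml); the function name check_toolkit_config and the SAMPLE text are hypothetical and not part of Base Command Manager.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:  # older interpreters: assumes the tomli backport is installed
    import tomli as tomllib

# New default values as stated in these release notes.
EXPECTED = {
    "accept-nvidia-visible-devices-as-volume-mounts": True,
    "accept-nvidia-visible-devices-envvar-when-unprivileged": False,
}


def check_toolkit_config(toml_text):
    """Return {key: (expected, actual)} for every key that differs from EXPECTED."""
    config = tomllib.loads(toml_text)
    mismatches = {}
    for key, expected in EXPECTED.items():
        actual = config.get(key)  # None if the key is absent
        if actual != expected:
            mismatches[key] = (expected, actual)
    return mismatches


# Hypothetical sample contents; on a real node the text would be read from
# the toolkit configuration file instead.
SAMPLE = """
accept-nvidia-visible-devices-as-volume-mounts = true
accept-nvidia-visible-devices-envvar-when-unprivileged = true
"""

if __name__ == "__main__":
    print(check_toolkit_config(SAMPLE))
```

An empty result means the configuration matches the new defaults; a site that deliberately overrides either option would see it reported here.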

== CMDaemon ==
==New Features==

  • Added a cmsh command (wlm grid) to create a timelapse view of the jobs that have run
  • Added a special default gateway value (255.255.255.255) to use the one provided by dhcpd
  • Added cmsh command to show dhcpd leases
  • Added Border Gateway Protocol (BGP) overview for Cumulus switches
  • Added Link Layer Discovery Protocol (LLDP) overview for Cumulus switches
  • Added bootstrap.pem and signature checks in cm-check-certificates and switched from MD5 to SHA1
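
The special default gateway value can be read as a sentinel: when the configured gateway is 255.255.255.255, the gateway offered by dhcpd is used instead. The sketch below only illustrates that idea and is not CMDaemon's implementation; the function name effective_gateway and the example addresses are hypothetical.

```python
import ipaddress

# Special value named in the release notes: "use the gateway provided by dhcpd".
DHCP_GATEWAY_SENTINEL = ipaddress.IPv4Address("255.255.255.255")


def effective_gateway(configured, dhcp_offered):
    """Pick the gateway to use (hypothetical helper, for illustration only).

    configured   -- gateway string from the cluster configuration
    dhcp_offered -- gateway string offered by dhcpd, or None if none was offered
    """
    if ipaddress.IPv4Address(configured) == DHCP_GATEWAY_SENTINEL:
        # Sentinel set: defer to whatever dhcpd provided (possibly None).
        return dhcp_offered
    # Any other value is treated as an explicit gateway address.
    return configured


if __name__ == "__main__":
    print(effective_gateway("255.255.255.255", "10.141.255.254"))
    print(effective_gateway("10.141.0.1", "10.141.255.254"))
```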

==Improvements==

  • Allow nodes to be automatically powered off or reset upon installer failure
  • Allow devices to be identified by serial in DHCP
  • Relaxed SSL checks when registering a new Cumulus switch via ZTP
  • Improved CMDaemon startup speed in HA mode
  • Prevent multiple identical failover group statuses
  • Added a flag to allow changing a user home directory to an existing directory
  • Added a flag that allows pythoncm.cluster to perform entity.commit without suffering from update race conditions
  • Write chrony.conf instead of ntp.conf in node-installer on RHEL9
  • Allow role exclude list entries for provisioning to be removed using exclude list snippets starting with ‘+’

==Fixed Issues==

  • Fixed counting of nodes and accelerators towards the license limit
  • Fixed service status in cmsh of a lite-node
  • Fixed crash in ArchOSInfo::is_arch_os when cm-config-os-arch is not installed on the head node
  • Store services added to a lite-node in the database
  • Fixed cmsh imageupdate --pattern

== Workload Management ==
==New Features==

  • Automatically configure non-MIG GPUs in Slurm when detected
  • Updated slurm23.02 packages to version 23.02.6 (CVE-2023-41914)
  • Added new package pyxis-sources to allow building pyxis in air-gapped environments

==Improvements==

  • Allow the management of jobs even if one of the nodes has an incorrect configuration in slurm.conf

==Fixed Issues==

  • Fixed configuring AutoDetect in slurm.conf if GRES is set with addtogresconf=no in the slurm client role
  • Cleaned up database node entries of Slurm jobs that were requeued
  • Fixed pyxis epilog failure when unpacked images are shared and user does not specify a container name
  • Install enroot dependencies on Ubuntu 20.04

== Container Engines ==
==Improvements==

  • Stopped using deprecated upstream Kubernetes repositories (versions 1.23 and older are no longer available)
  • Introduced support for RAPIDS Accelerator for Apache Spark in the Jupyter kernel templates

== Monitoring ==
==New Features==

  • Collect new DCGM metrics: DCGM_FI_DEV_POWER_VIOLATION and DCGM_FI_DEV_THERMAL_VIOLATION
  • Added ManagedServicesOk health check to lite devices

==Improvements==

  • Increased the variability and frequency of the ssh2node health check to reduce load on the head nodes
  • Optimized startup of compute nodes in clusters with a large number of nodes and many monitored jobs
  • Do not use linear interpolation for health check data, but rather the last known value
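
To show why the last known value suits health check data better than linear interpolation, the sketch below evaluates both on two samples of a pass/fail check. It is a generic illustration, not the CMDaemon monitoring code; the function names and the 0 = PASS, 1 = FAIL encoding are assumptions made for the example.

```python
from bisect import bisect_right


def last_known_value(samples, t):
    """Return the value of the most recent sample at or before time t.

    samples -- list of (timestamp, value) tuples sorted by timestamp
    """
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    if i == 0:
        return None  # no sample has been taken yet
    return samples[i - 1][1]


def linear_interpolation(samples, t):
    """Return the value at time t on a straight line between neighbouring samples."""
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    if i == 0:
        return None
    if i == len(samples):
        return samples[-1][1]  # past the last sample: nothing to interpolate towards
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical encoding: 0 = PASS, 1 = FAIL. The check passed at t=0 and failed at t=120.
SAMPLES = [(0, 0), (120, 1)]

if __name__ == "__main__":
    print(last_known_value(SAMPLES, 60))      # 0: still the last observed state
    print(linear_interpolation(SAMPLES, 60))  # 0.5: not a valid pass/fail state
```

Halfway between the two samples, interpolation yields a value that corresponds to no actual health check result, whereas the last known value always reports a state that was really observed.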

==Fixed Issues==

  • Fixed a monitoring bug that prevented new device metrics from being saved to the database if CMDaemon on the head node was restarted right after they were created
  • Fixed job-metrics in the base-view monitoring tree