Release Notes for NVIDIA Base Command Manager 10.23.10

== General ==
== New Features ==

  • Added mlnx-ofed23.07 package
  • Added cm-pmix4 package

== Improvements ==

  • Added drainstatus to cm-diagnose
  • Updated cuda-driver package to 535.104.12
  • Updated cm-libprometheus package to 0.47.0
  • Updated cm-openssl package to 3.1.3

== CMDaemon ==
== New Features ==

  • Added advanced config flag DisableRemoteShell to disable all remote shell RPC
  • Added events for Cumulus service management operations

== Improvements ==

  • Added cmsh clone device option to increment IP addresses by values other than 1
  • Allow lite node IP to be set during cmsh device add
  • Display an error when setting an invalid software image in cmsh
  • Update /etc/resolv.conf via netconfig on SLES15 instead of writing file
  • Added the ability to set model/serial number information on new switches (ZTP)
  • Kill active ramdisk create process when software image is removed
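
The clone option for incrementing IP addresses by a value other than 1 can be pictured with a short sketch. This is not cmsh code and says nothing about the actual cmsh option syntax; the function name `stepped_addresses` and the example addresses are assumptions made for illustration only.

```python
import ipaddress


def stepped_addresses(base, count, step=1):
    """Return `count` IPv4 addresses following `base`, spaced `step` apart.

    Hypothetical helper: it only illustrates the arithmetic of assigning
    each clone an address `step` higher than the previous one.
    """
    start = ipaddress.IPv4Address(base)
    # IPv4Address supports integer addition, so octet carry is handled for us.
    return [str(start + step * i) for i in range(1, count + 1)]


if __name__ == "__main__":
    for address in stepped_addresses("10.141.0.1", 3, step=2):
        print(address)
```

With a step of 2, three clones of a device at 10.141.0.1 would receive 10.141.0.3, 10.141.0.5 and 10.141.0.7 under this scheme.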

== Fixed Issues ==

  • Fixed provisioning trigger when an image name starts with the name of another image
  • Allow cm-cmd-ports --get to work without an active cmd
  • Prevent “Reboot required: Interfaces have been modified” event from being shown for a node if the node has a VLAN interface on a Bridge interface that includes a bond interface
  • Fixed cm-burn failing to complete successfully when both the pre and post sections are absent
  • Image updates on provisioning nodes now wait for provisioning operations on other nodes to complete before proceeding
  • Allow a Slurm drain reason to be appended or skipped when a healthcheck fails with the drain action enabled
  • Fixed crash of pythoncm parallel node termination function
  • Fixed an edge case that caused hostlist generation failures when there are 3 numeric fields in the hostname
  • Fixed service management for cm-lite-daemon
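
The fix for image names that start with the name of another image belongs to a familiar class of bugs: matching by plain string prefix. The notes do not describe the actual cause inside CMDaemon, so the sketch below is only an illustration of that general class. Both function names and the example paths are assumptions.

```python
def naive_match(path, image_root):
    """Plain prefix test: also matches sibling images with a longer name."""
    return path.startswith(image_root)


def boundary_match(path, image_root):
    """Prefix test that only matches at a path-component boundary."""
    root = image_root.rstrip("/")
    return path == root or path.startswith(root + "/")


if __name__ == "__main__":
    root = "/cm/images/default-image"          # illustrative image location
    other = "/cm/images/default-image-gpu/etc"  # a different image, similar name
    print(naive_match(other, root))     # the naive test wrongly matches
    print(boundary_match(other, root))  # the boundary-aware test does not
```

Requiring the match to end at a path separator is one simple way to keep two images with similar names from being treated as the same one.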

== cm-scale ==
== Fixed Issues ==

  • Allow terminated cloud nodes to be started when their state is one of the node installer states
  • Terminate useless AWS spot instance requests
  • Fixed the termination of cloud nodes when multiple clone operations are issued in parallel
  • Fixed the startup of nodes by cm-scale when Slurm sets the predicted start time of a job in the future
  • Fixed handling of job arrays whose range runs from 1 to a number with more than one digit
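
For context on the job array fix, the sketch below expands a Slurm-style array specification such as `1-12` into its task indices. It is not the cm-scale implementation, which the notes do not show; the function name `expand_array_spec` is hypothetical, and only the common `start-end`, `:step`, comma-list and `%limit` forms are handled.

```python
def expand_array_spec(spec):
    """Expand a Slurm-style job array spec into a list of task indices.

    Hypothetical helper for illustration. Indices are parsed as integers,
    so a range such as 1-12 is handled the same way as 1-9.
    """
    # Drop the optional "%N" suffix that limits concurrently running tasks.
    spec = spec.split("%", 1)[0]
    indices = []
    for part in spec.split(","):
        rng, _, step = part.partition(":")
        low, _, high = rng.partition("-")
        high = high or low  # a single index such as "7"
        indices.extend(range(int(low), int(high) + 1, int(step or 1)))
    return indices


if __name__ == "__main__":
    print(expand_array_spec("1-12"))
    print(expand_array_spec("1-10:3%2"))
```

Parsing the bounds as integers rather than comparing them as strings avoids the kind of mistake where "12" sorts before "2".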

== Cloud ==
== New Features ==

  • Added support for AWS FSx on Ubuntu for cmjob

== Improvements ==

  • Improved error message when starting a cloud node with incorrect VPC/subnet configuration

== Fixed Issues ==

  • Fixed issue with cm-cloud-storage-setup when using us-east-1 region
  • Prevent cloud instances that are terminated while the cloud director is down from being listed as UP+terminated
  • Fixed starting spot instances after a no-capacity scenario occurs in an availability zone
  • Unfulfilled spot instance requests stay in PENDING state until fulfilled or terminated
  • Store availability zones for networks created by COD or manually, which enables AutoScaler to distribute loads between availability zones in COD deployments

== Kubernetes ==
== New Features ==

  • Added support for NGC token authentication in cm-kubernetes-setup

== Improvements ==

  • The wizard now fails earlier when a step goes wrong (incorrect return code checks previously caused the installer to fail confusingly at later stages)
  • Kubernetes wizard errors will now show more context information where possible
  • Increased timeouts for kubeadm init and clusterctl init operations to effectively handle slow connections

== Fixed Issues ==

  • The add user wizard now uses the BCM user name instead of commonName

== Workload Management ==
== New Features ==

  • Added enroot and enroot+caps packages

== Fixed Issues ==

  • Update the state of AWS spot instances in Slurm when they are terminated outside of BCM

== Container Engines ==
== Improvements ==

  • Improved internal IP detection logic for etcd (similarly to internal IP detection for Kubernetes Calico and Flannel)

== Monitoring ==

== New Features ==

  • Added Prometheus /rules, /alert, and /alertmanagers endpoints
  • Added operstate metrics (operational state, i.e., UP/DOWN) via cm-lite-daemon for Cumulus switches
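
As a rough illustration of how a rules endpoint might be consumed, the sketch below extracts the names of firing alerting rules from a payload shaped like the upstream Prometheus HTTP API rules response. Whether the BCM endpoints return exactly this shape is an assumption, as are the function name `firing_alert_rules` and the sample rule names; no URL or port is implied.

```python
import json


def firing_alert_rules(payload):
    """Return names of alerting rules whose state is "firing".

    Assumes the upstream Prometheus layout:
    {"status": "success", "data": {"groups": [{"rules": [...]}]}}
    """
    names = []
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") == "alerting" and rule.get("state") == "firing":
                names.append(rule.get("name"))
    return names


if __name__ == "__main__":
    # Hand-written sample payload; the rule names are invented.
    sample = json.loads("""
    {"status": "success", "data": {"groups": [{"name": "example", "rules": [
        {"type": "alerting", "name": "SwitchPortDown", "state": "firing"},
        {"type": "alerting", "name": "HighLoad", "state": "inactive"},
        {"type": "recording", "name": "job:load:avg"}
    ]}]}}
    """)
    print(firing_alert_rules(sample))
```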

== Improvements ==

  • Display K/M/G in cmsh for consolidated averages when no unit is set for a metric

== Fixed Issues ==

  • Added support for running the healthcheck with storcli software alongside megacli software

== Cluster on Demand ==
== Improvements ==

  • Improved the display of the EULA when running from docker image
  • Allow CMDaemon to work with cluster-on-demand cluster spanning multiple regions (requires manual setup)

== Base View ==
== Improvements ==

  • Provide notifications in Base View if BCM package updates are available
  • Visualize used and available licensed GPUs in Base View