Simplifying AI Development with NVIDIA Base Command Platform

jwitsoe · December 6, 2022, 5:00pm

Originally published at: https://developer.nvidia.com/blog/simplifying-ai-development-with-base-command-platform/

The NVIDIA Base Command Platform enables an intuitive, fully featured development experience for AI applications. It was built to serve the needs of the internal NVIDIA research and product development teams. Now, it has become an essential method for accessing on-demand compute resources to train neural network models and execute other accelerated computing experiments. Base Command Platform…

jhandzik · December 6, 2022, 9:18pm

Hi! I’m Joe, one of the authors of this blog - we’ve been working with Base Command Platform internally for a while now, and we’re excited to share it outside of NVIDIA! I’m happy to answer any questions that come up - please give the blog a read and let us know what you think, we’d be happy to hear from anyone interested in Base Command Platform!

chip.maguire · September 9, 2023, 12:38pm

What is the scheduler in Base Command Platform and where does it run? Is the cloud-based Base Command Platform pushing requests to a local Slurm (or other manager) running at each of the ACEs?

Based on yesterday’s webinar “Manage Your Multi-Cloud AI Infrastructure on DGX with Base Command Manager” I understand that Bright Cluster Manager is merging with Base Command Manager in release 10 (scheduled for release at the end of this month). In the Base Command Manager that runs on a head node, you can choose different workload managers, such as Slurm. However, the head node is a separate computer from, for example, a DGX H100, and the model seems to be that this head node sits between the external network and the compute nodes. It seems that the new NVIDIA Base Command Manager is mostly about cluster management: the ability to start, monitor, and manage jobs on physical nodes, and the ability to easily deploy new OS images to these physical nodes.

The current “NVIDIA Bright Cluster Manager 9.2” documentation talks about Gigabit Ethernet interfaces on each side of the head node. This seems like a very poor architecture, given the high bandwidth that a DGX H100 can actually achieve (100, 200, and 400 Gbps) per interface. Even with a single A100 it is possible to saturate a 100 Gbps link with a CUDA kernel.
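To put rough numbers on that gap, here is a minimal sketch of the ideal transfer time for a dataset at each of the link speeds mentioned above. The 1 TB dataset size is an assumption chosen purely for illustration, and the calculation ignores protocol overhead, storage speed, and congestion.

```python
# Ideal (line-rate) transfer time for a dataset at several link speeds.
# DATASET_BYTES is an illustrative assumption, not a figure from the documentation.

DATASET_BYTES = 1_000_000_000_000  # 1 TB (decimal), hypothetical dataset size

LINK_SPEEDS_GBPS = [1, 100, 200, 400]  # Gigabit Ethernet vs. the DGX H100 link speeds above


def transfer_seconds(num_bytes: int, link_gbps: float) -> float:
    """Return the ideal transfer time in seconds, ignoring all overhead."""
    bits = num_bytes * 8
    return bits / (link_gbps * 1e9)


if __name__ == "__main__":
    for gbps in LINK_SPEEDS_GBPS:
        secs = transfer_seconds(DATASET_BYTES, gbps)
        print(f"{gbps:>4} Gbps: {secs:10.1f} s ({secs / 60:7.2f} min)")
```

At line rate, the same 1 TB takes about 8,000 seconds (over two hours) on Gigabit Ethernet versus about 20 seconds at 400 Gbps.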

Context: I’m currently trying to understand how to utilize a single DGX H100 in an educational setting where there will be a mix of users, with some wanting to run batch jobs, some running Kubernetes, and some running Jupyter notebooks. Since a lot of these people will be doing AI & data science, NGC containers are really attractive as the way to go forward; so, Base Command Platform seems really attractive. Also, since we have learned that NVIDIA has deprecated VMs on the DGX H100, even these will need to move to containers. So, I’m looking for examples of best practices that might be applicable.

jhandzik · September 13, 2023, 6:08pm

Hi Chip!

Base Command Platform leverages a custom scheduler that happens to be integrated with Kubernetes, but with complex requirements abstracted away to enable straightforward interactions and comprehension for administrators and users alike.

Any cloud-based components that are not directly involved in the core Kubernetes control plane and compute resources simply provide integration with the NGC portal - so in this case, the interactions are pushed to a Kubernetes cluster, but you have the right idea. This cluster could be in a customer datacenter, or in a cloud service provider (as is the case with DGX Cloud, for example).

You have the general idea correct regarding Base Command Manager - the head node facilitates external network communication from the systems and services it deploys (but importantly, not inter-node communication for compute workloads). Base Command Manager provisions a cluster and some basic core services. More complex software such as Base Command Platform can then be deployed on top of that.

With regard to the existing documentation referring to Gigabit Ethernet - I’d need to know exactly what part of the documentation you are referring to in order to provide a high confidence answer. I assume the area of the documentation you are referring to is either describing a network that has no impact on the overall performance of a DGX cluster, or is only describing minimum configuration requirements. Existing BasePOD and SuperPOD requirements for inter-node communication leverage InfiniBand or RoCE Ethernet fabrics, at bandwidth specs that match what you are noting as possible in a DGX H100. And as mentioned above, any non-compute systems are not in the middle of that fabric, so they do not become a bottleneck.

Base Command Platform is indeed an excellent fit for shared infrastructure with many different use cases and workloads. However, a single DGX H100 will still incur the same control plane requirements that are present for larger scale designs - generally, we suggest at least four DGX systems (A100 or H100) for such a deployment. Also, given that you are describing a deployment that contains a single DGX H100, what network bottlenecks are your primary concern? On a single system, the only network interactions I can see as being present would be between the user and the system, or the system and network-attached storage of some kind. Let me know if there is something I haven’t thought of!

chip.maguire · September 15, 2023, 1:34pm

Thanks for your reply!

So one has to have a headend running Base Command Manager and then deploy compute instances with Base Command Platform, or is it possible to deploy Base Command Platform directly on a DGX H100 without Base Command Manager? In this latter case, one would use the custom scheduler you refer to.

When looking at one of the documents about Base Command Manager, it described the headend as being connected to an external network with a Gigabit Ethernet and another Gigabit Ethernet connection to potentially a Gigabit Ethernet switch and then to the compute nodes. In contrast, we have a 100 Gbps Ethernet connection from the DGX H100 to the campus network, and (hopefully) this next week, students will begin to receive/send data for inferencing (directly to/from the GPU using RDMA) via the two 400 Gbps Ethernet interfaces (that you think of as for attaching storage devices). [We are still trying to work out the cabling to do this same thing via the Cedar Fever 400 Gbps interfaces (using RoCE).] Thus, we do not think of the Cedar Fever NICs as just for inter-node communication but rather a way of serving inferencing requests at very high speed and with an aggregate high bandwidth. Research colleagues have shown in RIBOSOME (https://www.usenix.org/system/files/nsdi23-scazzariello.pdf) the ability to do Tbps packet processing.

Sadly, all I have to work with at the moment is a single DGX H100. Although I can probably add a headend node, I don’t see this node doing much, since I do not think repeatedly deploying different configurations on the DGX H100 will make users happy (unless this is part of some daily/weekly schedule).

Thanks in advance for your help and understanding of questions from someone who is just getting started at trying to exploit the DGX H100 node that we have.

chip.maguire · September 19, 2023, 1:08pm

A minor update: the “storage” NICs are running MCX755206AS-692 firmware, so they are each: “NVIDIA ConnectX-7 VPI adapter card, 400Gb/s IB and 200GbE, dual-port QSFP, PCIe 5.0 x16, dual slot, secure boot, no crypto, for Nvidia DGX storage” and will only support 200 Gbps Ethernet. As Ethernet cards, they do in fact work at 200 Gbps.

jhandzik · September 25, 2023, 9:36pm

Hey Chip - sorry for the delay. You caught me between a long weekend and presentation prep. :)

It is not possible today to deploy Base Command Platform directly on a DGX H100 - Base Command Manager lays the groundwork for BCP. It’s an interesting concept, but even if one shrunk the control plane of BCM and BCP to fit on a single H100 system, a high performance filesystem is still part of the requirements for a deployment. It is always a challenge to build a scalable system that also takes into account smaller deployments - I appreciate getting some feedback around your use case and environment!

The network I suspect you are referring to is the externalnet connection between the head node and the wider network - it’s not strictly necessary to do all communication to the DGX systems in a BasePOD or SuperPOD deployment through that external network. There is a high performance internalnet that can be leveraged instead, so long as the internalnet is routable between the client and the DGX system. Externalnet is intended more as a buffer between the local cluster and the outside world (the internet). So, in your case, as long as the internalnet is routable from the campus network, you’re not limited to the lower performance interface for all system interaction.

I hear you with respect to your configuration and use case not requiring a head node - you’re right, you aren’t going to be reimaging that system regularly. Something that Base Command Manager does quite well though is enabling safe changes and rollback mechanisms as system updates are applied. With Base Command Manager, a user can clone a DGX system’s current OS configuration as a backup, then make some modifications (upgrading DGX OS versions or software package versions, for example). If a problem is encountered, it is very straightforward to “undo” those upgrades with Base Command Manager by either rebooting the system if the change was not saved in the Base Command Manager softwareimage, or by reverting to the backed up softwareimage if the problem takes longer to manifest.

At multi-system scale, managing system consistency can become a big challenge - BCM takes the risk of system configuration drift away by managing a consistent softwareimage that can be leveraged across a category of systems - scalably from a handful of systems all the way to hundreds of them. So it’s not so much about deploying different configurations as it is about the ongoing maintenance and management of the software image deployed on the DGX - even if it isn’t changing often for extended periods of time. So I do think that, even if you cannot acquire enough systems to make a Kubernetes or BCP control plane possible, you’d find some value in using BCM (even just a head node).
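As a rough illustration of what configuration drift means in practice, here is a minimal sketch that compares the package versions reported by each node against a reference image. It does not use any Base Command Manager API; the node names, package names, and version strings are all hypothetical.

```python
# Conceptual sketch of configuration drift detection.
# Not a Base Command Manager API: all names and versions below are hypothetical.

from typing import Dict, Optional, Tuple

PackageSet = Dict[str, str]  # package name -> version string
Mismatch = Tuple[Optional[str], Optional[str]]  # (expected, actual)


def find_drift(reference: PackageSet, node: PackageSet) -> Dict[str, Mismatch]:
    """Return {package: (expected_version, actual_version)} for every mismatch.

    A version of None means the package is absent on that side.
    """
    drift: Dict[str, Mismatch] = {}
    for pkg in sorted(set(reference) | set(node)):
        expected = reference.get(pkg)
        actual = node.get(pkg)
        if expected != actual:
            drift[pkg] = (expected, actual)
    return drift


if __name__ == "__main__":
    reference_image = {"driver": "1.0", "toolkit": "2.0"}
    nodes = {
        "node01": {"driver": "1.0", "toolkit": "2.0"},
        "node02": {"driver": "1.1", "toolkit": "2.0"},  # upgraded by hand
    }
    for name, packages in nodes.items():
        drift = find_drift(reference_image, packages)
        print(name, "OK" if not drift else f"drifted: {drift}")
```

The point made above is that a consistent, centrally managed software image aims to make this kind of per-node mismatch unlikely in the first place.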