Multi-GPU monitoring system

NVIDIA has a page on cluster management, but I am not sure which of the listed tools I should go with. Which would be the go-to choice? It doesn’t matter if it is not open source. I am a complete newbie in this field, so please feel free to share any advice on general multi-GPU production server management.


You might want to start by making a list of features you need, determine how big a cluster you envision running (no need to aim cannons at gnats), and your basic cost constraints. Check whether any of the vendors will let you test-drive their product. Involve the person who will be administrating the system in the selection process.

I worked for an early customer of Platform Computing’s LSF (around 1994), and served as LSF administrator for a cluster of close to one hundred machines at the time. The vendor was a Toronto-based startup in those days. I had selected LSF from among several competitors (including at least one open-source candidate). I spent about two months studying features vs. requirements and inquiring with vendor representatives, and in the end wound up with a shortlist of two vendors. I did trial installations of those two.

I ran that LSF cluster for about a year and generally was happy with my choice and Platform Computing tech support. In those days you would sometimes get to talk to their development engineers directly. Since then I have used LSF at multiple other companies and never had any complaints. The amount of configurability it provided is immense. However, the last time I looked at pricing (now quite a number of years ago), I got a major case of sticker shock. It seemed that its price had grown with the size of the company selling it.

These days LSF is owned by IBM. Not sure what implications (if any) that has. Look carefully before you jump.

Hi njuffa,

Thanks for your helpful insight.

I was considering just building the management system by hand, using MPI and NCCL. That would mean I need to take care of load balancing by hand. Do you think this is unrealistic? (FYI, I am a junior in CS and I am the ONLY person with any knowledge of systems, so I am essentially the company’s system administrator, unfortunately.)

There are several somewhat related activities that could fall under the category of “cluster management”:

  1. monitoring - keeping track of utilization, status, health
  2. provisioning - setting up or reconfiguring the software install
  3. work distribution - coordinating the activities of multiple workers, to assign them to one or more jobs, from one or more users

Tools like MPI and NCCL fall into the 3rd category, but they also intersect the application development process itself.

A tool like LSF provides some support for categories 1 and 3, with (I would say) a heavier emphasis on category 3. A tool like Bright Cluster Manager is largely focused on category 2. Ganglia is largely focused on category 1. (Just to give some examples). NVIDIA provides software building blocks in one form or another to support in varying degrees all 3 categories (as they pertain to GPUs).
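To make category 1 a bit more concrete, here is a minimal sketch of a monitoring poll built on `nvidia-smi`'s CSV query mode. It is only an illustration, not a replacement for a tool like Ganglia. The function names and the 5% idle threshold are made up for this example, and it assumes every queried field comes back as a plain number (some GPUs report `[N/A]` for certain fields, which this sketch does not handle).

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi (see `nvidia-smi --help-query-gpu`).
FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_stats(csv_text):
    """Turn nvidia-smi CSV output (noheader, nounits) into a list of dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(rec) != len(FIELDS):
            continue  # skip blank or malformed lines
        rows.append({
            "index": int(rec[0]),
            "util_pct": int(rec[1]),
            "mem_used_mib": int(rec[2]),
            "mem_total_mib": int(rec[3]),
        })
    return rows


def query_gpus():
    """Run nvidia-smi on the local machine and return parsed per-GPU stats."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)


def idle_gpus(stats, max_util_pct=5):
    """Return indices of GPUs at or below the (arbitrary) utilization threshold."""
    return [g["index"] for g in stats if g["util_pct"] <= max_util_pct]
```

Calling `idle_gpus(query_gpus())` from a cron job or a small loop and logging the result would already give a rough picture of utilization on a single node. Anything beyond that (history, alerting, multiple nodes) is where an existing monitoring tool starts to pay off.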

I personally wouldn’t attempt to build a load balancer (a category 3 item) by hand. If you have a significant need for a job scheduler, I would at least investigate open-source alternatives such as SLURM or possibly PBS or Torque/Maui.

If you are intending to maintain this with a single person who also has a “day job”, you may be tackling too much. I have built simple access control mechanisms using task spooler:

This was sufficient to allow multiple users to use GPUs without crashing horribly into each other, without becoming a full-time job.
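The general idea (one job per GPU at a time, everyone else waits in line) can be sketched without any extra tooling. The following is not task spooler and not the setup described above; it is a hypothetical stand-in that uses POSIX advisory file locks, with one lock file per GPU. The lock directory and the `gpu_slot` name are assumptions for illustration, and it only works if every user's jobs cooperate by going through the same lock files.

```python
import contextlib
import fcntl
import os
import tempfile

# Hypothetical shared location for lock files; pick a directory all users can write to.
LOCK_DIR = os.path.join(tempfile.gettempdir(), "gpu-locks")


@contextlib.contextmanager
def gpu_slot(gpu_index, lock_dir=LOCK_DIR):
    """Block until the given GPU's lock is free, then hold it for the with-block.

    Yields environment variables that restrict a child process to that GPU.
    """
    os.makedirs(lock_dir, exist_ok=True)
    path = os.path.join(lock_dir, f"gpu{gpu_index}.lock")
    with open(path, "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # waits here while another job holds the GPU
        try:
            yield {"CUDA_VISIBLE_DEVICES": str(gpu_index)}
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```

A job wrapper would then look roughly like `with gpu_slot(0) as env: subprocess.run(cmd, env={**os.environ, **env})`. Note that on a multi-user machine the lock files need permissions that let every user open them, and nothing here stops a job that ignores the locks, so this is a courtesy mechanism rather than enforcement.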

Thanks a lot! I will look into the mentioned programs.

I am not too certain what you mean by: “2. provisioning - setting up or reconfiguring the software install”. Could you please give some examples? I am thinking: updating weights loaded per GPU would be one case of category 2?

No, that’s not it. You seem to have application development smeared into cluster management. While there may be some cross-pollination, for me at least they are quite separate concepts. None of the tools for cluster management are going to provide support for a parameter server in a deep learning context. That is still firmly in the camp of the framework or application software (although systems like Kubernetes are probably starting to blur this boundary also).

Many questions are answerable with a careful Google search. I encourage you to try it. The research skills you build will be valuable to you, in my opinion.

I already mentioned that Bright Cluster Manager was an example of a category 2 tool. They have a page which defines provisioning pretty well, in my opinion:

Okay, thanks a lot :) Really appreciate your help, as always.