There are several somewhat related activities that could fall under the category of “cluster management”:
1. monitoring - keeping track of utilization, status, and health
2. provisioning - setting up or reconfiguring the software install
3. work distribution - coordinating the activities of multiple workers and assigning them to one or more jobs, from one or more users
Tools like MPI and NCCL fall into the 3rd category, but they also intersect the application development process itself.
To give some examples: a tool like LSF provides some support for categories 1 and 3, with (I would say) a heavier emphasis on category 3. A tool like Bright Cluster Manager is largely focused on category 2. Ganglia is largely focused on category 1. NVIDIA provides software building blocks, in one form or another, that support all 3 categories to varying degrees (as they pertain to GPUs).
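As a rough illustration of category 1 (monitoring) on the GPU side, here is a minimal Python sketch that polls `nvidia-smi` using its CSV query mode. The helper names (`parse_gpu_csv`, `query_gpus`, `idle_gpus`) and the 5% idle threshold are assumptions made up for this example; it is not a substitute for something like Ganglia.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        values = [field.strip() for field in record]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        try:
            index, util, used, total = (int(v) for v in values)
        except ValueError:
            continue  # skip GPUs that report a non-numeric value
        rows.append({
            "index": index,
            "util_pct": util,
            "mem_used_mib": used,
            "mem_total_mib": total,
        })
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed per-GPU status."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


def idle_gpus(rows, max_util_pct=5):
    """Return the indices of GPUs at or below the utilization threshold."""
    return [row["index"] for row in rows if row["util_pct"] <= max_util_pct]
```

On a node with the NVIDIA driver installed, `idle_gpus(query_gpus())` would list the GPUs that currently look free. Running that from cron and logging the result is about as far as I would take a hand-rolled approach.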
I personally wouldn’t attempt to build a load balancer (a category 3 item) by hand. If you have a significant need for a job scheduler, I would at least investigate open-source alternatives such as SLURM or possibly PBS or Torque/Maui.
If you intend to have this maintained by a single person who also has a “day job”, you may be taking on too much. I have built simple access control mechanisms using task spooler.
This was sufficient to let multiple users share the GPUs without crashing horribly into each other, and maintaining it did not become a full-time job.
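To make the idea of simple access control concrete, here is a minimal Python sketch of one way to keep users from colliding on a GPU. Note that this is not task spooler and not the setup described above: it swaps in plain advisory file locks (`fcntl.flock`, so Linux/Unix only), and the names `gpu_slot`, `run_on_gpu`, and the lock directory layout are hypothetical. It only helps if every user launches jobs through the wrapper.

```python
import contextlib
import fcntl
import os
import subprocess
import tempfile


def _acquire(path, blocking):
    """Open the lock file and try to take an exclusive advisory lock."""
    handle = open(path, "a")
    flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
    try:
        fcntl.flock(handle, flags)
    except BlockingIOError:
        handle.close()
        return None  # someone else currently holds this GPU
    return handle


@contextlib.contextmanager
def gpu_slot(gpu_ids, lock_dir=None):
    """Yield the id of a GPU held exclusively for the duration of the block."""
    lock_dir = lock_dir or tempfile.gettempdir()
    paths = {g: os.path.join(lock_dir, f"gpu{g}.lock") for g in gpu_ids}
    chosen, handle = None, None
    # First pass: take the first GPU that is free right now.
    for gpu_id in gpu_ids:
        handle = _acquire(paths[gpu_id], blocking=False)
        if handle is not None:
            chosen = gpu_id
            break
    # All busy: wait in line for the first GPU in the list.
    if handle is None:
        chosen = gpu_ids[0]
        handle = _acquire(paths[chosen], blocking=True)
    try:
        yield chosen
    finally:
        fcntl.flock(handle, fcntl.LOCK_UN)
        handle.close()


def run_on_gpu(cmd, gpu_ids, lock_dir=None):
    """Run cmd with CUDA_VISIBLE_DEVICES pinned to the GPU that was locked."""
    with gpu_slot(gpu_ids, lock_dir) as gpu_id:
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
        return subprocess.run(cmd, env=env)
```

A user would then call something like `run_on_gpu(["python", "train.py"], gpu_ids=[0, 1], lock_dir="/var/lock/gpus")`, where the script name and shared lock directory are placeholders. Anything beyond this (priorities, fair share, multi-node jobs) is where I would reach for a real scheduler instead.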