Using Python's distributed computation libraries for DNN production engine cluster management

isaaclee2313 · February 11, 2019, 10:25am

I’ve looked into two sets of tools: workload managers like Slurm and Torque, which are recommended by Nvidia, and easier-to-use Python libraries like Celery.

A little about the cluster I need to manage: it is small (tens of nodes for now and a few hundred at most in the future) and will run mostly GPU-intensive DNN computations.

Question:
How would you compare the pros and cons of Slurm vs Celery, especially on: 1. ease of use, 2. performance, 3. scalability to hundreds of nodes, 4. how much fine-tuning each allows?