Using Python's distributed computation libraries for DNN production engine cluster management

I’ve looked into two sets of tools: workload managers like Slurm and Torque, which are recommended by Nvidia, and easier-to-use Python libraries like Celery.

A little about the cluster I need to manage: it is small (tens of nodes for now, and a few hundred at most in the future), and the workload is mostly GPU-intensive DNN computation.

How would you compare the pros and cons of Slurm vs. Celery, especially on: 1. ease of use, 2. performance, 3. scalability to hundreds of nodes, and 4. how much you can fine-tune each?