A couple of weeks back I made a post asking about Slurm, and it seems to have gotten lost in all the AI buzz. I suspect we HPC people are a minority here, so it may be wise to keep everything aggregated into a single thread.
Some possible discussion topics:
System provisioning and configuration - what configurations have you found to work well on the DGX Spark/GB10? Let’s keep this oriented towards HPC-style setups.
Scientific applications - has anyone tried other scientific applications yet? GROMACS, OpenFOAM, WRF, you name it.
Benchmark discussion - it could be interesting to compare and discuss results, like the OSU microbenchmarks, the Intel MPI benchmarks, or custom application benchmark scripts.
I’ll warm this up with what I’ve set up so far. I followed NVIDIA’s DGX “DeepOps” tutorial for setting up an all-in-one Slurm system. Source:
The all-in-one setup runs the login, OnDemand, compute, and Slurm master roles all on one node. The GPU is hidden by default via cgroups and has to be accessed through a job allocation. The playbooks are very nice in that they install Docker, Singularity, and enroot/pyxis (the latter is my favorite container combo). I’ve found the containers to be the most efficient way of testing GenAI stacks like TensorRT-LLM and vLLM - mpirun with TensorRT-LLM works very well under enroot/pyxis.
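For anyone who hasn’t tried the pyxis route yet, this is roughly how I’ve been launching containerized jobs through srun; the image tag, mount paths, and GPU count are just placeholders, so adjust for your own setup:

# single task, GPU allocated through Slurm; pyxis provides the --container-* flags
srun --gres=gpu:1 \
     --container-image=nvcr.io#nvidia/pytorch:24.10-py3 \
     --container-mounts=$HOME/work:/work \
     nvidia-smi

# two MPI ranks inside the same image; PMIx does the wire-up and each rank
# lands in its own enroot container
srun -N1 -n2 --mpi=pmix \
     --container-image=nvcr.io#nvidia/pytorch:24.10-py3 \
     python /work/run_benchmark.py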
The Ansible playbook itself required a bit of patching, but after an afternoon of work I got it running successfully on a single Spark. The next step is to try setting up a second Spark with directly connected InfiniBand.
I got lucky: it turns out I have a spare HDR switch lying around, and I’m ordering 20 of these little nodes. It’ll be very interesting to see how the Sparks work in a more typical cluster topology. Apparently the ConnectX card may require some special configuration if used in a switch topology:
Adding my initial Slurm query that got lost in the abyss.
CPU topology is likely to be an important part of the Slurm configuration. The NUMA topology reports one socket with the Cortex-X925 (performance) cores and another socket with the Cortex-A725 (efficiency) cores. I wonder what the effective CPU binding is here?
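For reference, this is how I’d map out the P/E core split before touching the Slurm config; the sysfs path only applies if the kernel exposes the Arm capacity values, and I haven’t verified any of the IDs on a Spark:

lscpu --extended=CPU,SOCKET,CORE,MAXMHZ   # per-CPU view; max clock usually separates the two core types
numactl --hardware                        # NUMA node and memory layout
grep . /sys/devices/system/cpu/cpu*/cpu_capacity 2>/dev/null   # relative core capacity, if the kernel exposes it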
When running MPI jobs on other systems, it’s been found that the efficiency cores can bottleneck the performance ones, and it’s often best to disable the efficiency cores… however, that halves the total number of CPUs on the Spark.
In an all-in-one single-node setup that works well, since the ancillary services can occupy the efficiency cores and aren’t too performance-sensitive.
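One option for the all-in-one case (untested on my side; the node name and CPU IDs are guesses based on the 10x X925 + 10x A725 layout) would be to carve the efficiency cores out for the OS and the login/OnDemand/slurmctld services with CpuSpecList in slurm.conf, so jobs only ever land on the performance cores. Note this is only enforced when the cgroup task plugin is in use:

# slurm.conf sketch - reserve CPUs 10-19 (assumed to be the A725 cores) for system use
NodeName=spark01 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 CpuSpecList=10,11,12,13,14,15,16,17,18,19 State=UNKNOWN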
On conventional x86/PCIe systems only a few CPU cores are enough to reach full GPU utilization; I wonder whether NVLink changes this dynamic?
In a full cluster configuration with dedicated compute nodes, halving the total CPUs to only the performance cores might have a more noticeable effect on performance, since half the CPUs would be sitting idle with nothing to do! On the other hand, I’m not aware of any applications or MPI libraries that optimize for performance and efficiency cores.
Maybe offer two Slurm partitions, one for all CPUs and another for just the performance cores.
Indeed, the efficiency cores run at a lower speed, which can leave the faster cores in a wait state. However, more cores may end up being an advantage depending on your application. Benchmarking or profiling your software may help.
lstopo --of txt (from hwloc) may be useful here.
You can limit jobs to specific cores with the Slurm option --cpu-bind=map_cpu once you figure out which core is which.
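Something like this is what I’d try once the core IDs are known; the IDs and the binary name are made up, assuming the X925 cores show up as CPUs 0-9:

# pin ten ranks, one per assumed performance core; "verbose" prints the mask each task got
srun -n10 --cpu-bind=verbose,map_cpu:0,1,2,3,4,5,6,7,8,9 ./my_mpi_app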
I’m a former CFD person and now an AI person. Sorry there’s been little activity on your HPC questions. Honestly, I haven’t seen many actual HPC people in these forums.
Unfortunately, I’m jumping on just to say hi and raise the profile of this topic to hopefully get it some interest, but I don’t really have anything solid to add right now. (I’ve been heavily focused on the AI side of things. Running OpenFOAM is on the list, but near the bottom.)
As @bugsareyummy said, I don’t think an actual IB switch is going to work – it has to be Ethernet (it uses RoCE).
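If it helps, the standard RDMA tools should confirm what mode the port is actually in, without anything Spark-specific (I haven’t run these on a Spark myself):

ibstat                             # per-port state/rate; "Link layer: Ethernet" means RoCE
ibv_devinfo | grep -i link_layer   # same answer from the verbs side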
It’s tough to say what partitioning scheme you should use without knowing more about what specifically you’re doing… but it won’t take a lot of CPU power to get you to near 100% GPU utilization. You might need more CPU for driving the networking than for the GPU if you’re looking to scale up a lot. I would basically lean on over/undersubscription (e.g. reserve 8 cores for a 4-core job; it feels like that should be called oversubscription, but that term is typically used for putting too many consumers on a resource) as your first pass and see how well or poorly the kernel scheduler does… and then transition to pinning cores based on a few task-specific evals.
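As a concrete starting point, that first pass could look something like this (the application name and the 2:1 ratio are purely illustrative):

# give each of 4 ranks 2 cores, so the job holds 8 cores but only computes on 4,
# leaving slack for kernel/network threads; let the scheduler place them for now
srun -n4 --cpus-per-task=2 ./my_app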