Using HPC-X on Cray Slingshot

Hi,

I am trying to run my code on the Delta-AI system at NCSA.

I can run the code on up to all 4 GH200 GPUs on a single node, but when I try to run across multiple nodes, I get a series of UCX errors (“connection refused”).

I have tried numerous environment variables to no avail.

The system admins have told me they do not think that the NVHPC compiler that I installed locally from NVIDIA has built-in support for the Slingshot network.

Is this correct? Are there any special environment variables I can set to make the HPC-X in 24.11 work on a Cray Slingshot network?

– Ron

Hi Ron,

I double-checked with some folks and confirmed that the HPC-X we ship is only available for Mellanox interconnects. You or the NCSA folks will need to see whether they can build you an NVHPC-enabled MPI that works with Slingshot.

-Mat

Thanks for the info!

How does it work on a local system or within a node?

– Ron

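The interconnect is only needed when communication goes across the network, that is, between nodes. Ranks on the same node talk to each other through on-node transports such as shared memory, so Slingshot support never comes into play in a single-node run.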
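
To make that concrete, here is a minimal sketch in plain Python (no MPI required) of which rank pairs stay on a node and which have to cross the fabric. The `classify_pairs` helper, the host names, and the one-rank-per-GPU layout are all hypothetical and only for illustration; in a real job you would gather the host names yourself, for example with `MPI_Get_processor_name`.

```python
from itertools import combinations


def classify_pairs(rank_hosts):
    """Split all rank pairs into intra-node and inter-node lists.

    rank_hosts: dict mapping rank -> hostname (hypothetical input).
    """
    intra, inter = [], []
    for a, b in combinations(sorted(rank_hosts), 2):
        if rank_hosts[a] == rank_hosts[b]:
            intra.append((a, b))  # same node: no network fabric involved
        else:
            inter.append((a, b))  # different nodes: needs the interconnect
    return intra, inter


# One node, 4 ranks (assumed one rank per GH200 GPU).
one_node = {rank: "node01" for rank in range(4)}
intra_1, inter_1 = classify_pairs(one_node)
print(len(intra_1), len(inter_1))  # 6 0

# Two nodes, 4 ranks each.
two_nodes = {rank: ("node01" if rank < 4 else "node02") for rank in range(8)}
intra_2, inter_2 = classify_pairs(two_nodes)
print(len(intra_2), len(inter_2))  # 12 16
```

In the single-node case no pair ever needs the network, which is consistent with the run working on all 4 GPUs of one node. As soon as a second node is added, 16 of the 28 pairs depend on an MPI that can drive the interconnect.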
