Hi,
I am trying to run my code on the Delta-AI system at NCSA.
I can run the code on up to all 4 GH200 GPUs on a single node, but when I try to run across multiple nodes, I get a series of UCX errors (“connection refused”).
I have tried numerous environment variables to no avail.
The system admins have told me they do not think that the nvhpc compiler I installed locally from NVIDIA has built-in support for the Slingshot network.
Is this correct? Are there any special environment variables I can set to get the HPC-X in 24.11 to work on a Cray Slingshot network?
– Ron