Slurm install

Greetings -

I am getting this when running

sacctmgr add cluster nvda

sacctmgr: error: resolve_ctls_from_dns_srv: res_nsearch error: No error
sacctmgr: error: fetch_config: DNS SRV lookup failed
sacctmgr: error: _establish_config_source: failed to fetch config
sacctmgr: fatal: Could not establish a configuration source

I believe it is related to the DNS config of the FiOS router; however, resolv.conf comes back correctly. For some reason, I cannot set the domain name to anything other than local.
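
If I understand the configless setup correctly, with no local slurm.conf present, sacctmgr tries to locate the controller through a DNS SRV lookup, so the record it would need to resolve looks roughly like this (the domain and controller hostname below are just placeholders for my setup):

_slurmctld._tcp.example.lan. 3600 IN SRV 0 0 6817 slurm-ctl.example.lan.

and a quick check from the client would be:

dig SRV _slurmctld._tcp.example.lan

If the router's DNS cannot serve that record, or the search domain stays stuck on local, the lookup fails exactly like the errors above.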

Thoughts?

Sorry, we don’t have experience with Slurm. I will let other forum users share their experience here.

Were you able to find a solution for this problem? I am experiencing the same thing.

Slurm 19 seems to work just fine; however, I have not revisited the gres.conf GPU config.

It wasn’t working for me when I installed it using make install as root (sudo -i, ./configure, make install, etc.) on version 20.02.2.

I just reinstalled it from RPMs as a regular user (i.e. sudo yum localinstall …), and it works as intended!
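
For anyone else who hits this, roughly what I did was build the RPMs from the tarball and then install them, along these lines (version string and arch directory are from memory, adjust to your download):

rpmbuild -ta slurm-20.02.2.tar.bz2
sudo yum localinstall ~/rpmbuild/RPMS/x86_64/slurm-*.rpm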

Which version? 19 or 20? If 20, then how did you get around dynamic config?

Thanks
Chris

I’ve never heard of dynamic config before. I believe that our network setup issues static IPs.

I meant Slurm 20, which supports SRV records in DNS, which allows for dynamic config.

I just have the /etc/slurm/slurm.conf file on the login node. I don’t know if this answers your question.

Could you dump your slurm.conf file?

Thanks
Chris

I am still in the testing phase, and the final configuration will not look like this when done. For example, I am planning on having slurmdbd on the same server as the login node, not the control node as it is currently set up. Also, the SrunPortRange will be much larger once we go live, since the number of users will be larger than the number of people testing it (currently two).

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.

SlurmctldHost=tesla
SlurmctldHost=turing
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/nfsshare/slurmctld
SrunPortRange=61001-61200
SwitchType=switch/none
TaskPlugin=task/cgroup

# TIMERS

InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=180
SlurmdTimeout=300
Waittime=0

# SCHEDULING

SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

# LOGGING AND ACCOUNTING

AccountingStorageHost=tesla
AccountingStorageLoc=slurm_acct_db
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=root
AccountingStorageEnforce=limits
AccountingStoreJobComment=YES
ClusterName=HPCCluster
JobCompLoc=/var/log/slurm/slurm_jobcomp.log
JobCompType=jobcomp/filetxt
JobAcctGatherType=jobacct_gather/linux
SlurmctldLogFile=/var/spool/slurm/slurmctldLog
SlurmdLogFile=/var/spool/slurm/slurmdLog

# COMPUTE NODES

NodeName=turing RealMemory=193292 Sockets=1 CoresPerSocket=64 ThreadsPerCore=4 State=UNKNOWN
NodeName=newton RealMemory=193292 Sockets=1 CoresPerSocket=64 ThreadsPerCore=4 State=UNKNOWN
PartitionName=KNL Nodes=ALL Default=YES MaxTime=INFINITE State=UP
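
After copying the file to all nodes and restarting slurmctld/slurmd, I sanity-check it with something like:

scontrol ping
sinfo -N -l

scontrol ping should report both tesla and turing as controllers, and sinfo should list turing and newton.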

Greetings,
I have 3 Jetson Nanos running with Slurm.
I can process 1,000s of tasks without issue.
I am looking for a power splitter and a dedicated switch.

Hello everyone,
I want to use the Jetson for GPU nodes too, but when I try to compile “hwloc” or “OpenMPI” the system hangs… I use the original image file from NVIDIA and disable the X server.

So, is there something additional to know about using the Jetson as a GPU node? Which package must be installed to use it as a GPU? Any other help?

Best regards…

For now, I am working off a base Slurm install with no GPU support, but the Python 3 CUDA compile worked. So, Slurm can run the jobs but is not aware of the GPU. Looking at powering 10+ of them eventually.
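
From what I have read so far, making Slurm GPU-aware needs a GRES entry on both sides of the config; a rough sketch (the node name is a placeholder, and the device file is what a discrete-GPU box would use, I have not confirmed the right one for Tegra) would be:

In slurm.conf:
GresTypes=gpu
NodeName=nano1 Gres=gpu:1 ...

In gres.conf on the node:
NodeName=nano1 Name=gpu File=/dev/nvidia0

Jobs would then request it with srun --gres=gpu:1.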

Chris

Hey Chris,
thank you very much for your answer.
Well, I want to install Slurm with GPU support, and it seems that this should work. But how? :-)

If I find out how, I can tell you.
Best regards

I don’t think the NVML module exists for Tegra.

Good luck

One thing to check is the permissions on /etc/slurm, where the slurm.conf file is usually located. If it is not readable by the slurm user, you will get this error. In my case I have
drwx-wx--x. 2 root root 4096 Jan 11 20:23 slurm
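
Something like this opened the directory back up in my case (adjust to your own policy):

sudo chmod 755 /etc/slurm
sudo chmod 644 /etc/slurm/slurm.conf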