I have a cluster with 4 GPU nodes where I use Slurm to run jobs. CUDA already works on each node. My problem arises when I want to run CUDA programs from the main node.
I can log in to a node using SSH or with the Slurm srun command.
Over SSH, CUDA works fine both as a normal user and as root.
But if I open a session on a node using “srun -w node1 --pty bash” as a normal user, I get this:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 30
-> unknown error
Result = FAIL
I also get the same error using “srun -w node1 ./deviceQuery”, which is equivalent but runs the program without opening a shell on the node.
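
Since the same binary works over SSH but fails under srun, it may help to compare what each kind of session actually sees. Below is a minimal sketch of such a check, not a confirmed diagnosis. It assumes the NVIDIA device files live under /dev/nvidia* (the usual Linux driver layout), and the function name gpu_env_report is a hypothetical helper invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: report the GPU-related view of the current session.
# Assumption: the driver exposes its device files as <dir>/nvidia*, with
# <dir> defaulting to /dev.
gpu_env_report() {
    dev_dir="${1:-/dev}"

    # Which user the session runs as.
    echo "user=$(id -un)"

    # Whether the session restricts the GPUs visible to CUDA.
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES-<unset>}"

    # Whether each device file exists and is readable and writable here.
    found=0
    for f in "$dev_dir"/nvidia*; do
        [ -e "$f" ] || continue
        found=1
        if [ -r "$f" ] && [ -w "$f" ]; then
            echo "device=$f access=rw"
        else
            echo "device=$f access=denied"
        fi
    done
    [ "$found" -eq 1 ] || echo "device=none"
}

gpu_env_report
```

Saved as a script (the name gpu_env_report.sh is only an example), it can be run once in the SSH session and once through srun on the same node. Any difference between the two outputs, such as a different CUDA_VISIBLE_DEVICES value or a device file reported as denied, would point at where the two sessions diverge.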