Hi everyone, I am trying to use NIM on a DGX Cloud Slurm cluster. Earlier I was able to run Llama 3.1 70B with NIM version 1.2.1. However, with any later version, the container fails to detect the GPU. I also tried nvidia-smi inside the job, but it reports that the command is not found. I have checked manually via SSH on the GPU compute node that the drivers are present, and nvidia-smi works there, but inside the Slurm job the GPU is not detected. I also checked other non-NIM images such as PyTorch and NVIDIA NeMo, and the latest versions of those work; only the NIM image fails. The Slurm command used to launch the job is below. Could you please help me?
Command:
srun -N 1 --pty \
  --container-image nvcr.io/nim/meta/llama-3.1-70b-instruct:latest \
  --container-mounts "/lustre/fs0/scratch/rahulsingh55/" \
  --container-mounts /cm/shared \
  --gpus 4 \
  --job-name "my-job:interactive" \
  --partition defq \
  --mpi=pmix \
  bash
Hey @rahulsingh55,
We can take a look at this for sure. As part of DGX Cloud, you should have a dedicated Technical Account Manager (TAM) and access to enterprise support. Can you reach out to your TAM and work with them to file an enterprise support case for this?
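When filing the case, it may help to include a record of what the container can actually see. The following is only a rough sketch, assuming bash is available inside the job; the function name gpu_visibility_report is made up for illustration and is not part of NIM or Slurm. It reports whether nvidia-smi is on PATH, prints the environment variables that the NVIDIA container tooling and Slurm commonly use for GPU selection, and counts the GPU device nodes under /dev.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NIM, Slurm, or any NVIDIA tool):
# summarize what the current shell can see of the GPU stack.
gpu_visibility_report() {
    local dev count=0

    # Is the nvidia-smi binary reachable from PATH inside the container?
    if command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi on PATH: yes ($(command -v nvidia-smi))"
    else
        echo "nvidia-smi on PATH: no"
    fi

    # Environment variables commonly involved in GPU selection.
    echo "NVIDIA_VISIBLE_DEVICES=${NVIDIA_VISIBLE_DEVICES:-<unset>}"
    echo "NVIDIA_DRIVER_CAPABILITIES=${NVIDIA_DRIVER_CAPABILITIES:-<unset>}"
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"

    # Count per-GPU device nodes (/dev/nvidia0, /dev/nvidia1, ...).
    # If the glob matches nothing it stays literal and the -e test fails.
    for dev in /dev/nvidia[0-9]*; do
        [ -e "$dev" ] && count=$((count + 1))
    done
    echo "GPU device nodes under /dev: $count"
}

gpu_visibility_report
```

Running this once in the failing NIM job and once in a working image (for example the PyTorch one) gives two outputs that can be compared side by side and attached to the case.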
Thanks so much,
Sophie