RTX PRO 6000 not recognized by slurm-mig-discovery

Hello,

[This is probably not the right category, but “Other tools” does not seem to be available to me.]

We have a Slurm cluster with an existing node with A100-80G cards, managed via mig-parted (NVIDIA/mig-parted on GitHub, the MIG Partition Editor for NVIDIA GPUs) and integrated into Slurm via slurm-mig-discovery (nvidia/hpc/slurm-mig-discovery on GitLab).

The same process, however, does not work on a new server with RTX PRO 6000 Blackwell Server Edition cards. The MIG partitions are correctly enumerated by nvidia-smi, but the discovery tool fails with:
GPU count 4
Error in nvmlDeviceGetName()

Is there a new version of the tool available? Are we missing something?

Regards


We are troubleshooting the same problem. I played around with mig.c for a bit and found that the return code from nvmlDeviceGetName() was NVML_ERROR_INSUFFICIENT_SIZE (7). Increasing the name[32] buffer to 64 and recompiling seems to have worked for me…
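
For anyone wondering why the buffer size matters, here is a minimal sketch that does not need NVML or a GPU. `sketch_get_name()` is a hypothetical stand-in, not the real nvmlDeviceGetName(): it only mimics the behaviour of refusing to write a name that does not fit in the caller's buffer. The status names are made up for this example (the value 7 is the code reported above), and the device name string in the usage note is an assumption based on the card's product name, not output captured from the tool.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in status codes for this sketch only.
   7 is the value reported for the insufficient-size error above. */
enum { SKETCH_SUCCESS = 0, SKETCH_INSUFFICIENT_SIZE = 7 };

/* Hypothetical stand-in for nvmlDeviceGetName(): copies full_name into
   name only if the buffer holds the string plus its NUL terminator. */
static int sketch_get_name(const char *full_name, char *name, unsigned int length)
{
    size_t needed = strlen(full_name) + 1;

    if (name == NULL || length < needed)
        return SKETCH_INSUFFICIENT_SIZE;  /* caller's buffer is too small */

    memcpy(name, full_name, needed);      /* includes the terminator */
    return SKETCH_SUCCESS;
}
```

Called with the assumed name "NVIDIA RTX PRO 6000 Blackwell Server Edition" (44 characters plus terminator), this fails for a char[32] buffer and succeeds for a char[64] one, which matches what I saw. If I remember the header correctly, nvml.h also defines NVML_DEVICE_NAME_V2_BUFFER_SIZE (96), so sizing the buffer with that constant instead of a hard-coded number may be the more future-proof change in mig.c.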
