CUDA and driver installation on a small cluster

I have a very small cluster with one head node and one compute node. The compute node has two V100 GPUs, but the head node has no graphics card at all. I’m provisioning my system using Warewulf in a stateless compute node configuration. I’ve been trying to find a guide on how to install the driver and the CUDA toolkit in a cluster setting, but without much success. The documentation mentions two packages that seem to be related to cluster installation, but I cannot find those packages for download, nor the associated README file it mentions:

  • cuda-cluster-runtime-9-2, cuda-cluster-devel-9-2:

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#cluster

Does anyone know where I could find these packages? Is there official support for cluster installation?

I would like to be able to launch GPU-related jobs (both C++ and Python) from the head node to the compute node. I’m using Slurm as the job scheduler. Does anyone have a recipe for how to proceed with the installation? Any hints would be much appreciated.

Thank you

Bump

Anybody? Really, seriously, nobody from NVIDIA knows about these packages? They are mentioned in the documentation …

I don’t need somebody to hold my hand in installing these, just tell me where to get the packages from.

Hello,

I have forwarded your questions to the CUDA team. Please stay tuned for a reply.

Thanks,
Tom

The cluster packages are available on the download page in the tarball called “cluster(local)” – see screenshot below.

Once you extract the tarball, you should be able to find the packages: cuda-cluster-runtime-10-0_10.0.130-1 and cuda-cluster-devel-10-0_10.0.130-1
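
As a rough sketch of that step, the snippet below lists the cluster packages inside the downloaded tarball without unpacking it. The function name and the tarball path are hypothetical (the real filename depends on the CUDA version and distribution you download), and the only thing assumed about the contents is that the package file names start with `cuda-cluster-`, as in the names quoted above.

```python
import tarfile
from pathlib import PurePosixPath


def find_cluster_packages(tarball_path, prefix="cuda-cluster-"):
    """Return the archive members whose file name starts with `prefix`.

    `tarball_path` is whatever file you downloaded; no particular
    name or package extension is assumed here.
    """
    matches = []
    # "r:*" lets tarfile detect the compression (none, gzip, bz2, xz) itself.
    with tarfile.open(tarball_path, mode="r:*") as archive:
        for member in archive.getmembers():
            file_name = PurePosixPath(member.name).name
            if member.isfile() and file_name.startswith(prefix):
                matches.append(member.name)
    return sorted(matches)
```

Calling `find_cluster_packages("/path/to/downloaded-tarball")` should return the paths of the runtime and devel packages if they are in the archive, and an empty list otherwise.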

Is there official support for cluster installation? Does anyone have any recipe on how to proceed to the installation?
Yes – we officially support cluster packages. More details are available in the README.

Hope this helps.

Cheers,
Tom

Thank you for the reply and the clarifications.

For CUDA 9.2, unfortunately, the package is not available for CentOS 7, which I am using for my cluster installation. It’s only available for RHEL 7, so I’m going to try to install that one. The README file inside that archive is not what I was expecting: it does not really clarify what those packages do or how to install them in a cluster environment (head node and compute nodes).

I failed to find the supported distributions in the documentation, though I’m a bit surprised that there is support for Ubuntu and not CentOS (do people install Ubuntu on their clusters?).

(I have no idea how to actually link images to this post.)
nvidia_cuda_centos_download.png
nvidia_cuda_documentation.png

I have forwarded your question to the Product Manager.

I am not sure why the images did not load.
What format were these two files?

Thanks,
Tom

Hi Tom, thanks for the info.
When will the image be loaded for CentOS 7?