I’m trying to install the NVIDIA HPC SDK as a “network” installation. I have a mix of machines running CentOS 7, RHEL 8, and (eventually) RHEL 9. They run a mix of CUDA 11.0, 11.2, and 12.1. (I’m ignoring the 11.2 machines because that CUDA version doesn’t seem to be included in the multi-CUDA HPC SDK tar file.)
I’m confused by the “network” install and would like to understand what I’m doing wrong.
In the following discussion:
/public/opt/nvhpc : a directory mounted via NFS on all hosts
/scr/nvhpc : a directory local to each host
I downloaded the software:
https://developer.download.nvidia.com/hpc-sdk/23.5/nvhpc_2023_235_Linux_x86_64_cuda_multi.tar.gz
and the installation documentation:
https://docs.nvidia.com/hpc-sdk/pdf/hpc-sdk235install.pdf
Per the instructions, I set up a silent install:
[gpu2]$ cd /public/src/nvhpc_2023_235_Linux_x86_64_cuda_multi/
[gpu2]$ setenv NET /public/opt/nvhpc
[gpu2]$ setenv LCL /scr/nvhpc
[gpu2]$ setenv REL 23.5
[gpu2]$
[gpu2]$ setenv NVHPC_SILENT "true"
[gpu2]$ setenv NVHPC_INSTALL_DIR ${NET}
[gpu2]$ setenv NVHPC_INSTALL_TYPE "network"
[gpu2]$ setenv NVHPC_INSTALL_LOCAL_DIR ${LCL}
[gpu2]$ setenv NVARCH Linux_x86_64
[gpu2]$ setenv NVHPC_DEFAULT_CUDA 12.1
[gpu2]$ ./install
generating environment modules for NV HPC SDK 23.5 ... done.
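Since I end up repeating that environment setup on every host, I’ve since put it in a small csh file that I source (the filename and location below are just my choice; the values are the same ones used above):

# /public/opt/nvhpc/nvhpc-net-install.csh -- source from csh/tcsh before running ./install
setenv NET    /public/opt/nvhpc
setenv LCL    /scr/nvhpc
setenv REL    23.5
setenv NVARCH Linux_x86_64
setenv NVHPC_SILENT            "true"
setenv NVHPC_INSTALL_TYPE      "network"
setenv NVHPC_INSTALL_DIR       ${NET}
setenv NVHPC_INSTALL_LOCAL_DIR ${LCL}
setenv NVHPC_DEFAULT_CUDA      12.1

[gpu1]$ source /public/opt/nvhpc/nvhpc-net-install.csh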
Looking at what’s created in NVHPC_INSTALL_DIR and NVHPC_INSTALL_LOCAL_DIR:
[gpu2]$ tree -L 4 $NVHPC_INSTALL_DIR
/public/opt/nvhpc
├── Linux_x86_64
│ ├── 2023 -> 23.5
│ └── 23.5
│ ├── cmake
│ │ ├── NVHPCConfig.cmake
│ │ └── NVHPCConfigVersion.cmake
│ ├── comm_libs
│ │ ├── 11.0
│ │ ├── 11.8
│ │ ├── 12.1
│ │ ├── hpcx
│ │ ├── mpi -> openmpi/openmpi-3.1.5
│ │ ├── nccl -> 12.1/nccl
│ │ ├── nvshmem -> 12.1/nvshmem
│ │ ├── openmpi
│ │ └── openmpi4
│ ├── compilers
│ │ ├── bin
│ │ ├── etc
│ │ ├── extras
│ │ ├── include
│ │ ├── include_acc
│ │ ├── include_man
│ │ ├── include-stdexec
│ │ ├── include-stdpar
│ │ ├── lib
│ │ ├── license
│ │ ├── man
│ │ ├── share
│ │ └── src
│ ├── cuda
│ │ ├── 11.0
│ │ ├── 11.8
│ │ ├── 12.1
│ │ ├── bin -> 12.1/bin
│ │ ├── include -> 12.1/include
│ │ ├── lib64 -> 12.1/lib64
│ │ └── nvvm -> 12.1/nvvm
│ ├── examples
│ │ ├── AutoPar
│ │ ├── CUDA-Fortran
│ │ ├── CUDA-Libraries
│ │ ├── F2003
│ │ ├── MPI
│ │ ├── NVLAmath
│ │ ├── OpenACC
│ │ ├── OpenMP
│ │ ├── README
│ │ └── stdpar
│ ├── localrc.gpu2
│ ├── localrc.gpu2-lock
│ ├── math_libs
│ │ ├── 11.0
│ │ ├── 11.8
│ │ ├── 12.1
│ │ ├── include -> 12.1/include
│ │ └── lib64 -> 12.1/lib64
│ ├── profilers
│ │ ├── Nsight_Compute
│ │ └── Nsight_Systems
│ └── REDIST
│ ├── comm_libs
│ ├── compilers
│ ├── cuda
│ └── math_libs
└── modulefiles
├── nvhpc
│ └── 23.5
├── nvhpc-byo-compiler
│ └── 23.5
├── nvhpc-hpcx
│ └── 23.5
├── nvhpc-hpcx-cuda11
│ └── 23.5
├── nvhpc-hpcx-cuda12
│ └── 23.5
└── nvhpc-nompi
└── 23.5
67 directories, 11 files
[gpu2]$ tree -L 4 $NVHPC_INSTALL_LOCAL_DIR
/scr/nvhpc
0 directories, 0 files
That doesn’t look right to me. The installer certainly didn’t create anything in the “local” directory on the machine I ran the install from, though it did create the localrc.gpu2 file.
The documentation then says to go to the other machines and do the following:
If your installation base directory is /opt/nvidia/hpc_sdk and /usr/nvidia/shared/23.5 is the common local directory, then run the following commands on each system on the network.
/opt/nvidia/hpc_sdk/$NVARCH/23.5/compilers/bin/makelocalrc -x /opt/nvidia/hpc_sdk/$NVARCH/23.5 -net /usr/nvidia/shared/23.5
These commands create a system-dependent file localrc.machinename in the /opt/nvidia/hpc_sdk/$NVARCH/23.5/compilers/bin directory. The commands also create the following three directories containing libraries and shared objects specific to the operating system and system libraries on that machine:
/usr/nvidia/shared/23.5/lib
/usr/nvidia/shared/23.5/liblf
/usr/nvidia/shared/23.5/lib64
So, going to one of my other machines and substituting my directory locations, this becomes:
[gpu1]$ cd /public/src/nvhpc_2023_235_Linux_x86_64_cuda_multi/
[gpu1]$ setenv NET /public/opt/nvhpc
[gpu1]$ setenv LCL /scr/nvhpc
[gpu1]$ setenv REL 23.5
[gpu1]$
[gpu1]$ setenv NVHPC_SILENT "true"
[gpu1]$ setenv NVHPC_INSTALL_DIR ${NET}
[gpu1]$ setenv NVHPC_INSTALL_TYPE "network"
[gpu1]$ setenv NVHPC_INSTALL_LOCAL_DIR ${LCL}
[gpu1]$ setenv NVARCH Linux_x86_64
[gpu1]$ setenv NVHPC_DEFAULT_CUDA 12.1
[gpu1]$ $NET/$NVARCH/23.5/compilers/bin/makelocalrc -x $NET/$NVARCH/23.5 -net $LCL
/public/opt/nvhpc/Linux_x86_64/23.5/compilers/bin/makelocalrc: line 146: /public/opt/nvhpc/Linux_x86_64/23.5/nvaccelinfo: No such file or directory
find: '/public/opt/nvhpc/Linux_x86_64/23.5/../../cuda': No such file or directory
/public/opt/nvhpc/Linux_x86_64/23.5/compilers/bin/makelocalrc: line 158: bundled_cuda: bad array subscript
But that fails.
As best I can tell from reading the code, the documentation is wrong and the incantation should be:
[gpu1]$ $NET/$NVARCH/23.5/compilers/bin/makelocalrc $NET/$NVARCH/23.5/compilers/bin -x -net $LCL
That doesn’t choke, and the directories from the viewpoint of the second machine are:
[gpu1]$ tree -L 4 $NET
/public/opt/nvhpc
├── Linux_x86_64
│ ├── 2023 -> 23.5
│ └── 23.5
│ ├── cmake
│ │ ├── NVHPCConfig.cmake
│ │ └── NVHPCConfigVersion.cmake
│ ├── comm_libs
│ │ ├── 11.0
│ │ ├── 11.8
│ │ ├── 12.1
│ │ ├── hpcx
│ │ ├── mpi -> openmpi/openmpi-3.1.5
│ │ ├── nccl -> 12.1/nccl
│ │ ├── nvshmem -> 12.1/nvshmem
│ │ ├── openmpi
│ │ └── openmpi4
│ ├── compilers
│ │ ├── bin
│ │ ├── etc
│ │ ├── extras
│ │ ├── include
│ │ ├── include_acc
│ │ ├── include_man
│ │ ├── include-stdexec
│ │ ├── include-stdpar
│ │ ├── lib
│ │ ├── license
│ │ ├── man
│ │ ├── share
│ │ └── src
│ ├── cuda
│ │ ├── 11.0
│ │ ├── 11.8
│ │ ├── 12.1
│ │ ├── bin -> 12.1/bin
│ │ ├── include -> 12.1/include
│ │ ├── lib64 -> 12.1/lib64
│ │ └── nvvm -> 12.1/nvvm
│ ├── examples
│ │ ├── AutoPar
│ │ ├── CUDA-Fortran
│ │ ├── CUDA-Libraries
│ │ ├── F2003
│ │ ├── MPI
│ │ ├── NVLAmath
│ │ ├── OpenACC
│ │ ├── OpenMP
│ │ ├── README
│ │ └── stdpar
│ ├── localrc.gpu1
│ ├── localrc.gpu1-lock
│ ├── localrc.gpu2
│ ├── localrc.gpu2-lock
│ ├── math_libs
│ │ ├── 11.0
│ │ ├── 11.8
│ │ ├── 12.1
│ │ ├── include -> 12.1/include
│ │ └── lib64 -> 12.1/lib64
│ ├── profilers
│ │ ├── Nsight_Compute
│ │ └── Nsight_Systems
│ └── REDIST
│ ├── comm_libs
│ ├── compilers
│ ├── cuda
│ └── math_libs
└── modulefiles
├── nvhpc
│ └── 23.5
├── nvhpc-byo-compiler
│ └── 23.5
├── nvhpc-hpcx
│ └── 23.5
├── nvhpc-hpcx-cuda11
│ └── 23.5
├── nvhpc-hpcx-cuda12
│ └── 23.5
└── nvhpc-nompi
└── 23.5
67 directories, 13 files
[gpu1]$ tree -L 4 $LCL
/scr/nvhpc [error opening dir]
0 directories, 0 files
So while makelocalrc did create a new localrc.gpu1, it didn’t create the local directory and certainly didn’t create the lib files the documentation says should be there. (Poking around in makelocalrc, it doesn’t look like the script ever does anything with the locdir variable set by the -net argument.)
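That impression is based on nothing more scientific than grepping the script for the variable on gpu1 (same NET/NVARCH/REL variables as set above):

[gpu1]$ grep -n locdir $NET/$NVARCH/$REL/compilers/bin/makelocalrc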
So… I then tried a non-silent install to see if that behaved any differently.
[gpu2]$ ./install
Welcome to the NVIDIA HPC SDK Linux installer!
You are installing NVIDIA HPC SDK 2023 version 23.5 for Linux_x86_64.
Please note that all Trademarks and Marks are the properties
of their respective owners.
Press enter to continue...
A network installation will save disk space by having only one copy of the
compilers and most of the libraries for all compilers on the network, and
the main installation needs to be done once for all systems on the network.
1 Single system install
2 Network install
Please choose install option:
2
Please specify the directory path under which the software will be installed.
The default directory is /opt/nvidia/hpc_sdk, but you may install anywhere you wish,
assuming you have permission to do so.
Installation directory? [/opt/nvidia/hpc_sdk]
/public/opt/nvhpc
Common local directory on all hosts for shared objects? [/public/opt/nvhpc/Linux_x86_64/23.5/share_objects]
/scr/nvhpc
Note: directory /scr/nvhpc was created.
Note: directory /public/opt/nvhpc was created.
Installing NVIDIA HPC SDK version 23.5 into /public/opt/nvhpc
Making symbolic link in /public/opt/nvhpc/Linux_x86_64
generating environment modules for NV HPC SDK 23.5 ... done.
Installation complete.
Please run add_network_host to create host specific localrc files:
/public/opt/nvhpc/Linux_x86_64/23.5/compilers/bin/localrc.$host
on all other hosts you wish to run NVIDIA HPC SDK compilers.
For 64-bit NVIDIA HPC SDK compilers on 64-bit Linux systems, do the following:
/public/opt/nvhpc/Linux_x86_64/23.5/compilers/bin/add_network_host
HPC SDK successfully installed into /public/opt/nvhpc
.
.
.
This seems to populate the two directories the same way my “silent” install did.
[gpu2]$ tree -L 4 $NET
/public/opt/nvhpc
├── Linux_x86_64
│ ├── 2023 -> 23.5
│ └── 23.5
│ ├── cmake
│ │ ├── NVHPCConfig.cmake
│ │ └── NVHPCConfigVersion.cmake
│ ├── comm_libs
│ │ ├── 11.0
│ │ ├── 11.8
│ │ ├── 12.1
│ │ ├── hpcx
│ │ ├── mpi -> openmpi/openmpi-3.1.5
│ │ ├── nccl -> 12.1/nccl
│ │ ├── nvshmem -> 12.1/nvshmem
│ │ ├── openmpi
│ │ └── openmpi4
│ ├── compilers
│ │ ├── bin
│ │ ├── etc
│ │ ├── extras
│ │ ├── include
│ │ ├── include_acc
│ │ ├── include_man
│ │ ├── include-stdexec
│ │ ├── include-stdpar
│ │ ├── lib
│ │ ├── license
│ │ ├── man
│ │ ├── share
│ │ └── src
│ ├── cuda
│ │ ├── 11.0
│ │ ├── 11.8
│ │ ├── 12.1
│ │ ├── bin -> 12.1/bin
│ │ ├── include -> 12.1/include
│ │ ├── lib64 -> 12.1/lib64
│ │ └── nvvm -> 12.1/nvvm
│ ├── examples
│ │ ├── AutoPar
│ │ ├── CUDA-Fortran
│ │ ├── CUDA-Libraries
│ │ ├── F2003
│ │ ├── MPI
│ │ ├── NVLAmath
│ │ ├── OpenACC
│ │ ├── OpenMP
│ │ ├── README
│ │ └── stdpar
│ ├── math_libs
│ │ ├── 11.0
│ │ ├── 11.8
│ │ ├── 12.1
│ │ ├── include -> 12.1/include
│ │ └── lib64 -> 12.1/lib64
│ ├── profilers
│ │ ├── Nsight_Compute
│ │ └── Nsight_Systems
│ └── REDIST
│ ├── comm_libs
│ ├── compilers
│ ├── cuda
│ └── math_libs
└── modulefiles
├── nvhpc
│ └── 23.5
├── nvhpc-byo-compiler
│ └── 23.5
├── nvhpc-hpcx
│ └── 23.5
├── nvhpc-hpcx-cuda11
│ └── 23.5
├── nvhpc-hpcx-cuda12
│ └── 23.5
└── nvhpc-nompi
└── 23.5
67 directories, 9 files
[gpu2]$ tree -L 4 $LCL
/scr/nvhpc
0 directories, 0 files
Interestingly, the ./install program says nothing about running makelocalrc, but instead says the following:
For 64-bit NVIDIA HPC SDK compilers on 64-bit Linux systems, do the following:
/public/opt/nvhpc/Linux_x86_64/23.5/compilers/bin/add_network_host
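So, on gpu1, that step is simply:

[gpu1]$ /public/opt/nvhpc/Linux_x86_64/23.5/compilers/bin/add_network_host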
Having done that, the only difference I can see in NVHPC_INSTALL_DIR is the appearance of localrc.gpu1 and localrc.gpu1-lock files in /public/opt/nvhpc/Linux_x86_64/23.5/compilers/bin.
Nothing appears in the local directory (though at least this time the directory got created):
[gpu1]$ tree -L 4 $LCL
/scr/nvhpc
0 directories, 0 files
I’m at a loss. I don’t get what the documentation says I should (i.e., the separate lib files), and there doesn’t appear to be any per-host customization other than the localrc.host file. Looking at a localrc.host file, it mostly seems to be setting up which compiler to use. Like the poster in an earlier thread (HPCSDK 22.7 Installation issues - #6 by pjh40), we install multiple compilers and use environment modules to switch between them, so I’m not even sure the localrc.host setup is going to be appropriate for us.
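For context, the way I’d expect users to pick up the network install is through the modulefiles the installer generated, alongside our other compiler modules (assuming a standard environment-modules or Lmod setup), e.g.:

[gpu1]$ module use /public/opt/nvhpc/modulefiles
[gpu1]$ module load nvhpc/23.5
[gpu1]$ nvc --version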
Is this the way it’s supposed to work?
What gets run on the “other” hosts? makelocalrc? add_network_host? Both? (I tried both and neither seemed to do anything useful.)
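Whichever it is, I’d expect to script that step across the fleet, something like this sketch (the host names are placeholders, and it assumes passwordless ssh from the machine I installed on plus write access to the shared tree from each host):

#!/bin/csh -f
# Sketch only: run the per-host setup step on every other host.
set hosts = (gpu1 gpu3 gpu4)
foreach h ($hosts)
    echo "=== $h ==="
    ssh $h /public/opt/nvhpc/Linux_x86_64/23.5/compilers/bin/add_network_host
end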
As an extra issue, having the “local” directories is actually a bit of a pain. What I’d really like is to have everything come off the NFS mount, with a “customization” not for each host but for “classes” of hosts that share the same configuration (e.g., all the hosts in a homogeneous cluster). For example:
/public/opt/nvhpc/Linux_x86_64/…
/public/opt/nvhpc/config1/…
/public/opt/nvhpc/config2/…
.
.
.
The documentation sounds like I could probably do this, but doesn’t go into detail:
Note: The makelocalrc command does allow the flexibility of having local directories with different names on different machines. However, using the same directory on different machines allows users to easily move executables between systems that use NVIDIA-supplied shared libraries.
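If that flexibility extends to sharing one localrc across a class of identical hosts, this is roughly what I have in mind. It’s purely a sketch: it assumes the generated localrc can simply be copied, and that the compilers honor an NVLOCALRC environment variable for pointing at a non-default localrc (I’ve seen that variable mentioned, but I haven’t verified it against 23.5).

#!/bin/csh -f
# Sketch only: build one localrc per hardware/OS class instead of per host.
# Run on a representative host of the class (here "config1"); paths match my layout above.
set NET    = /public/opt/nvhpc
set NVARCH = Linux_x86_64
set REL    = 23.5
# the makelocalrc invocation that worked for me; in my case it wrote
# localrc.<hostname> directly under $NET/$NVARCH/$REL
$NET/$NVARCH/$REL/compilers/bin/makelocalrc $NET/$NVARCH/$REL/compilers/bin -x -net /scr/nvhpc
# keep a per-class copy on the NFS mount
mkdir -p $NET/config1
cp $NET/$NVARCH/$REL/localrc.`hostname -s` $NET/config1/localrc

Each host in that class would then point the compilers at the class-wide file, e.g. setenv NVLOCALRC /public/opt/nvhpc/config1/localrc, again assuming NVLOCALRC is actually honored.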
Anyway, any explanation/clarification is welcome.