HPC SDK 23.5 "network" installation confusion

I’m trying to install the NVIDIA HPC SDK as a “network” installation. I have a mix of machines running CentOS 7, RHEL 8, and (eventually) RHEL 9, with a mix of CUDA 11.0, 11.2, and 12.1. (I’m ignoring the 11.2 machines because that CUDA version doesn’t seem to be included in the multi-CUDA HPC SDK tar file.)

I’m confused by the “network” install and would like to understand what I’m doing wrong.

In the following discussion:

/public/opt/nvhpc : a directory mounted via NFS to all hosts
/scr/nvhpc : a directory local to each host

I download the software:

https://developer.download.nvidia.com/hpc-sdk/23.5/nvhpc_2023_235_Linux_x86_64_cuda_multi.tar.gz

and the installation documentation:

https://docs.nvidia.com/hpc-sdk/pdf/hpc-sdk235install.pdf

Per the instructions, I set up to do a silent install:

[gpu2]$ cd /public/src/nvhpc_2023_235_Linux_x86_64_cuda_multi/
[gpu2]$ setenv NET /public/opt/nvhpc
[gpu2]$ setenv LCL /scr/nvhpc
[gpu2]$ setenv REL 23.5
[gpu2]$ 
[gpu2]$ setenv NVHPC_SILENT "true"
[gpu2]$ setenv NVHPC_INSTALL_DIR ${NET}
[gpu2]$ setenv NVHPC_INSTALL_TYPE "network"
[gpu2]$ setenv NVHPC_INSTALL_LOCAL_DIR ${LCL}
[gpu2]$ setenv NVARCH Linux_x86_64
[gpu2]$ setenv NVHPC_DEFAULT_CUDA 12.1
[gpu2]$ ./install
generating environment modules for NV HPC SDK 23.5 ... done.
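
(For reference, the bash/sh equivalent of the setenv lines above would be the exports below. This is only a translation of the same settings shown in the transcript, nothing extra.)

export NVHPC_SILENT=true
export NVHPC_INSTALL_DIR=/public/opt/nvhpc
export NVHPC_INSTALL_TYPE=network
export NVHPC_INSTALL_LOCAL_DIR=/scr/nvhpc
export NVARCH=Linux_x86_64
export NVHPC_DEFAULT_CUDA=12.1
./install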

Looking at what’s created in the NVHPC_INSTALL_DIR and NVHPC_INSTALL_LOCAL_DIR:

[gpu2]$ tree -L 4 $NVHPC_INSTALL_DIR
/public/opt/nvhpc
├── Linux_x86_64
│   ├── 2023 -> 23.5
│   └── 23.5
│       ├── cmake
│       │   ├── NVHPCConfig.cmake
│       │   └── NVHPCConfigVersion.cmake
│       ├── comm_libs
│       │   ├── 11.0
│       │   ├── 11.8
│       │   ├── 12.1
│       │   ├── hpcx
│       │   ├── mpi -> openmpi/openmpi-3.1.5
│       │   ├── nccl -> 12.1/nccl
│       │   ├── nvshmem -> 12.1/nvshmem
│       │   ├── openmpi
│       │   └── openmpi4
│       ├── compilers
│       │   ├── bin
│       │   ├── etc
│       │   ├── extras
│       │   ├── include
│       │   ├── include_acc
│       │   ├── include_man
│       │   ├── include-stdexec
│       │   ├── include-stdpar
│       │   ├── lib
│       │   ├── license
│       │   ├── man
│       │   ├── share
│       │   └── src
│       ├── cuda
│       │   ├── 11.0
│       │   ├── 11.8
│       │   ├── 12.1
│       │   ├── bin -> 12.1/bin
│       │   ├── include -> 12.1/include
│       │   ├── lib64 -> 12.1/lib64
│       │   └── nvvm -> 12.1/nvvm
│       ├── examples
│       │   ├── AutoPar
│       │   ├── CUDA-Fortran
│       │   ├── CUDA-Libraries
│       │   ├── F2003
│       │   ├── MPI
│       │   ├── NVLAmath
│       │   ├── OpenACC
│       │   ├── OpenMP
│       │   ├── README
│       │   └── stdpar
│       ├── localrc.gpu2
│       ├── localrc.gpu2-lock
│       ├── math_libs
│       │   ├── 11.0
│       │   ├── 11.8
│       │   ├── 12.1
│       │   ├── include -> 12.1/include
│       │   └── lib64 -> 12.1/lib64
│       ├── profilers
│       │   ├── Nsight_Compute
│       │   └── Nsight_Systems
│       └── REDIST
│           ├── comm_libs
│           ├── compilers
│           ├── cuda
│           └── math_libs
└── modulefiles
    ├── nvhpc
    │   └── 23.5
    ├── nvhpc-byo-compiler
    │   └── 23.5
    ├── nvhpc-hpcx
    │   └── 23.5
    ├── nvhpc-hpcx-cuda11
    │   └── 23.5
    ├── nvhpc-hpcx-cuda12
    │   └── 23.5
    └── nvhpc-nompi
        └── 23.5

67 directories, 11 files
[gpu2]$ tree -L 4 $NVHPC_INSTALL_LOCAL_DIR
/scr/nvhpc

0 directories, 0 files

That doesn’t look right to me. It certainly didn’t create anything in the “local” directory for the machine I did the install on, though it did create the localrc.gpu2 file.

The documentation then says to go to the other machines and do the following:

If your installation base directory is /opt/nvidia/hpc_sdk and /usr/nvidia/shared/23.5 is the common local directory, then run the following commands on each system on the network.

/opt/nvidia/hpc_sdk/$NVARCH/23.5/compilers/bin/makelocalrc -x /opt/nvidia/hpc_sdk/$NVARCH/23.5 -net /usr/nvidia/shared/23.5

These commands create a system-dependent file localrc.machinename in the /opt/nvidia/hpc_sdk/$NVARCH/23.5/compilers/bin directory. The commands also create the following three directories containing libraries and shared objects specific to the operating system and system libraries on that machine:

    /usr/nvidia/shared/23.5/lib
    /usr/nvidia/shared/23.5/liblf
    /usr/nvidia/shared/23.5/lib64

So, going to one of my other machines and substituting my directory locations, this becomes:

[gpu1]$ cd /public/src/nvhpc_2023_235_Linux_x86_64_cuda_multi/
[gpu1]$ setenv NET /public/opt/nvhpc
[gpu1]$ setenv LCL /scr/nvhpc
[gpu1]$ setenv REL 23.5
[gpu1]$ 
[gpu1]$ setenv NVHPC_SILENT "true"
[gpu1]$ setenv NVHPC_INSTALL_DIR ${NET}
[gpu1]$ setenv NVHPC_INSTALL_TYPE "network"
[gpu1]$ setenv NVHPC_INSTALL_LOCAL_DIR ${LCL}
[gpu1]$ setenv NVARCH Linux_x86_64
[gpu1]$ setenv NVHPC_DEFAULT_CUDA 12.1
[gpu1]$ $NET/$NVARCH/23.5/compilers/bin/makelocalrc -x $NET/$NVARCH/23.5 -net $LCL
/public/opt/nvhpc/Linux_x86_64/23.5/compilers/bin/makelocalrc: line 146: /public/opt/nvhpc/Linux_x86_64/23.5/nvaccelinfo: No such file or directory
find: '/public/opt/nvhpc/Linux_x86_64/23.5/../../cuda': No such file or directory
/public/opt/nvhpc/Linux_x86_64/23.5/compilers/bin/makelocalrc: line 158: bundled_cuda: bad array subscript

But that fails.

As best I can tell from reading the code, the documentation is wrong and the incantation should be:

[gpu1]$ $NET/$NVARCH/23.5/compilers/bin/makelocalrc $NET/$NVARCH/23.5/compilers/bin -x -net $LCL

That doesn’t choke, and the directories from the viewpoint of the second machine are:

[gpu1]$ tree -L 4 $NET
/public/opt/nvhpc
├── Linux_x86_64
│   ├── 2023 -> 23.5
│   └── 23.5
│       ├── cmake
│       │   ├── NVHPCConfig.cmake
│       │   └── NVHPCConfigVersion.cmake
│       ├── comm_libs
│       │   ├── 11.0
│       │   ├── 11.8
│       │   ├── 12.1
│       │   ├── hpcx
│       │   ├── mpi -> openmpi/openmpi-3.1.5
│       │   ├── nccl -> 12.1/nccl
│       │   ├── nvshmem -> 12.1/nvshmem
│       │   ├── openmpi
│       │   └── openmpi4
│       ├── compilers
│       │   ├── bin
│       │   ├── etc
│       │   ├── extras
│       │   ├── include
│       │   ├── include_acc
│       │   ├── include_man
│       │   ├── include-stdexec
│       │   ├── include-stdpar
│       │   ├── lib
│       │   ├── license
│       │   ├── man
│       │   ├── share
│       │   └── src
│       ├── cuda
│       │   ├── 11.0
│       │   ├── 11.8
│       │   ├── 12.1
│       │   ├── bin -> 12.1/bin
│       │   ├── include -> 12.1/include
│       │   ├── lib64 -> 12.1/lib64
│       │   └── nvvm -> 12.1/nvvm
│       ├── examples
│       │   ├── AutoPar
│       │   ├── CUDA-Fortran
│       │   ├── CUDA-Libraries
│       │   ├── F2003
│       │   ├── MPI
│       │   ├── NVLAmath
│       │   ├── OpenACC
│       │   ├── OpenMP
│       │   ├── README
│       │   └── stdpar
│       ├── localrc.gpu1
│       ├── localrc.gpu1-lock
│       ├── localrc.gpu2
│       ├── localrc.gpu2-lock
│       ├── math_libs
│       │   ├── 11.0
│       │   ├── 11.8
│       │   ├── 12.1
│       │   ├── include -> 12.1/include
│       │   └── lib64 -> 12.1/lib64
│       ├── profilers
│       │   ├── Nsight_Compute
│       │   └── Nsight_Systems
│       └── REDIST
│           ├── comm_libs
│           ├── compilers
│           ├── cuda
│           └── math_libs
└── modulefiles
    ├── nvhpc
    │   └── 23.5
    ├── nvhpc-byo-compiler
    │   └── 23.5
    ├── nvhpc-hpcx
    │   └── 23.5
    ├── nvhpc-hpcx-cuda11
    │   └── 23.5
    ├── nvhpc-hpcx-cuda12
    │   └── 23.5
    └── nvhpc-nompi
        └── 23.5

67 directories, 13 files
[gpu1]$ tree -L 4 $LCL
/scr/nvhpc [error opening dir]

0 directories, 0 files

So while makelocalrc did create a new localrc.gpu1, it didn’t create a local directory, and it certainly didn’t create the lib files that the documentation says should be there. (Poking around in makelocalrc, it doesn’t look like the script ever does anything with locdir, which is set by the -net argument.)

So… I then tried doing a non-silent install to see if that’s any different.

[gpu2]$ ./install

Welcome to the NVIDIA HPC SDK Linux installer!

You are installing NVIDIA HPC SDK 2023 version 23.5 for Linux_x86_64.
Please note that all Trademarks and Marks are the properties
of their respective owners.

Press enter to continue...


A network installation will save disk space by having only one copy of the
compilers and most of the libraries for all compilers on the network, and
the main installation needs to be done once for all systems on the network.

1  Single system install
2  Network install

Please choose install option: 
2

Please specify the directory path under which the software will be installed.
The default directory is /opt/nvidia/hpc_sdk, but you may install anywhere you wish,
assuming you have permission to do so.

Installation directory? [/opt/nvidia/hpc_sdk] 
/public/opt/nvhpc
Common local directory on all hosts for shared objects? [/public/opt/nvhpc/Linux_x86_64/23.5/share_objects]
/scr/nvhpc
Note: directory /scr/nvhpc was created.


Note: directory /public/opt/nvhpc was created.

Installing NVIDIA HPC SDK version 23.5 into /public/opt/nvhpc
Making symbolic link in /public/opt/nvhpc/Linux_x86_64

generating environment modules for NV HPC SDK 23.5 ... done.
Installation complete.
Please run add_network_host to create host specific localrc files:

   /public/opt/nvhpc/Linux_x86_64/23.5/compilers/bin/localrc.$host

on all other hosts you wish to run NVIDIA HPC SDK compilers.

For 64-bit NVIDIA HPC SDK compilers on 64-bit Linux systems, do the following:
    /public/opt/nvhpc/Linux_x86_64/23.5/compilers/bin/add_network_host

HPC SDK successfully installed into /public/opt/nvhpc
   .
   .
   .

which seems to install the same things into the two directories as my “silent” install did.

[gpu2]$ tree -L 4 $NET
/public/opt/nvhpc
├── Linux_x86_64
│   ├── 2023 -> 23.5
│   └── 23.5
│       ├── cmake
│       │   ├── NVHPCConfig.cmake
│       │   └── NVHPCConfigVersion.cmake
│       ├── comm_libs
│       │   ├── 11.0
│       │   ├── 11.8
│       │   ├── 12.1
│       │   ├── hpcx
│       │   ├── mpi -> openmpi/openmpi-3.1.5
│       │   ├── nccl -> 12.1/nccl
│       │   ├── nvshmem -> 12.1/nvshmem
│       │   ├── openmpi
│       │   └── openmpi4
│       ├── compilers
│       │   ├── bin
│       │   ├── etc
│       │   ├── extras
│       │   ├── include
│       │   ├── include_acc
│       │   ├── include_man
│       │   ├── include-stdexec
│       │   ├── include-stdpar
│       │   ├── lib
│       │   ├── license
│       │   ├── man
│       │   ├── share
│       │   └── src
│       ├── cuda
│       │   ├── 11.0
│       │   ├── 11.8
│       │   ├── 12.1
│       │   ├── bin -> 12.1/bin
│       │   ├── include -> 12.1/include
│       │   ├── lib64 -> 12.1/lib64
│       │   └── nvvm -> 12.1/nvvm
│       ├── examples
│       │   ├── AutoPar
│       │   ├── CUDA-Fortran
│       │   ├── CUDA-Libraries
│       │   ├── F2003
│       │   ├── MPI
│       │   ├── NVLAmath
│       │   ├── OpenACC
│       │   ├── OpenMP
│       │   ├── README
│       │   └── stdpar
│       ├── math_libs
│       │   ├── 11.0
│       │   ├── 11.8
│       │   ├── 12.1
│       │   ├── include -> 12.1/include
│       │   └── lib64 -> 12.1/lib64
│       ├── profilers
│       │   ├── Nsight_Compute
│       │   └── Nsight_Systems
│       └── REDIST
│           ├── comm_libs
│           ├── compilers
│           ├── cuda
│           └── math_libs
└── modulefiles
    ├── nvhpc
    │   └── 23.5
    ├── nvhpc-byo-compiler
    │   └── 23.5
    ├── nvhpc-hpcx
    │   └── 23.5
    ├── nvhpc-hpcx-cuda11
    │   └── 23.5
    ├── nvhpc-hpcx-cuda12
    │   └── 23.5
    └── nvhpc-nompi
        └── 23.5

67 directories, 9 files
[gpu2]$ tree -L 4 $LCL
/scr/nvhpc

0 directories, 0 files

Interestingly, the ./install program says nothing about running makelocalrc, but instead says the following:

For 64-bit NVIDIA HPC SDK compilers on 64-bit Linux systems, do the following:
    /public/opt/nvhpc/Linux_x86_64/23.5/compilers/bin/add_network_host
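
So on gpu1 that step is just running the script out of the shared install, exactly as the installer printed it (shown here for clarity, nothing else involved):

[gpu1]$ /public/opt/nvhpc/Linux_x86_64/23.5/compilers/bin/add_network_host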

Having done that and looked at NVHPC_INSTALL_DIR, the only difference is the appearance of localrc.gpu1 and localrc.gpu1-lock files in /public/opt/nvhpc/Linux_x86_64/23.5/compilers/bin.

Nothing appears in the local directory (but at least this time it created it):

[gpu1]$ tree -L 4 $LCL
/scr/nvhpc

0 directories, 0 files

I’m at a loss. I don’t think I’m getting what it says I should (i.e. the separate lib files), and it doesn’t look like there’s any customization (other than the localrc.host file) for the different hosts. I looked at the localrc.host file, and it mostly seems to be setting up which compiler to use. Like someone in an earlier post (HPCSDK 22.7 Installation issues - #6 by pjh40), we install multiple compilers and use environment modules to switch between them, so I’m not even sure the localrc.host setup is going to be that appropriate.
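
For context, here’s roughly how I’d expect to hook this into our environment-modules workflow from any host that mounts the share. This is just a sketch; I’m assuming the generated modulefiles under /public/opt/nvhpc/modulefiles work as-is, and nvc -V is only a smoke test:

[gpu1]$ module use /public/opt/nvhpc/modulefiles
[gpu1]$ module load nvhpc/23.5
[gpu1]$ nvc -V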

Is this the way it’s supposed to work?
What gets run on the “other” hosts? makelocalrc? add_network_host? Both? (I tried both and neither seemed to do anything useful.)

As an extra issue, having the “local” directories is actually a bit of a pain. What I’d really like is to have everything come off the NFS mount, with a “customization” not for each host but for “classes” of hosts that have the same configuration (e.g. all the hosts in a homogeneous cluster). For example:

/public/opt/nvhpc/Linux_x86_64/…
/public/opt/nvhpc/config1/…
/public/opt/nvhpc/config2/…
.
.
.

The documentation makes it sound like I could probably do this, but doesn’t go into detail:

Note: The makelocalrc command does allow the flexibility of having local directories with different names on different machines. However, using the same directory on different machines allows users to easily move executables between systems that use NVIDIA-supplied shared libraries. 
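
Assuming -net actually did something with the directory it’s given, the per-class setup I have in mind would look roughly like this. This is purely a sketch using the incantation that worked above; config1 is a made-up name for one class of hosts, and the command would be run once on a representative machine of that class:

setenv NET /public/opt/nvhpc
setenv NVARCH Linux_x86_64
# "config1" might be, say, the RHEL8 cluster nodes; the directory name is hypothetical
$NET/$NVARCH/23.5/compilers/bin/makelocalrc $NET/$NVARCH/23.5/compilers/bin -x -net $NET/config1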

Anyway, any explanation/clarification is welcome.

Hi Cfreese,

Sorry for the late response. I wanted to consult with our manufacturing folks. Looks like the docs are a bit out of date. There should be no need to install the compilers on each local system provided they have network access to the shared install directory.

They’re in the process of updating the docs to reflect this correction.

The main difference between the two is that a compiler configuration file (localrc.<hostname>) is generated for each system in a network install, versus a single localrc in a single-system install.

We are in the process of revamping this so that the localrc file will go to the user’s home directory (under a hidden directory) and be per-system. This should eliminate the issue of needing the network installation directory to be writable so that the localrc files can be created by users upon first use of the compilers on an individual system. It also means a single install method rather than two.

-Mat

No problem. Good answers take time. ;^)

I was afraid that the installation process was compiling code and that there would be incompatibilities due to dependencies on system libraries (e.g. RHEL7 with GLIBC_2.17 vs. RHEL8 with GLIBC_2.28). If not, then I’m good to go.

One thing to consider with putting the localrc file in the user’s home directory: we typically mount home directories across all machines, so a home directory would have to be able to hold different localrc files for different machines. This is probably easy to deal with if I can select the appropriate one via an environment variable (NV_LOCALRC?) that I set based on hostname when I log in, or via environment modules. Just something to think about. (I’d much rather do that than have to be root and install the localrc someplace like /etc or /usr/local.)
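
Something along these lines in a login script is what I’m picturing. This is entirely hypothetical; NV_LOCALRC is just my guess at a variable name, not an existing feature:

# hypothetical: pick a per-host localrc out of a shared home directory at login
setenv NV_LOCALRC ${HOME}/.nvhpc/localrc.`hostname -s`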