CMake support for the C++23 standard with NVHPC is coming soon; see CMake merge request !9859, "NVHPC: Record C++23 support".
Could NVIDIA please document the necessary --gcc-toolchain=/path/to/gcc12 so that C++23 etc. support is actually usable?
Hi Scivision,
Apologies, but I'm not clear on what you would like documented. We don't document CMake since it's not our product (though we do have folks who work with Kitware), and the "--gcc-toolchain" flag is documented, but not always necessary.
The core issue is that, in order to be interoperable with GNU g++, nvc++ must use the g++ header files. Hence our language-level support depends on the g++ version that nvc++ is configured to use. By default, nvc++ uses the system header files.
If you need to configure nvc++ to use a non-default g++, then you would either add "--gcc-toolchain" pointing to this location, or update the compiler's local config file (found under "~/.config/NVIDIA//localrc."). If you don't want the localrc file to apply to all compiles, you can create a new config file and then use the environment variable NVLOCALRC to point to the new config.
Documentation for this configuration process is something we lack; is this what you're looking for?
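As a minimal sketch of the two options above (the GCC location /opt/gcc-12, the config file name, and main.cpp are placeholders):

# Option 1: select the GCC toolchain per invocation
nvc++ --gcc-toolchain=/opt/gcc-12 -c main.cpp

# Option 2: keep a separate local config and select it via NVLOCALRC
export NVLOCALRC=$HOME/nvhpc/localrc.gcc12
nvc++ -c main.cpp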
Thanks,
Mat
I have a Fortran code with OpenACC that is designed to run either on a GPU or a multicore CPU simply by changing the compiler flag (-acc=gpu or -acc=multicore). Up to NVHPC version 24.5, the -acc=multicore -cuda flags correctly parallelized the code on CPU cores. However, starting from version 24.7 (and also in 24.9), the -acc=multicore -cuda flags parallelize it on the GPU. I can exclude the -cuda flag to prevent it from running on the GPU, but that's inconvenient because I have some variables declared as "pinned" that are not recognized without the -cuda flag. Was this an intentional change starting from NVHPC 24.7?
I believe it was. "-cuda" specifies CUDA Fortran, so it requires device code generation.
Though the workaround is actually the preferred method. Instead of having to create two different binaries, you can create a single "unified binary", i.e. "-acc=multicore,gpu", and toggle between the host and device targets via the environment variable "ACC_DEVICE_TYPE". Setting this to "host" will run using the multicore CPU, and "nvidia" will target the GPUs.
When not set, the default is to use "nvidia" when there's a supported GPU on the system, and "host" otherwise.
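For example (file and binary names are placeholders; the flags are the ones discussed above):

nvfortran -acc=multicore,gpu -cuda -o solver solver.f90   # -cuda kept for the CUDA Fortran "pinned" attribute
ACC_DEVICE_TYPE=host ./solver     # OpenACC regions run on the multicore CPU
ACC_DEVICE_TYPE=nvidia ./solver   # OpenACC regions run on the GPU (the default when a GPU is present)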
Many thanks, Mat! The problem was solved when I wanted the code to run solely on a multicore CPU by setting the environment variable ACC_DEVICE_TYPE=host along with the -acc=multicore,gpu flag.
The other issue I want to ask about is running the code in hybrid mode (multicore CPU + GPU). My code is a CFD solver that employs OpenACC and MPI, where each GPU handles one MPI process. Unfortunately, for some reason, there are sequences in the code that need to run on the multicore CPU. I can easily define such a region by using call acc_set_device_type(acc_device_host) or call acc_set_device_type(acc_device_nvidia), and compile using the -acc=multicore,gpu flag, as in the sketch below.
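A minimal, self-contained sketch of this pattern (array names and sizes are made up; acc_set_num_cores is the NVIDIA extension mentioned below), intended to be built with -acc=multicore,gpu:

program toggle_sketch
  use openacc
  implicit none
  integer, parameter :: n = 1024
  real :: a(n), b(n)
  integer :: i

  a = 1.0
  call acc_set_num_cores(4)                    ! host threads used by the multicore regions

  call acc_set_device_type(acc_device_host)    ! the next region runs on the multicore CPU
  !$acc parallel loop
  do i = 1, n
     a(i) = a(i) + 1.0
  end do

  call acc_set_device_type(acc_device_nvidia)  ! switch back to the GPU
  !$acc parallel loop copyin(a) copyout(b)
  do i = 1, n
     b(i) = 2.0*a(i)
  end do

  print *, b(n)
end program toggle_sketch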
So far, the code has run smoothly when using more than one GPU. The host parallel region is distributed across the number of cores set by acc_set_num_cores. What confuses me is that when I run the code with a single GPU (1 MPI process), it appears that the host part executes on only a single core. Are there any environment variable settings that need to be configured? Perhaps thread-binding options like OMP_PROC_BIND or something else?
“ACC_NUM_CORES” is the environment variable, but since you’re using the API call, you shouldn’t need to use it.
I'm wondering if this is more a binding issue with mpirun. What bind flag are you setting? Typically the default is "--bind-to core", in which case all the CPU processes will bind to the same core. I personally use "--bind-to none" and then use numactl to do the binding. Though you can also try "--bind-to socket".
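For example (illustrative commands only; the binary name, rank counts, and NUMA node numbers are placeholders and depend on the machine):

mpirun --bind-to socket -np 2 ./solver                                     # let each rank's host threads spread over a socket
mpirun --bind-to none -np 1 numactl --cpunodebind=0 --membind=0 ./solver   # unbind in mpirun, then bind with numactl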
If that's not it, then I'm not sure. It doesn't quite make sense that it would work with multiple ranks but not with one rank. There shouldn't be a difference.
I wish we supported the OpenACC “self” clause which says to run a loop in parallel on the host. The use case is rare, so it got pushed lower on the priority list. Though I’ll let engineering know. Your method works fine, it’s just a bit cumbersome. Another alternative is to use OpenMP for the host parallel loops.
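As a rough illustration of the OpenMP alternative (made-up arrays; assumes building with something like -acc=gpu -mp so the OpenACC region targets the GPU and the OpenMP loop uses the host cores):

program host_omp_sketch
  implicit none
  integer, parameter :: n = 1024
  real :: a(n), b(n)
  integer :: i

  a = 1.0
  !$acc parallel loop copy(a)   ! offloaded to the GPU when built with -acc=gpu
  do i = 1, n
     a(i) = a(i) + 1.0
  end do

  !$omp parallel do             ! host-only loop, parallelized across CPU cores via -mp
  do i = 1, n
     b(i) = 2.0*a(i)
  end do

  print *, b(n)
end program host_omp_sketch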
I executed the command with “mpirun --bind-to none -np 1 ./program”. I also tried changing the binding as you suggested and used numactl, but I still have not seen any improvement.
The thing is, for example, when I build the code with "call acc_set_num_cores(4)" and then execute it with "mpirun --bind-to none -np 2 ./program", it runs correctly on 2 GPUs, and each CPU process shows utilization of up to 400%. However, if I run the same binary with "mpirun --bind-to none -np 1 ./program", it runs on 1 GPU, but the CPU utilization only reaches 100%.
It’s not a big deal, though, because I use multiple GPUs for most cases. I’m just curious because, as you mentioned, if it works with multiple ranks, it should also work with one rank.
By the way, I cannot use “-acc=gpu -mp=multicore” right now, because we still need the code to run with OpenMP-MPI on CPU clusters that use a different compiler.
Thank you for your assistance, Mat. I really appreciate your help!
OK, I wish I could be of more help. Programmatically it sounds correct, which is why I thought it might be something environmental.