CMake support for the C++23 standard with NVHPC is coming soon; see CMake merge request !9859, "NVHPC: Record C++23 support".
Could NVIDIA please document the necessary --gcc-toolchain=/path/to/gcc12 so that C++23 etc. support is actually usable?
Hi Scivision,
Apologies, but I'm not clear on what you would like documented. We don't document CMake since it's not our product (though we do have folks who work with Kitware), and the "--gcc-toolchain" flag is documented, but not always necessary.
The core issue is that, in order to be interoperable with GNU g++, nvc++ must use the g++ header files. Hence our language-level support depends on the g++ version that nvc++ is configured to use. By default, nvc++ uses the system header files.
If you need to configure nvc++ to use a non-default g++, then you would either add "--gcc-toolchain" pointing to this location, or update the compiler's local config file (found under "~/.config/NVIDIA//localrc."). If you don't want the localrc file to apply to all compiles, you can create a new config file and then use the environment variable NVLOCALRC to point to the new config.
Documentation for this configuration process is something we lack; is this what you're looking for?
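As a minimal sketch of the two options above (the GCC location /opt/gcc-12, the config file name, and main.cpp are placeholders):

# Option 1: select the GCC toolchain per invocation
nvc++ --gcc-toolchain=/opt/gcc-12 -c main.cpp

# Option 2: keep a separate local config and select it via NVLOCALRC
export NVLOCALRC=$HOME/nvhpc/localrc.gcc12
nvc++ -c main.cpp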
Thanks,
Mat
I have a Fortran code with OpenACC that is designed to run either on a GPU or a multicore CPU simply by changing the compiler flag (-acc=gpu or -acc=multicore). Up to NVHPC version 24.5, the -acc=multicore -cuda flags correctly parallelized the code on CPU cores. However, starting from version 24.7 (and also in 24.9), the -acc=multicore -cuda flags parallelize it on the GPU. I can exclude the -cuda flag to prevent it from running on the GPU, but that's inconvenient because I have some variables declared as "pinned" that are not recognized without the -cuda flag. Was this an intentional change starting from NVHPC 24.7?
I believe it was. "-cuda" specifies CUDA Fortran, so it requires device code generation.
Though the workaround is actually the preferred method. Instead of having to create two different binaries, you can create a single "unified binary", i.e. "-acc=multicore,gpu", and toggle between the host and device targets via the environment variable "ACC_DEVICE_TYPE". Setting this to "host" will run using the multicore CPU, and "nvidia" will target the GPUs.
When not set, the default is to use "nvidia" when there's a supported GPU on the system, and "host" otherwise.
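For example (file and binary names are placeholders; the flags are the ones discussed above):

nvfortran -acc=multicore,gpu -cuda -o solver solver.f90   # -cuda kept for the CUDA Fortran "pinned" attribute
ACC_DEVICE_TYPE=host ./solver     # OpenACC regions run on the multicore CPU
ACC_DEVICE_TYPE=nvidia ./solver   # OpenACC regions run on the GPU (the default when a GPU is present)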
Many thanks, Mat! The problem was solved when I wanted the code to run solely on a multicore CPU by setting the environment variable ACC_DEVICE_TYPE=host along with the -acc=multicore,gpu flag.
The other issue I want to ask about is running the code in hybrid mode (multicore CPU + GPU). My code is a CFD solver that employs OpenACC and MPI, where each GPU handles one MPI process. Unfortunately, for some reason, there are sequences in the code that need to run on the multicore CPU. I can easily define such a region by using call acc_set_device_type(acc_device_host) or call acc_set_device_type(acc_device_nvidia), and compile using the -acc=multicore,gpu flag, as in the sketch below.
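A minimal, self-contained sketch of this pattern (array names and sizes are made up; acc_set_num_cores is the NVIDIA extension mentioned below), intended to be built with -acc=multicore,gpu:

program toggle_sketch
  use openacc
  implicit none
  integer, parameter :: n = 1024
  real :: a(n), b(n)
  integer :: i

  a = 1.0
  call acc_set_num_cores(4)                    ! host threads used by the multicore regions

  call acc_set_device_type(acc_device_host)    ! the next region runs on the multicore CPU
  !$acc parallel loop
  do i = 1, n
     a(i) = a(i) + 1.0
  end do

  call acc_set_device_type(acc_device_nvidia)  ! switch back to the GPU
  !$acc parallel loop copyin(a) copyout(b)
  do i = 1, n
     b(i) = 2.0*a(i)
  end do

  print *, b(n)
end program toggle_sketch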
So far, the code has run smoothly when using more than one GPU. The host parallel region is distributed across the number of cores set by acc_set_num_cores. What confuses me is that when I run the code with a single GPU (1 MPI process), it appears that the host part executes on only a single core. Are there any environment variable settings that need to be configured? Perhaps thread-binding options like OMP_PROC_BIND or something else?
“ACC_NUM_CORES” is the environment variable, but since you’re using the API call, you shouldn’t need to use it.
I'm wondering if this is more a binding issue with mpirun. What bind flag are you setting? Typically the default is "--bind-to core", in which case all the CPU processes will bind to the same core. I personally use "--bind-to none" and then use numactl to do the binding. Though you can also try "--bind-to socket".
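For example (illustrative commands only; the binary name, rank counts, and NUMA node numbers are placeholders and depend on the machine):

mpirun --bind-to socket -np 2 ./solver                                     # let each rank's host threads spread over a socket
mpirun --bind-to none -np 1 numactl --cpunodebind=0 --membind=0 ./solver   # unbind in mpirun, then bind with numactl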
If that's not it, then I'm not sure. It doesn't quite make sense that it would work with multiple ranks but not with one rank. There shouldn't be a difference.
I wish we supported the OpenACC “self” clause which says to run a loop in parallel on the host. The use case is rare, so it got pushed lower on the priority list. Though I’ll let engineering know. Your method works fine, it’s just a bit cumbersome. Another alternative is to use OpenMP for the host parallel loops.
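As a rough illustration of the OpenMP alternative (made-up arrays; assumes building with something like -acc=gpu -mp so the OpenACC region targets the GPU and the OpenMP loop uses the host cores):

program host_omp_sketch
  implicit none
  integer, parameter :: n = 1024
  real :: a(n), b(n)
  integer :: i

  a = 1.0
  !$acc parallel loop copy(a)   ! offloaded to the GPU when built with -acc=gpu
  do i = 1, n
     a(i) = a(i) + 1.0
  end do

  !$omp parallel do             ! host-only loop, parallelized across CPU cores via -mp
  do i = 1, n
     b(i) = 2.0*a(i)
  end do

  print *, b(n)
end program host_omp_sketch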
I executed the command with “mpirun --bind-to none -np 1 ./program”. I also tried changing the binding as you suggested and used numactl, but I still have not seen any improvement.
The thing is, for example, when I build the code with "call acc_set_num_cores(4)" and then execute it with "mpirun --bind-to none -np 2 ./program", it runs correctly on 2 GPUs, and each CPU process shows utilization of up to 400%. However, if I run the same binary with "mpirun --bind-to none -np 1 ./program", it runs on 1 GPU, but the CPU utilization only reaches 100%.
It’s not a big deal, though, because I use multiple GPUs for most cases. I’m just curious because, as you mentioned, if it works with multiple ranks, it should also work with one rank.
By the way, I cannot use “-acc=gpu -mp=multicore” right now, because we still need the code to run with OpenMP-MPI on CPU clusters that use a different compiler.
Thank you for your assistance, Mat. I really appreciate your help!
OK, I wish I could be of more help. Programmatically it sounds correct, which is why I thought it might be something environmental.