Hello,
I am working on porting a large proprietary code written in modern Fortran to GPU, using hybrid MPI/OpenMP parallelization and OpenACC for GPU acceleration. Our system consists of 3 GPU nodes, each with 8x NVIDIA RTX A5000 GPUs, 2x AMD EPYC 7543 32-core CPUs, and 2 TB of RAM. For internodal connection we use 100G InfiniBand. We recently moved from CentOS 7 to Rocky Linux 8.7 for future-compatibility reasons and are currently setting up our working environment.
I installed the NVIDIA HPC SDK 24.1 as root using yum, following the instructions on the NVIDIA website. The .bashrc file was set up in the following way, as suggested by the guidelines:
NVARCH=`uname -s`_`uname -m`; export NVARCH
NVCOMPILERS=/opt/nvidia/hpc_sdk; export NVCOMPILERS
MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/24.1/compilers/man; export MANPATH
PATH=$NVCOMPILERS/$NVARCH/24.1/compilers/bin:$PATH; export PATH
export PATH=$NVCOMPILERS/$NVARCH/24.1/comm_libs/mpi/bin:$PATH
export MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/24.1/comm_libs/mpi/man
export MODULEPATH=$NVCOMPILERS/modulefiles:$MODULEPATH
module load nvhpc-hpcx
The NVIDIA driver version is 535.154.05.
The code was compiled with the /opt/nvidia/hpc_sdk/Linux_x86_64/24.1/comm_libs/mpi/bin/mpifort wrapper.
The nvfortran-specific compiler flags are the following:
-fast -O3 -mp -cuda -acc -gpu=cc86,deepcopy,cuda12.3,lineinfo -gopt -traceback -Minfo=accel -cpp -Mlarge_arrays -Mbackslash
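For reference, here is a minimal, self-contained sketch (illustrative names only, not taken from our code) of the kind of pattern the -gpu=deepcopy option in the flag list is meant to handle: a derived type whose allocatable component should follow its parent across an OpenACC data region.

module demo_types
   implicit none
   type :: field_t
      real, allocatable :: u(:)
   end type field_t
end module demo_types

program deepcopy_demo
   use demo_types
   implicit none
   type(field_t) :: f
   integer :: i
   allocate(f%u(1000))
   f%u = 1.0
   !$acc data copy(f)            ! with -gpu=deepcopy, f%u is copied along with f
   !$acc parallel loop
   do i = 1, 1000
      f%u(i) = 2.0*f%u(i)
   end do
   !$acc end data
   print *, 'f%u(1) =', f%u(1)   ! expect 2.0 after copy-out
end program deepcopy_demo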
The code was then executed on 2 GPU nodes using the following command line:
mpirun --bind-to none -N 32 -x CUDA_LAUNCH_BLOCKING=1 --mca pml ucx --mca osc ucx -hostfile hostfile_gpu sh gpu_script_rank.sh fortran_exe_file input_file.inp
The hostfile is below:
host02
host03
The gpu_script_rank.sh file is the following:
#!/bin/bash
let ngpus=4
if [[ -n ${OMPI_COMM_WORLD_LOCAL_RANK} ]]
then
    let lrank=${OMPI_COMM_WORLD_LOCAL_RANK}
    let device=$lrank/$ngpus
    export CUDA_VISIBLE_DEVICES=$device
fi
echo $lrank $device $CUDA_VISIBLE_DEVICES
echo "$@"
# launch the executable with its arguments
"$@"
The purpose of the script is to prevent the creation of wasteful ~260 MB processes on the default GPU (this solution was suggested in another topic on this forum). It also lets me run exactly 4 processes per GPU, consistent with the 32 MPI processes per node and 8 GPUs per node.
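For completeness, the same per-rank device selection could in principle be done inside the code with the OpenACC runtime API instead of a wrapper script. The sketch below is illustrative only (not our actual code) and assumes Open MPI exports OMPI_COMM_WORLD_LOCAL_RANK to the process environment before the program starts:

program bind_rank_to_gpu
   use openacc
   implicit none
   integer, parameter :: ranks_per_gpu = 4   ! same grouping as in gpu_script_rank.sh
   character(len=16) :: env
   integer :: lrank, stat

   call get_environment_variable('OMPI_COMM_WORLD_LOCAL_RANK', env, status=stat)
   if (stat == 0) then
      read(env, *) lrank
      ! pick the GPU for this local rank and create the context there
      call acc_set_device_num(lrank/ranks_per_gpu, acc_device_nvidia)
      call acc_init(acc_device_nvidia)
   end if
   ! ... MPI_Init and the rest of the application would follow here ...
end program bind_rank_to_gpu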
Prior to execution, I initialized HPCX using the following lines, as suggested in this forum topic:
source /opt/nvidia/hpc_sdk/Linux_x86_64/24.1/comm_libs/12.3/hpcx/hpcx-2.17.1/hpcx-mt-init-ompi.sh
hpcx_load
So, based on the command line, we expect the code to use UCX for internodal communication. Prior to the OS change, we were using HPC SDK 23.5 with the BTL transport, which was very slow and penalizing. After the upgrade and the setup described above, we launched our code and finally observed excellent communication speed. However, to our disappointment, the code crashed after the second iteration step with the following segfault message:
[pgpu02:418924:0:418924] Caught signal 11 **(Segmentation fault: Sent by the kernel at address (nil))**
==== backtrace (tid: 418924) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x000000000009d001 __GI___libc_free() :0
2 0x00000000004a65a2 mod_coup_fb_mp_() /home/code_address/MOD_COUP.f90:4914
3 0x0000000000513566 mod_evp_coup_driver_() /home/code_address/MOD_EVP.f90:1316
4 0x0000000000510b18 mod_evp_sol_moc_() /home/code_address/MOD_EVP.f90:603
5 0x00000000004c1d4c mod_dep_dep_main_() /home/code_address/MOD_DEP.f90:109
6 0x0000000000421671 MAIN_() /home/code_address/MAIN.f90:88
7 0x000000000041ca31 main() ???:0
8 0x000000000003ad85 __libc_start_main() ???:0
9 0x000000000041c39e _start() ???:0
=================================
[pgpu02:418924] *** Process received signal ***
[pgpu02:418924] Signal: Segmentation fault (11)
[pgpu02:418924] Signal code: (-6)
[pgpu02:418924] Failing at address: 0x3eb0006646c
[pgpu02:418924] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x153efbd3fcf0]
[pgpu02:418924] [ 1] /lib64/libc.so.6(cfree+0x21)[0x153efb155001]
[pgpu02:418924] [ 2] /home/code_address/bld/fortran_exe_file (mod_coup_fb_mp_+0x9ee2)[0x4a65a2]
[pgpu02:418924] [ 3] /home/code_address/bld/fortran_exe_file (mod_evp_coup_driver_+0x266)[0x513566]
[pgpu02:418924] [ 4] /home/code_address/bld/fortran_exe_file (mod_evp_sol_moc_+0xd58)[0x510b18]
[pgpu02:418924] [ 5] /home/code_address/bld/fortran_exe_file (mod_dep_dep_main_+0x70c)[0x4c1d4c]
[pgpu02:418924] [ 6] /home/code_address/bld/fortran_exe_file (MAIN_+0x71)[0x421671]
[pgpu02:418924] [ 7] /home/code_address/bld/fortran_exe_file (main+0x31)[0x41ca31]
[pgpu02:418924] [ 8] /lib64/libc.so.6(__libc_start_main+0xe5)[0x153efb0f2d85]
[pgpu02:418924] [ 9] /home/code_address/bld/fortran_exe_file (_start+0x2e)[0x41c39e]
[pgpu02:418924] *** End of error message ***
gpu_script_rank.sh: line 12: 418924 Segmentation fault (core dumped) "$@"
I have seen various segfaults in my life, but never such a vague message as the one shown in bold, and an internet search for what it could mean was not helpful. Line 4914 of the module is the following:
4913 return
4914 end subroutine FB_MP
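For what it is worth, my current (unconfirmed) reading of the backtrace is that the free() call comes from the implicit deallocation of local allocatable/automatic arrays at the end of FB_MP, so line 4914 may only be where earlier heap damage becomes visible rather than where it is caused. A toy example of that pattern, purely illustrative and not our code:

module mod_demo
   implicit none
contains
   subroutine fb_mp_like(n)
      integer, intent(in) :: n
      real, allocatable :: work(:)
      integer :: i
      allocate(work(n))
      do i = 1, n + 8            ! bug: writes past the end of work
         work(i) = real(i)
      end do
   end subroutine fb_mp_like     ! implicit deallocate(work) -> free() can crash here
end module mod_demo

program demo
   use mod_demo
   implicit none
   call fb_mp_like(1000)
end program demo

If that reading is correct, rebuilding with bounds checking (e.g., -O0 -g -Mbounds) should point at the real culprit; but since the same code ran fine before the OS and communication change, I cannot rule out a runtime or UCX side to this, hence the questions below.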
My questions are:
- Is the setup of NVIDIA HPC SDK 24.1 with HPCX and Open MPI described above correct? Is there anything we should change, add, or omit?
- What is the meaning of the highlighted segfault message? Is it a code issue or a compiler/runtime issue? The MOD_COUP module is CPU-only, with no GPU instructions. The code always breaks in that particular place, whereas with our previous OS setup and communication type it worked fine (only the connection speed was slow).
- Is there a way to use UCX communication in NVIDIA HPC SDK 24.1 without using HPCX?
Any help or suggestions on the config and error message above would be greatly appreciated.