Running HPCX-OpenMPI included in NVIDIA HPC SDK 24.1 causes unusual segfault

Hello,
I am working on porting a large proprietary code written in modern Fortran to GPUs, using hybrid MPI/OpenMP parallelization and OpenACC for GPU acceleration. Our system consists of 3 GPU nodes, each with 8x NVIDIA RTX A5000 GPUs, 2x AMD EPYC 7543 32-core CPUs, and 2 TB of RAM. The nodes are connected via 100 Gb/s InfiniBand. We recently moved from CentOS 7 to Rocky Linux 8.7 for future compatibility reasons, and are currently setting up our working environment.

I installed the NVIDIA HPC SDK 24.1 as root using yum, as described on the NVIDIA website. The .bashrc file was set up in the following way, as suggested by the guidelines:

NVARCH=`uname -s`_`uname -m`; export NVARCH
NVCOMPILERS=/opt/nvidia/hpc_sdk; export NVCOMPILERS
MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/24.1/compilers/man; export MANPATH
PATH=$NVCOMPILERS/$NVARCH/24.1/compilers/bin:$PATH; export PATH

export PATH=$NVCOMPILERS/$NVARCH/24.1/comm_libs/mpi/bin:$PATH
export MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/24.1/comm_libs/mpi/man

export MODULEPATH=$NVCOMPILERS/modulefiles:$MODULEPATH
module load nvhpc-hpcx

Driver Version: 535.154.05

The code was compiled with the /opt/nvidia/hpc_sdk/Linux_x86_64/24.1/comm_libs/mpi/bin/mpifort wrapper.
The nvfortran-specific compiler flags are the following:

-fast -O3 -mp -cuda -acc -gpu=cc86,deepcopy,cuda12.3,lineinfo -gopt -traceback -Minfo=accel -cpp -Mlarge_arrays -Mbackslash 

After that, the code was executed using the following command line for 2 GPU nodes:

mpirun --bind-to none -N 32 -x CUDA_LAUNCH_BLOCKING=1 --mca pml ucx --mca osc ucx -hostfile hostfile_gpu sh gpu_script_rank.sh fortran_exe_file input_file.inp

The hostfile is below:

host02
host03 

The gpu_script_rank.sh file is the following:

#!/bin/bash
let ngpus=4
if [[ -n ${OMPI_COMM_WORLD_LOCAL_RANK} ]]
then
let lrank=${OMPI_COMM_WORLD_LOCAL_RANK}
let device=$lrank/$ngpus
export CUDA_VISIBLE_DEVICES=$device
fi
echo $lrank $device $CUDA_VISIBLE_DEVICES
echo "$@"
# x
"$@"

The purpose of the script is to prevent the creation of stray processes of rank 0, 260 MB each, on the default GPU (this solution was provided in one of the topics on this forum). The script also lets me run exactly 4 processes per GPU, consistent with 32 MPI processes per node.
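For clarity, the mapping the script computes is plain integer division: with 4 ranks per GPU, local ranks 0-31 land on devices 0-7. A standalone sketch of just the arithmetic (the rank values are hypothetical, chosen to illustrate the boundaries):

```shell
# Integer division maps each local rank to a device:
# ranks 0-3 -> GPU 0, ranks 4-7 -> GPU 1, ..., ranks 28-31 -> GPU 7.
ranks_per_gpu=4
for lrank in 0 3 4 31; do
  device=$((lrank / ranks_per_gpu))
  echo "local rank $lrank -> GPU $device"
done
```

Note that the `ngpus` variable in the script is really the number of ranks per GPU, not the GPU count; the behavior is correct either way.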

Prior to execution, I initialized HPCX using the following lines, as suggested in this forum topic:

source /opt/nvidia/hpc_sdk/Linux_x86_64/24.1/comm_libs/12.3/hpcx/hpcx-2.17.1/hpcx-mt-init-ompi.sh
hpcx_load

So, based on the command line, we expect the code to use UCX for inter-node communication. Before the OS change, we were using HPC SDK 23.5 with the BTL transport, which was very slow and penalizing. After the upgrade and the setup described above, we launched our code and finally observed great communication speed. To our disappointment, however, the code crashed after the 2nd iteration step with the following segfault message:

[pgpu02:418924:0:418924] Caught signal 11 **(Segmentation fault: Sent by the kernel at address (nil))**
==== backtrace (tid: 418924) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x000000000009d001 __GI___libc_free()  :0
 2 0x00000000004a65a2 mod_coup_fb_mp_()  /home/code_address/MOD_COUP.f90:4914
 3 0x0000000000513566 mod_evp_coup_driver_()  /home/code_address/MOD_EVP.f90:1316
 4 0x0000000000510b18 mod_evp_sol_moc_()  /home/code_address/MOD_EVP.f90:603
 5 0x00000000004c1d4c mod_dep_dep_main_()  /home/code_address/MOD_DEP.f90:109
 6 0x0000000000421671 MAIN_()  /home/code_address/MAIN.f90:88
 7 0x000000000041ca31 main()  ???:0
 8 0x000000000003ad85 __libc_start_main()  ???:0
 9 0x000000000041c39e _start()  ???:0
=================================
[pgpu02:418924] *** Process received signal ***
[pgpu02:418924] Signal: Segmentation fault (11)
[pgpu02:418924] Signal code:  (-6)
[pgpu02:418924] Failing at address: 0x3eb0006646c
[pgpu02:418924] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x153efbd3fcf0]
[pgpu02:418924] [ 1] /lib64/libc.so.6(cfree+0x21)[0x153efb155001]
[pgpu02:418924] [ 2] /home/code_address/bld/fortran_exe_file (mod_coup_fb_mp_+0x9ee2)[0x4a65a2]
[pgpu02:418924] [ 3] /home/code_address/bld/fortran_exe_file (mod_evp_coup_driver_+0x266)[0x513566]
[pgpu02:418924] [ 4] /home/code_address/bld/fortran_exe_file (mod_evp_sol_moc_+0xd58)[0x510b18]
[pgpu02:418924] [ 5] /home/code_address/bld/fortran_exe_file (mod_dep_dep_main_+0x70c)[0x4c1d4c]
[pgpu02:418924] [ 6] /home/code_address/bld/fortran_exe_file (MAIN_+0x71)[0x421671]
[pgpu02:418924] [ 7] /home/code_address/bld/fortran_exe_file (main+0x31)[0x41ca31]
[pgpu02:418924] [ 8] /lib64/libc.so.6(__libc_start_main+0xe5)[0x153efb0f2d85]
[pgpu02:418924] [ 9] /home/code_address/bld/fortran_exe_file (_start+0x2e)[0x41c39e]
[pgpu02:418924] *** End of error message ***
gpu_script_rank.sh: line 12: 418924 Segmentation fault      (core dumped) "$@"

I have seen various segfaults in my life, but never such a vague message as the one shown in bold. An internet search for what it could be was not helpful. Line 4914 of the code module is the following:

  4913            return
  4914            end subroutine FB_MP

My questions are:

  1. Is the setup of NVIDIA HPC SDK 24.1 with HPCX and OpenMPI provided above correct? Is there anything we should change/add/omit?
  2. What is the meaning of the highlighted segfault message? Is it a code issue or a compiler-related message? The MOD_COUP module is CPU-only, with no GPU instructions. The code always breaks in that particular place, whereas with our previous OS and communication setup it worked fine (only the connection speed was slow).
  3. Is there a way to use UCX communication in NVIDIA HPC SDK 24.1 without using HPCX?

Any help or suggestions on the config and error message above would be greatly appreciated.

Seg faults are very generic, so the root cause could be any number of things. Here, though, it’s seg faulting in “funlockfile”, which locks and unlocks stdio files, so most likely the file pointer is null.

Why this is occurring I’m not sure, but my best guess is that it’s some memory corruption issue.

Stack overflows are a common problem with OpenMP, so first try setting the “OMP_STACK_SIZE” environment variable in your bash script to a large value (like 500MB), and also set the shell stack size limit to “unlimited”. Stack overflows typically occur on entry to a subroutine, not on exit, so this may not be the problem, but it’s easy to try.

The next thing I’d try is to run the program under the valgrind utility, i.e. “mpirun … sh gpu_script_rank.sh valgrind fortran_exe_file”.

If you can reproduce the error with one rank, that would be preferred. If not, that’s fine, but the valgrind output from each rank will be mixed together, making it a bit hard to read. In both cases, pipe the output to a text file for review.

Valgrind is a great utility for finding memory corruption issues like out-of-bounds memory accesses or uninitialized memory reads (UMRs). It does tend to give some false positives, such as when MPI is getting initialized, which you can ignore. Focus on the output just before the segv.
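One way to keep the per-rank output separate instead of mixed together: valgrind expands %q{VAR} in --log-file to the value of that environment variable, so each rank gets its own log file. A sketch, assuming Open MPI exports OMPI_COMM_WORLD_RANK to every process (as it does in recent versions):

```shell
mpirun --bind-to none -N 32 --mca pml ucx --mca osc ucx -hostfile hostfile_gpu \
  sh gpu_script_rank.sh \
  valgrind --log-file=valgrind_rank_%q{OMPI_COMM_WORLD_RANK}.log \
  fortran_exe_file input_file.inp
```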

Try these for now but if neither helps, let me know and we can think of other things to try.

Dear Mat,

Thank you for providing insights on possible causes of the problem. Here are the observations that we found after trying your suggestions, as well as some other things.

  1. I increased the OMP_STACK_SIZE by adding the following line to our execution command: "-x OMP_STACK_SIZE=2G". Unfortunately, it did not solve the problem and only consumed more system memory.
  2. I ran the code with valgrind as you suggested, by adding the “sh gpu_script_rank.sh valgrind --log-file=valgrind_log_28-x.log fortran_exe_file” command line. I attach the valgrind logs below.
    a) When running with valgrind, the code did not get anywhere near the point where it crashed before. With many MPI processes, the code crashed because it ran out of memory. Our problem size is large and consumes around 1 TB of the 2 TB installed per node, but I am unsure why all the remaining memory vanished.
    b) When running with a few MPI processes, our code crashed early on due to an error specific to the calculation process we perform. Basically, it said that the material data saved in the data arrays was incorrect, which means your guess about memory corruption is very likely correct.
  3. In the code, there is a very strange inherited feature: subroutines nested inside other subroutines. These nested subroutines have no variable declarations, so we can assume they rely on “host association” of some sort. I found a perfect example of what is happening in our code in the following Stack Overflow question: https://stackoverflow.com/questions/68795343/using-varibles-in-subroutines-without-passing-them-in-fortran . Unfortunately, that question was marked as a duplicate, so it never got a proper answer. The most important part, in my opinion, is this quote:
Very recently I tried to compile it with <compiler_name> but on multiple occasions, the variables in (sub)subroutine contained rubbish data (but not all the time).

This is what I observed in the code: some variables inside the nested subroutines took incorrect values (usually a large negative number), as if they were uninitialized. I believe this behavior is causing the problems in the code, but it is really hard to catch.
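For readers unfamiliar with the pattern, here is a minimal sketch (hypothetical names, not our actual code) of such a nested subroutine relying on host association; one plausible way corruption can arise under hybrid MPI/OpenMP is noted in the comments:

```fortran
module mod_demo
contains
  subroutine outer()
    implicit none
    real    :: work(100)
    integer :: i
    work = 0.0
    call inner()        ! inner() sees work and i from outer()'s scope
    print *, sum(work)
  contains
    subroutine inner()
      ! No declarations here: work and i are host-associated from outer().
      ! If outer() runs inside an OpenMP parallel region, these variables
      ! are shared between threads unless explicitly privatized, which can
      ! produce exactly the kind of "rubbish data" described above.
      do i = 1, 100
        work(i) = real(i)
      end do
    end subroutine inner
  end subroutine outer
end module mod_demo
```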

This is why I wanted to ask your advice on further steps we could try:
4. I wonder if we could use some other debugger to better track the problem. In particular, what is the way to use gdb for our problem type and MPI environment? In your opinion, would it be helpful?
5. Do you recall any issues reported on the nested subroutine usage, similar to the quote I posted above? Is using such subroutines considered normal in nvfortran, or should we work on converting them into proper module-declared subroutines?
6. Surprisingly, when I compiled the code with the flag “-gpu=pinned”, the code ran flawlessly without any segfaults. However, when I tried a different problem, it broke even with all memory allocated as pinned, this time pointing to a useful line of code within the FB_MP subroutine and with a more informative error:

Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x1544483a0df8)

This is why I assume that something is going on with the nested subroutines. I will investigate the error further. We are happy to know that we can allocate all our memory as pinned, but it is not very practical, as allocation and deallocation take longer. Moreover, it does not really solve the problem; it just avoids triggering it for certain input files. I do wonder why it helps in that particular case, though.

Best,
Siarhei

valgrind_log_28-2.log (10.0 KB)
valgrind_log_28-3.log (75.0 KB)

Using gdb is only really useful when running a single rank. If you can reduce the problem size so that it runs with a single rank and still reproduces the problem, then it might help.
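For a single-rank reproducer, one minimal non-interactive invocation (a sketch; the executable and input file names are taken from the thread, and the flags may need adjusting for your setup) would be:

```shell
# -batch exits after the commands finish; -ex run starts the program and
# -ex bt prints a backtrace when it stops (e.g., at the segfault).
mpirun -np 1 gdb -batch -ex run -ex bt --args ./fortran_exe_file input_file.inp
```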

Otherwise, I’d try using valgrind again, but with the full rank count. It will mean more log files, but hopefully give some useful info.

  1. Do you recall any issues reported on the nested subroutine usage, similar to the quote I posted above? Is using such subroutines considered normal in nvfortran, or should we work on converting them into proper module-declared subroutines?

In general, contained subroutines are fine to use. They are a somewhat older style of programming but still comply with the standard, and you’ll see them used in many Fortran programs.

The only issue with contained subroutines is that they can’t be OpenACC “device” routines when the containing routine is not a “device” routine itself. The problem is that contained subroutines receive a hidden pointer to the caller’s stack, which gives them access to the caller’s variables. However, this is a pointer to a host address, which isn’t accessible on the device.
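A sketch of that restriction (hypothetical names, illustrating the mechanism described above):

```fortran
subroutine host_sub()
  implicit none
  real    :: a(10)
  integer :: i
  !$acc parallel loop copy(a)   ! offloaded loop inside the host routine
  do i = 1, 10
    a(i) = 1.0
  end do
  call helper()                 ! fine: called on the host
contains
  subroutine helper()
    ! Marking this "!$acc routine seq" and calling it from device code
    ! would not work: helper() receives a hidden pointer to host_sub()'s
    ! CPU stack (which is how it reaches a), and that host address is
    ! not accessible on the GPU.
    a = a * 2.0                 ! host association to a
  end subroutine helper
end subroutine host_sub
```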

  1. Surprisingly, when I compiled the code with the flag “-gpu=pinned”, the code worked flawlessly without any segfaults. However, when I tried a different problem, it broke even with all memory allocated as pinned, this time pointing to a useful line of code within the FB_MP subroutine with a more appropriate error:

In order to transfer memory to the device, the memory must be non-swappable. If it’s in virtual memory, the OS could swap it mid-transfer.

By default, we use a double-buffering system that copies the virtual memory to a set of pinned memory buffers and then does the transfer.

What “pinned” does is allocate the memory directly in the host’s physical memory, so these buffers aren’t needed. The downside of “pinned” is that physical memory is finite and allocation can take longer. However, if a program has few allocations but many data transfers, it can help performance.

The fact that your program works with “pinned”, in at least one case, indicates to me that there is indeed some memory corruption or uninitialized memory.

This is why I assume that there is something going on with the nested subroutines.

Very possible. Though the seg fault is on the host side, so whatever is happening is happening in the host code, not on the device.