Segmentation fault with NVHPC 25.1

I am running a model built using OpenMPI 5.0.7 and NVHPC 25.1. It builds successfully, but I get the following runtime error:

[ng10503:3038313:0:3038313] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x8a0aec8)
==== backtrace (tid:3038331) ====
0 0x0000000000038790 __sigaction() ???:0
1 0x00000000003b26cb pgf90_set_intrin_type_i8() ???:0

The backtrace indicates the fault is occurring with pgf90_set_intrin_type_i8(), so I am wondering if this is a compiler issue. This error occurs for both CPU runs and GPU runs. This error also occurs when using NVHPC 25.1 with OpenMPI 5.0.3.

This error does not occur for CPU runs with OpenMPI 4.1.5 and NVHPC 23.9. However I am unable to perform GPU runs with this combination because of limited GPU support in OpenMPI 4.x, so I am attempting to work with more recent versions of NVHPC and OpenMPI.

Do you have a reproducing example?

Chris let me know that one of our colleagues found a bug in 5.0.6 which causes build failures, which is why he wanted to get more info about your build in the previous post.

Now this is a runtime failure and you’re using 5.0.7, so I’d be inclined to say it’s a compiler issue, but given it only occurs in 5.0.7, it could be an OpenMPI issue. Difficult to say without a way to reproduce the error.

-Mat

As I mentioned in my OP, the error occurs with both OpenMPI 5.0.3 and 5.0.7.

The climate model I am running is Earthworks, which has lots of dependencies. I can walk you through setting it up if you want to try to reproduce the error on your end.

My mistake.

Let me see if Chris can get me a build of OpenMPI 5 to run against my MPI benchmarks to see if anything pops up. If I can’t recreate it there, I’ll then try to build Earthworks.

My initial tests show no issues so I’ll need to try to reproduce what you’re doing.

Is this the correct repo for EarthWorks? https://github.com/EarthWorksOrg/EarthWorks

I tried following the build instructions, but I’m not sure what to put in for the case, project, res, etc. Any guidance would be appreciated.

I see that EarthWorks seems to be a composite of other weather apps like CAM, MPAS, CICE, etc. Is the error coming when you run one of these?

Ok, thanks for running those tests. But before we try to compile Earthworks on your system, I was thinking of trying to run a simple program that reproduces the segmentation fault. So I created the following C program:

#include <stdio.h>

void pgf90_set_intrin_type_i8();

int main() {
    printf("Starting...\n");
    pgf90_set_intrin_type_i8();
    printf("Done!\n");
    return 0;
}

I saved it as intrin_test.c and compiled it with the following command:

nvc intrin_test.c /path/to/nvhpc/25.1/lib/libnvf.so

It compiles successfully but produces a segmentation fault at runtime. But that could be just because pgf90_set_intrin_type_i8 hasn’t been declared or used correctly. So is my declaration/usage of pgf90_set_intrin_type_i8 correct? I do not have access to the source code that generated libnvf.so (it’s not open source after all :), so I don’t know the interface for pgf90_set_intrin_type_i8. Based on the contents of Linux_x86_64/25.1/compilers/lib/libnvf.ipl in the nvhpc installation package, the source for libnvf.so appears to be src/libpgf90.c.

“pgf90_set_intrin_type_i8” is a Fortran runtime call used to set the type of a polymorphic object. It can’t really be called here, and you’re missing the arguments (i.e. a Fortran descriptor and the intrinsic type), which in turn is causing the segv.
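
For illustration (a hypothetical sketch, not code from the thread): an unlimited polymorphic object that takes on an intrinsic 64-bit integer dynamic type is the kind of Fortran that would involve such a “set intrinsic type” runtime call, assuming the _i8 suffix refers to integer(8).

program poly_sketch
  ! Hypothetical example: allocating an unlimited polymorphic object from an
  ! integer(8) source gives it an intrinsic dynamic type, which the runtime
  ! must record in the object's descriptor (assumption based on the _i8 suffix).
  implicit none
  class(*), allocatable :: obj

  allocate(obj, source=123456789_8)   ! dynamic type becomes integer(8)

  select type (obj)
  type is (integer(8))
    print *, "obj holds an integer(8):", obj
  end select
end program poly_sketch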

This is why I would need to know which EarthWorks component the error is coming from, so I can see if I can build and run it to reproduce the issue. I have access to several of these, like MPAS and CAM, though I didn’t think either of them uses polymorphic types. So it’s likely coming from one of the others.

(it’s not open source after all :)

Yes and no. We open sourced our Fortran front-end to LLVM, and it’s what’s used by Flang. We’re also working with the LLVM community on a new Fortran front-end compiler named F18. We use LLVM for the back-end compiler as well. You’re correct that our current Fortran runtime is not open, though it will be (or at least a version derived from it will be) once F18 is released.

Ok. I built Earthworks 2.4.001 with the following dependencies:
openmpi/5.0.7
gcccore/13.3
nvhpc/25.1
cuda/12.6.2
hwloc/2.10.0
ucx/1.16.0
libfabric/1.21.0
pmix/5.0.2
prrte/3.0.5
ucc/1.3.0
gdrcopy/2.4.1
nccl/2.22.3
flexiblas/3.4.4
blis/1.0
perl/5.36.1
libxml/2.0208
python/3.11.5
cmake/3.31.0
szip/2.1.1
hdf5/1.13.2
pnetcdf/1.14.0
netcdf-c/4.9.2
netcdf-fortran/4.6.1
parallelio/2.6.2
esmf/8.7.0

If ESMF is not already built on your system, let me know, and I can send you more info on building that.

If you unpack the Earthworks code to my_earthworks_sandbox, you will have to set up the environment for your machine in my_earthworks_sandbox/ccs_config/machines/${MachineName}, where $MachineName is the name of your machine. This folder should contain three files: config_batch.xml, config_machines.xml, and nvhpc_${MachineName}.cmake. If you have previously set up CESM, then the process is similar (as Earthworks is based on CESM), and you can look at examples from other machines in my_earthworks_sandbox/ccs_config/machines.

To set up, build and run a case…

cd my_earthworks_sandbox
cime/scripts/create_newcase --run-unsupported --compiler nvhpc --mpilib openmpi --res ne30pg3_ne30pg3_mg17 --compset F2000dev --case $CasePath --ngpus-per-node 4 --gpu-type a100 --gpu-offload openacc --machine $MachineName
cd $CasePath
./xmlchange JOB_WALLCLOCK_TIME=1:00:00 --subgroup case.run
./xmlchange DOUT_S=FALSE
./xmlchange CALENDAR=GREGORIAN
./xmlchange STOP_N=1
./xmlchange NTASKS=36
./xmlchange NTASKS_ESP=1
./case.setup
./case.build
./case.run

In the create_newcase command above, adjust the number of GPUs per node as needed. Also, I am using NTASKS=36 above just to be sure you don’t have issues with input datasets, but you could use a different value if you know what you’re doing.

If the build process fails because phys_control.mod can’t be found, add phys_control.o to line 568 of my_earthworks_sandbox/cime/CIME/Tools/Makefile and rebuild.

$CasePath/CaseStatus will show you the status of the run. If the run fails, this file will indicate the path of the log file containing the error. In my case, the segmentation fault occurred in cesm.log.* within the run directory. So the error isn’t coming from one of the individual model components, but rather from the parent process for the fully coupled model.

Ok, I’ll give it a try. I’ve built CESM before so know the basics, it just takes some effort getting all the dependencies built. Might take me some time.

Though do you know which component shows the segv? Again, I have access to several of these, like MPAS and CAM, so it might be easier if try to recreate it outside of EarthWorks first.

@MatColgrove The segv is occurring in CESM, which is the parent process for the fully coupled model. The segv is not occurring in an individual model component.

Hi again Mat - I have conducted a number of additional tests, and the failures are following the pattern found on NCAR’s Derecho system, as documented here:

That is, the failures are occurring just with configurations that have atmosphere coupled to land. On Derecho, atmosphere+land configurations produce memory leaks (rather than a segfault) when using nvhpc 24.7 through 24.11, but no such issues arise when using the intel compiler.

The workaround on Derecho has been to revert to NVHPC 24.3. I have asked the admins of the system I’m using (narval) if they are able to set up NVHPC 24.3 or older with openmpi5. But if you end up finding another workaround/solution, please let me know.

Thanks for the update Neil.

I’ve reached out to folks who do support for NCAR to see if they can help in narrowing down the issue so we can report it to our compiler team.

I don’t see anything similar in the stand-alone CAM or MPAS-A we test here, but it’s likely workload dependent so that’s not surprising.

Also if NCAR hasn’t done so already, you might have them reach out through their NVIDIA support channels so it can be escalated.

Indeed, our tests with standalone atmosphere (CAM/MPAS) and standalone land (CLM) run fine with NVHPC. It’s when atmosphere is coupled to land that the model does not get along with recent NVHPC versions.

Are you running CAM within CESM? If so, running CAM coupled to CLM would just involve changing the compset name.

Hi,
I’m an analyst working with the supercomputer where Neil runs his code and had some time to debug it. I was able to narrow it down to passing an array slice to MPI_IRecv (I got a better backtrace via:

export NVCOMPILER_TERM=debug
export NVCOMPILER_TERM_DEBUG="gdb -quiet -pid %d -x file.txt"

where file.txt just has

bt
quit

).

I then reproduced it with a much smaller program:

$ cat simple.f90
program main

  use mpi
  integer ierr, rank
  integer status(MPI_STATUS_SIZE)
  real data(2)

  data = 0
  call mpi_init(ierr)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
  if (rank == 0) then
     call mpi_send(data(1:2), 2, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
  else
     call mpi_recv(data(1:2), 2, MPI_REAL, 0, 0, MPI_COMM_WORLD, status, ierr)
  endif
  call mpi_finalize(ierr)
end program main

This gives the same segfault with NVHPC 25.1 and Open MPI 5.0.7 (run with mpirun -n 2 ./simple).
Notes:

  • I reproduced this with a very basic Open MPI 5.0.7 compiled with
./configure --prefix=$HOME/nvhpc/ompi5 CC=nvc FC=nvfortran CXX=nvc++
make
make install
  • the error does not occur with the precompiled Open MPI 4.1.7 shipped with NVHPC
  • the error does not occur if I replace data(1:2) with data
  • the error does not occur if you use the old include "mpif.h" instead of use mpi
  • the error also occurs if you use use mpi_f08 with the necessary type adjustments (see the sketch after this list)
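
For reference, a minimal sketch of what that mpi_f08 variant looks like (reconstructed for illustration, not the exact code used); the only changes needed are the derived types provided by the mpi_f08 module.

program main
  ! Same reproducer as above, rewritten against the mpi_f08 bindings:
  ! MPI_Status is now a derived type, and the array slice data(1:2) is
  ! still what triggers the segfault (per the notes above).
  use mpi_f08
  implicit none
  integer :: ierr, rank
  type(MPI_Status) :: status
  real :: data(2)

  data = 0
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  if (rank == 0) then
     call MPI_Send(data(1:2), 2, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
  else
     call MPI_Recv(data(1:2), 2, MPI_REAL, 0, 0, MPI_COMM_WORLD, status, ierr)
  end if
  call MPI_Finalize(ierr)
end program main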

So indeed, this could be a bug in either NVHPC, Open MPI, or some missing compilation/configure flag I’m not aware of?

Excellent! Thanks Bart! It’s very much appreciated that you were able to reduce this down to a simple test case.

In looking through our issue reports, I do see several related with OpenMPI 5 which are actively being worked on.

I added this issue as TPR #37257.

-Mat

Just to update: NCAR folks were able to reproduce the segfault on their systems using nvhpc/25.1 and openmpi5. But when they use nvhpc/25.1 with cray-mpich/8.1.29, runs complete successfully. This suggests that this issue is a bug with openmpi.

Hi Bart, Neil,

Engineering let me know that TPR #37257 was fixed in our 25.5 release, and I verified that the example test case works with Open MPI 5.0.

When you get a chance, can you try building EarthWorks with 25.5?

Thanks,
Mat

Engineering also noted that we have a second issue open for you, a memory leak (TPR #37441), which we’re still working on. Hopefully a fix for this will be available in the near future.
