I am running a model built using OpenMPI 5.0.7 and NVHPC 25.1. It builds successfully, but I get the following runtime error:
[ng10503:3038313:0:3038313] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x8a0aec8)
==== backtrace (tid:3038331) ====
0 0x0000000000038790 __sigaction() ???:0
1 0x00000000003b26cb pgf90_set_intrin_type_i8() ???:0
The backtrace indicates the fault is occurring in pgf90_set_intrin_type_i8(), so I am wondering if this is a compiler issue. The error occurs for both CPU runs and GPU runs, and it also occurs when using NVHPC 25.1 with OpenMPI 5.0.3.
This error does not occur for CPU runs with OpenMPI 4.1.5 and NVHPC 23.9. However, I am unable to perform GPU runs with that combination because of limited GPU support in OpenMPI 4.x, so I am attempting to work with more recent versions of NVHPC and OpenMPI.
Chris let me know that one of our colleagues found a bug in 5.0.6 which causes build failures, which is why he wanted to get more info about your build in the previous post.
Now this is a runtime failure and you're using 5.0.7, so I'd be inclined to say it's a compiler issue, but given it only occurs in 5.0.7, it could be an OpenMPI issue. Difficult to say without a way to reproduce the error.
As I mentioned in my OP, the error occurs with both OpenMPI 5.0.3 and 5.0.7.
The climate model I am running is Earthworks, which has lots of dependencies. I can walk you through setting it up if you want to try to reproduce the error on your end.
Let me see if Chris can get me a build of OpenMPI 5 to run against my MPI benchmarks to see if anything pops up. If I can't recreate it there, I'll then try to build Earthworks.
Ok, thanks for running those tests. But before we try to compile Earthworks on your system, I was thinking of trying to run a simple program that reproduces the segmentation fault. So I created the following C program:
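Something along these lines (a minimal sketch; since I don't know the real interface, the no-argument prototype and the link line in the comment are just guesses):

/* repro.c - attempt to call the libnvf routine that appears in the backtrace.
 * Built with something like:
 *   nvc repro.c -o repro -L<nvhpc_install>/Linux_x86_64/25.1/compilers/lib -lnvf
 */
#include <stdio.h>

/* Guessed prototype: the real interface is not published. */
extern void pgf90_set_intrin_type_i8(void);

int main(void)
{
    printf("calling pgf90_set_intrin_type_i8...\n");
    pgf90_set_intrin_type_i8();   /* segfaults at runtime */
    printf("returned\n");
    return 0;
}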
It compiles successfully but produces a segmentation fault at runtime. That could just be because pgf90_set_intrin_type_i8 hasn't been declared or used correctly. So is my declaration/usage of pgf90_set_intrin_type_i8 correct? I do not have access to the source code that generated libnvf.so (it's not open source after all :), so I don't know the interface for pgf90_set_intrin_type_i8. Based on the contents of Linux_x86_64/25.1/compilers/lib/libnvf.ipl in the nvhpc installation package, the source for libnvf.so appears to be src/libpgf90.c.
"pgf90_set_intrin_type_i8" is a Fortran runtime call used to set the type of a polymorphic object. It can't really be called directly like this, and you're missing the arguments (i.e. a Fortran descriptor and the intrinsic type), which in turn is causing the segv.
This is why I would need to know which component of EarthWorks the error is coming from, and then I'll see if I can build and run it to reproduce the problem. I have access to several of these, like MPAS and CAM, though I don't think either of those uses polymorphic types. So it's likely coming from one of the others.
(it's not open source after all :)
Yes and no. We open-sourced our Fortran front-end to LLVM, and it's what's used by Flang. We're also working with the LLVM community on a new Fortran front-end compiler named F18. We use LLVM for the back-end compiler as well. You're correct, though, that our current Fortran runtime is not open, but it will be (or at least derived from an open one) once F18 is released.
Ok. I built Earthworks 2.4.001 with the following dependencies:
openmpi/5.0.7
gcccore/13.3
nvhpc/25.1
cuda/12.6.2
hwloc/2.10.0
ucx/1.16.0
libfabric/1.21.0
pmix/5.0.2
prrte/3.0.5
ucc/1.3.0
gdrcopy/2.4.1
nccl/2.22.3
flexiblas/3.4.4
blis/1.0
perl/5.36.1
libxml/2.0208
python/3.11.5
cmake/3.31.0
szip/2.1.1
hdf5/1.13.2
pnetcdf/1.14.0
netcdf-c/4.9.2
netcdf-fortran/4.6.1
parallelio/2.6.2
esmf/8.7.0
If ESMF is not already built on your system, let me know, and I can send you more info on building that.
If you unpack the Earthworks code to my_earthworks_sandbox, you will have to set up the environment for your machine in my_earthworks_sandbox/ccs_config/machines/${MachineName}, where $MachineName is the name of your machine. This folder should contain three files: config_batch.xml, config_machines.xml, and nvhpc_${MachineName}.cmake. If you have previously set up CESM, then the process is similar (as Earthworks is based on CESM), and you can look at examples from other machines in my_earthworks_sandbox/ccs_config/machines.
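For example, with a hypothetical machine name of mymachine, the layout would be:

my_earthworks_sandbox/ccs_config/machines/mymachine/
    config_batch.xml
    config_machines.xml
    nvhpc_mymachine.cmake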
In the create_newcase command above, adjust the number of GPUs per node as needed. Also, I am using NTASKS=36 above just to be sure you don't have issues with input datasets, but you could use a different value if you know what you're doing.
If the build process fails because phys_control.mod can't be found, add phys_control.o to line 568 of my_earthworks_sandbox/cime/CIME/Tools/Makefile and rebuild.
$CasePath/CaseStatus will show you the status of the run. If the run fails, this file will indicate the path of the log file containing the error. In my case, the segmentation fault occurred in cesm.log.* within the run directory. So the error isn't coming from one of the individual model components, but rather from the parent process for the fully coupled model.
Ok, I'll give it a try. I've built CESM before, so I know the basics; it just takes some effort to get all the dependencies built. It might take me some time.
Though do you know which component shows the segv? Again, I have access to several of these, like MPAS and CAM, so it might be easier if I try to recreate it outside of EarthWorks first.
@MatColgrove The segv is occurring in CESM, which is the parent process for the fully coupled model. The segv is not occurring in an individual model component.
Hi again Mat - I have conducted a number of additional tests, and the failures are following the pattern found on NCAR's Derecho system, as documented here:
That is, the failures are occurring just with configurations that have atmosphere coupled to land. On Derecho, atmosphere+land configurations produce memory leaks (rather than a segfault) when using NVHPC 24.7 through 24.11, but no such issues arise when using the Intel compiler.
The workaround on Derecho has been to revert to NVHPC 24.3. I have asked the admins of the system I'm using (narval) if they are able to set up NVHPC 24.3 or older with openmpi5. But if you end up finding another workaround/solution, please let me know.
Indeed, our tests with standalone atmosphere (CAM/MPAS) and standalone land (CLM) run fine with NVHPC. It's when atmosphere is coupled to land that the model does not get along with recent NVHPC versions.
Are you running CAM within CESM? If so, running CAM coupled to CLM would just involve changing the compset name.
Hi,
I'm an analyst working with the supercomputer where Neil runs his code and had some time to debug it. I was able to narrow it down to passing an array slice to MPI_IRecv (I got a better backtrace via:
Just to update: NCAR folks were able to reproduce the segfault on their systems using nvhpc/25.1 and openmpi5. But when they use nvhpc/25.1 and cray-mpich/8.1.29, runs complete successfully. This suggests that the issue is a bug in OpenMPI.
They also noted that we have a second issue for you, a memory leak (TPR #37441), that we're still working on. Hopefully a fix for this will be available in the near future.