SIGSEGV: 11

Dear Portland Group

I have been using the Portland Group Compiler
(pgf90) for quite some time now. Recently we decided to upgrade the compiler
from version 4.0-2 to the newer version 5.2-4, in order to compile and
execute an atmospheric model (Eta/NCEP model). The model is fully
parallelized and runs using MPICH-1.2.0.

However, there is a major problem. When I compile the model using the 5.2
compiler (no errors during compilation), the execution begins normally, but
when the time comes to write the first output file, the following error is
shown (the p4_error lines) and the program crashes:

.
.
.
CALL MPI_ISEND… 4130350 0
CALL MPI_ISEND… 4069612 0
CALL MPI_ISEND… 4108003 0
leaving CHKOUT!!!
leaving CHKOUT!!!
leaving CHKOUT!!!
EBU: TIMESTEP NTSD= 41 FCST TIME= 3600.0
leaving CHKOUT!!!
num_procs = 2
task id, jsta, end = 0 0 1
task id, jsta, end = 1 2 3
ME, MY_ISD,MY_IED,MY_JSD,MY_JED = 0 1 135
1 109
jsta_i,jend_i,jsta_im,jend_im,jsta_im2,jend_im2= 1 107
2 107 3 107
ihour in quilt = 0
Writing in …/…/output/RUN/v_out.000.dat
Writing in …/…/output/RUN/v_out.000.init
Writing in …/…/output/RUN/v_out.000.datFIELDS
Writing in …/…/output/RUN/v_out.000.ground
Writing in …/…/output/RUN/v_out.000.Ddat

ihour in quilt = 0
p4_11721: p4_error: interrupt SIGSEGV: 11
rm_l_4_11738: p4_error: interrupt SIGINT: 2
num_procs = 2
ME, MY_ISD,MY_IED,MY_JSD,MY_JED = 1 1 135
106 213
jsta_i,jend_i,jsta_im,jend_im,jsta_im2,jend_im2= 108 213
108 212 108 211

p5_11741: p4_error: Found a dead connection while looking for messages: 4
rm_l_5_11758: p4_error: interrupt SIGINT: 2
p1_11661: p4_error: Found a dead connection while looking for messages: 4

MYPE in calculation of max: 2
P: 2 Size: 1Dust Load(gr/m2)= 0.048253 at 62 2
P: 2 Size: 2Dust Load(gr/m2)= 0.121718 at 62 2
P: 2 Size: 3Dust Load(gr/m2)= 0.066811 at 62 2
P: 2 Size: 4Dust Load(gr/m2)= 0.004337 at 62 1
P: 2 Total Dust Load(gr/m2)= 0.240015 at 62 2
P: 2 Size: 1Dust Dep(mgr/m2)= 3.00 at 61 5
P: 2 Size: 2Dust Dep(mgr/m2)= 15.77 at 55 4
p2_11681: p4_error: Found a dead connection while looking for messages: 1
rm_l_2_11698: p4_error: interrupt SIGINT: 2
rm_l_1_11678: p4_error: interrupt SIGINT: 2
2.68 at 65 8
P: 3 Size: 3Dust Dep(mgr/m2)= 7.99 at 65 8
P: 3 Size: 4Dust Dep(mgr/m2)= 8.04 at 64 16
P: 3 Total Dust dep(mgr/m2)= 24.64 at 65 8
TSHLTR initially: 295.2062
TSHLTR becoming: 293.3899
p3_11701: p4_error: Found a dead connection while looking for messages: 1
rm_l_3_11718: p4_error: interrupt SIGINT: 2
P4 procgroup file is /mnt/space17/cspir/fine/worketa_all/eta/runs/machines.
bm_list_11658: p4_error: net_recv read: probable EOF on socket: 1
2 Size: 3Dust Dep(mgr/m2)= 12.23 at 55 4
P: 2 Size: 4Dust Dep(mgr/m2)= 3.07 at 59 1
P: 2 Total Dust dep(mgr/m2)= 30.00 at 55 4
TSHLTR initially: 284.2811
TSHLTR becoming: 284.3562
.
.
.
.

The model is run on 4 processors with the following specs (all the same):

CPU: Dual Xeon 3.2GHz (64-bit)
Memory: 2GB
Linux Distribution : Fedora Core 4 - Kernel 2.6.16 - 32-bit

I am also sending you the options we use:

LIBS = -L/usr/local/mpich-1.2.0/lib -lmpich -lfmpich -lmpichf90
FFLAGS = -fast -DLITTLE -lc -lgcc_eh -Wl,-static

I have tried a number of different options (including no options at all) and
the same thing happens. However, when I compile the model using the 4.0
compiler (with the same options), everything works fine and the program
executes with no errors!
I also tried “ulimit -s unlimited” and the result is still the same.
I also tried using a newer version of MPICH (1.2.7p1) and it still crashes.

Can you please help me? Is there an option or something I can use to fix this?

Thank you for your time,
Christos

Hi Christos,

I’m not sure, but isn’t Eta/NCEP a data model that’s used with MM5 and WRF, rather than something that is compiled itself? If I’m wrong about that, can you post a link to where the code you’re using is located, along with instructions on how to recreate the error, so I can investigate the issue? I might also need a data set, since I have seen past MM5 problems which were data dependent.

Note that you will need to recompile NetCDF (if used) and may need to recompile MPICH if the F90 modules are used by your program. F90 modules are generally not compatible from one major release to another.

Also, I want to confirm that while you are running on a 64-bit system, the compilers and OS you are using are 32-bit.

- Mat

Found it! It was an array going out of bounds. Apparently the 5.2 version of pgf90 is stricter with array bounds. The option -Mnobounds does the trick! Thank you for your interest anyway!

Hi Christos,

Array bounds errors can be extremely difficult bugs, since they can cause random failures. Disable bounds checking only if you’re convinced that the error is harmless. Just because it “works” doesn’t mean your application won’t fail or give wrong answers later.
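
As a rough sketch of how to track the error down instead of hiding it, a debug build could turn bounds checking on rather than off. This assumes the same makefile variables as in the original post. Dropping -fast and adding -g are assumptions made here for easier debugging, not something the thread confirms was tried.

```make
# Hypothetical debug variant of the FFLAGS line from the original post.
# -g        keep debug symbols so the failing source line can be identified
# -Mbounds  enable pgf90 run-time array bounds checking (the opposite of -Mnobounds)
# -fast is left out on the assumption that an unoptimized build is easier to debug.
FFLAGS = -g -Mbounds -DLITTLE -lc -lgcc_eh -Wl,-static
```

With a build like this, the run should stop at the first out-of-range subscript and report it, which points to the statement that actually needs fixing.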

- Mat