Segmentation fault--reason unknown

I am having trouble running a PGI-compiled Fortran program on a Cray XT4. This is an MPI program compiled with PGI 7.1.4, 7.2.2 and 7.2.3. The code hits a segmentation fault in a subroutine. When compiled with -O0 and -g, the core dump points to an ENDIF in the subroutine. Around the ENDIF there are only simple assignment statements, and I can’t find anything wrong with them.

The subroutine takes 103 arguments, of which around 70 are double precision arrays with 1400 up to 5900 elements each (depending on the number of processes). The total size of the arguments is a few MB per process, so it does not look like a memory overflow. It also does not look like a memory violation, as all arguments are properly declared.

The code runs successfully when compiled with gcc on the same XT4 machine.

With these observations, I can’t find what causes the segfault. Does anybody have any idea of the possible causes? Is there any situation in which PGI-compiled code would segfault where gcc-compiled code does not? Your ideas are appreciated.

Hi yus,

I’m not sure, but I wonder if it’s a stack overflow. Try setting your stack size to unlimited and see if that works around the problem.

  • Mat

Hi Mat,

I set ulimit -s unlimited -c unlimited. The code still has the same segmentation fault.

Hi yus,

Without looking at the code and running it through PGDBG, it’s very difficult for me to know for sure what’s going on. But some things to look for are writing off the end of an array or, in the case of automatics or pointers, a compiler temporary array being given the wrong array subsection and overwriting protected memory. In either case, different compilers may exhibit different behavior depending upon how memory is laid out. Try adding “-Mbounds” to check for array bounds violations. It doesn’t catch all cases, but it won’t hurt to try.

Also, does the error occur even with a single thread? If the error only occurs with multiple threads, then the “ENDIF” error could be a red herring and the seg fault is occurring in a different thread. If this is the case, try using the MPICH which accompanies the PGI compilers and use PGDBG to debug the application (please refer to the PGI Tools guide on how to use PGDBG with MPI applications). This may give better insight into where the actual error occurs.

  • Mat

Hi Mat and other experts,

I have some new findings from debugging the seg fault. It seems to be a problem with the optimization of array operations. The seg fault doesn’t appear if the code is compiled with -g -O0. However, it occurs when compiled with -O2. The fault happens in nested loops as follows:

DO I=1, L
   ! a 1D array initialization here like V=0. where V(1,L)
   DO J=1,M
      ! also some scalar and 1D array initializations here
      DO K=1,N
         ! Some scalar, 1D and 3D array computations here. The array subscripts depend on I, J, and K
      END DO
   END DO
END DO

The strange thing is that the seg fault doesn’t appear (even when compiled with -O2) when I replace the outer loops over I and J with a single value each, as follows:

I=1
! initialization for index I=1 only
J=1
! initialization for index J=1 only
DO K=1, N
   ! The same scalar, 1D and 3D array computations for I,J=1 and K=1,N
END DO

However, the seg fault comes back if I set the middle loop J to a single iteration:

I=1
! initialization
DO J=1,1
   ! initialization
   DO K=1, N
      ! Scalar and array computations
   END DO
END DO

I wonder if the fault is caused by the optimization of array operations. Does anybody have any experience with or advice on a problem like this?

Thanks.

Hi yus,

Can you please send an example that reproduces the problem to PGI customer support at trs@pgroup.com?

Thanks,
Mat

Hi Mat,

The “bug” happens in a subroutine, but the whole package contains 800+ source files and is 400MB as a compressed tar. I wonder how I can pass such a big package to the PGI team.

Cheers.

We have our ways ;-). Just send a note to trs@pgroup.com and we’ll give you instructions on how to upload files to our ftp server. Be sure to include instructions on how to build and run your application and include any needed data files.

Thanks,
Mat

Hi Mat,

I’ve got the instructions from trs@pgroup.com about uploading files. However, I failed to ftp the file to your server, with the error:

553 Could not create file.

I tried to ftp to the server and subdirectory from three computers (Windows and Linux). All showed this error. I have contacted trs@pgroup.com about the problem but haven’t got an answer. Do you know what’s wrong?

Thanks.

Hi yus,

I’ve received the source and will begin looking at it today or early next week.

Thanks,
Mat

Hi Mat,

Any luck tracking down the possible cause of the segmentation fault in the code I submitted?

Cheers.

Hi yus,

Sorry this one has taken me so long. I have been working on it as a background task for the past few weeks.

What’s happening is that the derived type pointer called “density” is being hoisted out of the loop. The descriptor for “density” is being set to NULL somehow, so when it’s dereferenced it causes a seg fault. Note that while density’s value is NULL, its descriptor should be a legal value.

I’m currently trying to determine if the stack is being corrupted (for example by an array bounds error), causing the descriptor to be set to zero. This might be a red herring, but that’s what I’ll be investigating today.

FYI, you should take a look at the actual arguments you’re passing. You map RMEM(RPT) to VecY and VecZ, BIGM to C1T, C2T, C3T, and BIGM. This is illegal Fortran and can cause inconsistent results. Fortran allows the compiler to assume arrays are disjoint and hence reorder some operations. Since C1T, C2T, C3T are all accessing the same array, you can get different answers depending on the order of operations. I don’t think it’s causing the seg fault, but it is something to be aware of.
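To make the aliasing point concrete, here is a minimal sketch with made-up names (add_shifted, add_shifted_in_place and buf are hypothetical and are not from the submitted code). It assumes the simplest form of the problem: the same array passed as two actual arguments while one of the dummies is modified. The commented-out call is the non-conforming one; copying to a temporary first is one conforming alternative.

```fortran
module alias_sketch
  implicit none
contains

  ! Neither dummy has the TARGET or POINTER attribute, so the compiler
  ! is allowed to assume that x and y do not overlap and may reorder
  ! or vectorize the loop accordingly.
  subroutine add_shifted(x, y, n)
    integer, intent(in) :: n
    double precision, intent(in)    :: x(n)
    double precision, intent(inout) :: y(n)
    integer :: i
    do i = 2, n
       y(i) = y(i) + x(i-1)
    end do
  end subroutine add_shifted

  ! Conforming way to update buf from its own old values:
  ! pass a private copy as the read-only argument.
  subroutine add_shifted_in_place(buf, n)
    integer, intent(in) :: n
    double precision, intent(inout) :: buf(n)
    double precision :: old(n)      ! automatic array holding the old values
    old = buf
    ! Non-conforming alternative (do NOT do this):
    !    call add_shifted(buf, buf, n)
    ! The result would depend on the order in which the compiler
    ! chooses to execute the loop iterations.
    call add_shifted(old, buf, n)
  end subroutine add_shifted_in_place

end module alias_sketch
```

With buf = (1, 2, 3, 4), the conforming version always adds the old left neighbour to each element. The aliased call could instead pick up values that were already updated, depending on how the loop is optimized.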

  • Mat

Thanks Mat, you rock! I will check these points.

Hi yus,

Below is the portion of the code that’s causing the problem. I’ve changed most of the names since this is a public forum.

          if (guard) then
             MYARR(:,:,gi)=get_values( &
                  density=get_val(density,i),&
                  topdis=get_val(topdis,ele),&
                  botdis=get_val(botdis,ele),&
                  d=eget_val(d,ele))
          end if

In order to pass the results of the “get_val” function to the “get_values” function, the compiler must first create temporary arrays, since “get_val” returns an array of REALs. At “-O2” the compiler determines that these temporary arrays need only be created once, so it hoists this code out of the DO loop.

In this case the size of the temporary array is determined by a value found in the derived type itself, for example: “density%mesh%shape%loc”. Normally since density is invariant, the hoist should still be ok, even with the guarded if statement. However, density is actually a pointer and unless the guard condition is true, is unassociated. So when density gets dereferenced to get the size of the temporary array, a seg fault occurs.
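As a rough illustration of the pattern only (this is not the actual code and is not a verified reproducer), the sketch below uses hypothetical names: field_t, nloc, val and guarded_sum are made up, and the derived type is far simpler than the real one. The point is that the size of the function result comes from a component of the pointer’s target, so it may only be read inside the guard.

```fortran
module hoist_sketch
  implicit none

  ! Hypothetical stand-in for the real derived type.
  type field_t
     integer :: nloc = 0
     double precision, pointer :: val(:) => null()
  end type field_t

contains

  ! Array-valued function: the size of the result (and therefore of the
  ! compiler temporary at the call site) is taken from f%nloc.
  function get_val(f) result(v)
    type(field_t), intent(in) :: f
    double precision :: v(f%nloc)
    v = f%val(1:f%nloc)
  end function get_val

  function guarded_sum(density, guard, nsteps) result(total)
    type(field_t), pointer :: density
    logical, intent(in) :: guard
    integer, intent(in) :: nsteps
    double precision :: total
    integer :: i
    total = 0d0
    do i = 1, nsteps
       if (guard) then
          ! density%nloc is only safe to read here. If the sizing of the
          ! temporary is moved above the loop, it is read even when guard
          ! is false and density is unassociated.
          total = total + sum(get_val(density))
       end if
    end do
  end function guarded_sum

end module hoist_sketch
```

A correct compiler has to leave the dereference of density under the guard, so calling guarded_sum with an unassociated pointer and guard set to .false. should simply return zero.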

This is a compiler bug and once I’m able to put together a smaller test case, I’ll create a technical problem report.

As for a workaround, you’ll need to compile this file at -O0. However, given your build process, you can also do something like the following:

          if (guard) then
             CALL KLUGE(density%mesh%shape%loc, topdis%mesh%shape%ngi, &
                  botdis%mesh%shape%ngi, d%mesh%shape%ngi)
             MYARR(:,:,gi)=get_values( &
                  density=get_val(density,i),&
                  topdis=get_val(topdis,ele),&
                  botdis=get_val(botdis,ele),&
                  d=eget_val(d,ele))
          end if

.... add to the bottom of the module
  SUBROUTINE KLUGE (A,B,C,D)
     INTEGER :: A,B,C,D
  END SUBROUTINE KLUGE

I’ll keep you informed of how progress goes on this bug. Hopefully it’s not too hard to fix and we can have it working correctly by January’s 8.0-3.

  • Mat

Hi Mat,

Many thanks for all your efforts in debugging such a sophisticated problem. That’s really brilliant!

All the best.

yus

FYI, this bug has been entered under TPR#15442.

  • Mat

Is there a place for users to see this report?

Unfortunately no. Please feel free to ping trs@pgroup.com for updates.

Hi Yus,

This problem (TPR#15442) has been fixed in the 8.0-3 release.

  • Mat

Mat,

Many thanks for all your help and the prompt update. That’s absolutely fantastic!

Best wishes.

Yus