segmentation fault in libpgc.so

I have a piece of code that is giving me a segmentation fault that I cannot track down. When I run it in the gdb debugger and do a backtrace after the segfault, I get:

#0 0x4014f1a6 in __utl_i_div64 () from /usr/pgi_old/linux86/5.2/lib/libpgc.so
#1 0x0807a03b in __hpf_i64toax ()
#2 0x0808c574 in conv_int8 ()
#3 0x0808c39a in __hpfio_fmt_i8 ()
#4 0x0808ba06 in __hpfio_default_convert ()
#5 0x08075798 in _f90io_ldw ()
#6 0x08075999 in pgf90io_ldw ()
#7 0x0804bf3a in clustering_module_kfof () at clustering.f:227

So the actual error is occurring in libpgc.so. I can’t find any documentation or any way to track this down further. Compiling with ‘-Mbounds’ reports nothing. I would not be surprised if the error results from some bug in my code, but since the actual segfault is in a PGI library, I don’t know how to proceed with debugging.

The ‘kfof’ subroutine is a recursive routine in a module called ‘clustering_module’ found in the source file ‘clustering.f’. By the time the code segfaults, it has already successfully called kfof recursively hundreds of thousands of times. In fact, given a slightly different data set to process, this code runs successfully with over 8M calls to this routine.

This code was compiled using pgf90 5.2-2 on a machine running Suse Linux - 2.6.8.1-suse91-i4smp

If anyone can help, I would be extremely appreciative.

Hi,

I am not really sure what the problem could be. Have you tried the latest release? What options do you use to compile?

Hongyon

Also, please make sure you are not getting a stack overflow, since you are making that many recursive calls.

Hongyon

The original post was from code compiled with simply ‘-g -O0 -Mbounds’.

I have installed the latest version of PGI (7.1-1) and compiled with ‘-g -O0 -Mbounds -Ktrap=fp -Mchkfpstk -Mchkstk -Mrecursive’. I get a very similar, but not identical error. The segfault occurs at roughly the same spot, and a backtrace gives me:

#0 0x4001f986 in __utl_i_div64 () from /home/pgi_new/linux86/7.1-1/lib/libpgc.so
#1 0x08086da0 in __hpf_i64toax ()
#2 0x08099225 in conv_int8 ()
#3 0x08099054 in _hpfio_fmt_i8 ()
#4 0x080986c2 in hpfio_default_convert ()
#5 0x08081825 in f90io_ldw ()
#6 0x08081a33 in pgf90io_ldw ()
#7 0x0804c69a in clustering_module_kfof () at clustering.f:230

#89465 0x0804dac6 in clustering_module_kfof () at clustering.f:303
#89466 0x08051449 in clustering_module_startfof () at clustering.f:581
#89467 0x08050460 in clustering_module_build_fof () at clustering.f:520
#89468 0x0804a312 in clustering_module_clustering () at clustering.f:114
#89469 0x0806b786 in dbgals () at dbgals.f:285
#89470 0x08049e9b in main ()

Currently, I cannot rule out that the underlying issue is a stack overflow. How might I investigate that possibility further?

Hi,

To remove the stack limit on a Suse system, type ‘unlimit’, then type ‘limit’ to check that it really took effect. This raises your stack size limit, but only as far as the hard limit allows.

Hongyon
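
For anyone reading this from sh or bash rather than csh/tcsh (where ‘limit’ and ‘unlimit’ are builtins), here is a rough sketch of the same check using ‘ulimit’. The function names are made up for illustration, and the executable name in the last comment is only a guess based on the source file name dbgals.f.

```shell
# Sketch for sh/bash users; 'limit' and 'unlimit' are csh/tcsh builtins.

# Print the soft and hard stack limits (in KB, or "unlimited").
show_stack_limits() {
    echo "soft: $(ulimit -S -s)"
    echo "hard: $(ulimit -H -s)"
}

# Run a command with the soft stack limit raised to the hard limit.
# The subshell keeps the change from leaking into the calling shell.
run_with_max_stack() {
    (
        ulimit -S -s "$(ulimit -H -s)" || exit 1
        "$@"
    )
}

show_stack_limits
# Hypothetical usage (executable name is an assumption):
#   run_with_max_stack ./dbgals
```

If the hard limit itself is too low, raising the soft limit will not be enough, and the hard limit would have to be changed by an administrator.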

Increasing the stack size seems to have done the trick. It seems that the code was segfaulting in the libpgc routines simply because the large number of recursive calls had brought it close to the stack limit, and it happened to hit the limit while inside the libpgc routines.

Thanks for the help.
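
As a rough sanity check on that explanation, the arithmetic below estimates the average stack cost per frame. It assumes the soft stack limit was 8192 KB, which is a common Linux default but is not stated anywhere in this thread, and it takes roughly 89470 frames from the second backtrace. The function names are hypothetical.

```shell
# Average stack bytes per frame: limit (KB) * 1024 / number of frames.
frame_bytes() {
    limit_kb=$1
    frames=$2
    echo $(( limit_kb * 1024 / frames ))
}

# Approximate number of frames that fit under a limit, given bytes per frame.
max_frames() {
    limit_kb=$1
    bytes_per_frame=$2
    echo $(( limit_kb * 1024 / bytes_per_frame ))
}

# ASSUMPTION: 8192 KB soft limit (not stated in the thread);
# 89470 is the outermost frame number in the second backtrace.
frame_bytes 8192 89470     # prints 93
```

Under that assumption each recursive call costs on the order of 90 bytes of stack, so a data set that recurses only slightly deeper could plausibly cross the limit at whatever call happens to be on top, including an I/O routine in libpgc.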