Segmentation Fault

Hi all!

We run an operational weather forecasting model on a Cray XT4 (Catamount target, cross-compiled on a Linux host), and lately we’ve been getting rather strange segmentation faults with our code. Here is one example…

----- DEBUG: PCB, CONTEXT, STACK TRACE ---------------------

PROCESSOR [ 0]
log_nid = 276 phys_nid = 0x198 host_id = 8 host_pid = 1948
group_id = 31329 num_procs = 804 rank = 276 local_pid = 3
base_node_index = 0 last_node_index = 401

text_base = 0x00000000200000 text_len = 0x00000001000000
data_base = 0x00000001200000 data_len = 0x00000008a00000
stack_base = 0x0000007ec00000 stack_len = 0x00000001000000
heap_base = 0x00000009e00000 heap_len = 0x00000032600000

ss = 0x000000000000001f fs = 000000000000000000 gs = 0x0000000000000017
rip = 0x00000000003d3547
rdi = 0x000000007ffffe35 rsi = 0x00000000000001cb rbp = 0x000000007fffffff
rsp = 0x000000007fbf7bc0 rbx = 0x0000000080000002 rdx = 0x0000000000000005
rcx = 0x0000000000000071 rax = 0x000000000000001b cs = 0x000000000000001f
R8 = 0x000000007ffffcbf R9 = 0x00000000000001e7 R10 = 0x000000000fe80210
R11 = 0x0000000080000000 R12 = 0x0000000000000005 R13 = 0x000000007ffffe8a
R14 = 0x0000000009490a1c R15 = 0x000000000fe80210
rflg = 0x0000000000010a12 prev_sp = 0x000000007fbf7bc0
error_code = 4

SIGNAL #[11][Segmentation fault] fault_address = 0x000000040fe7f660


Stack Trace: ------------------------------
#0 0x00000000003d3547 in numeric_utilities_interpol_sl_tricubic_()

… the code runs more than 10 times a day, usually without any such problem. The routine where the code fails is a rather complex tricubic interpolation…

wx = - w(i,j,k,1)
wy = - w(i,j,k,2)
wz = - w(i,j,k,3)

ap1z = ap1(wz)
a0z = a0 (wz)
am1z = am1(wz)
am2z = am2(wz)

! tricubic interpolation of the transported field
s_new(i,j,k) = &
& ap1(wx) * ( ap1(wy) * ( ap1z * s_old(ip1, jp1, kp1 ) &
& + a0z * s_old(ip1, jp1, k0 ) &
& + am1z * s_old(ip1, jp1, km1 ) &
& + am2z * s_old(ip1, jp1, km2 ) ) &
& + a0 (wy) * ( ap1z * s_old(ip1, j0 , kp1 ) &
& + a0z * s_old(ip1, j0 , k0 ) &
& + am1z * s_old(ip1, j0 , km1 ) &
& + am2z * s_old(ip1, j0 , km2 ) ) &
& + am1(wy) * ( ap1z * s_old(ip1, jm1, kp1 ) &
& + a0z * s_old(ip1, jm1, k0 ) &
& + am1z * s_old(ip1, jm1, km1 ) &
& + am2z * s_old(ip1, jm1, km2 ) ) &
& + am2(wy) * ( ap1z * s_old(ip1, jm2, kp1 ) &
& + a0z * s_old(ip1, jm2, k0 ) &
& + am1z * s_old(ip1, jm2, km1 ) &
& + am2z * s_old(ip1, jm2, km2 ) ) )

s_new(i,j,k) = s_new(i,j,k) + &
& a0 (wx) * ( ap1(wy) * ( ap1z * s_old(i0 , jp1, kp1 ) &
& + a0z * s_old(i0 , jp1, k0 ) &
& + am1z * s_old(i0 , jp1, km1 ) &
& + am2z * s_old(i0 , jp1, km2 ) ) &
& + a0 (wy) * ( ap1z * s_old(i0 , j0 , kp1 ) &
& + a0z * s_old(i0 , j0 , k0 ) &
& + am1z * s_old(i0 , j0 , km1 ) &
& + am2z * s_old(i0 , j0 , km2 ) ) &
& + am1(wy) * ( ap1z * s_old(i0 , jm1, kp1 ) &
& + a0z * s_old(i0 , jm1, k0 ) &
& + am1z * s_old(i0 , jm1, km1 ) &
& + am2z * s_old(i0 , jm1, km2 ) ) &
& + am2(wy) * ( ap1z * s_old(i0 , jm2, kp1 ) &
& + a0z * s_old(i0 , jm2, k0 ) &
& + am1z * s_old(i0 , jm2, km1 ) &
& + am2z * s_old(i0 , jm2, km2 ) ) )

s_new(i,j,k) = s_new(i,j,k) + &
& am1(wx) * ( ap1(wy) * ( ap1z * s_old(im1, jp1, kp1 ) &
& + a0z * s_old(im1, jp1, k0 ) &
& + am1z * s_old(im1, jp1, km1 ) &
& + am2z * s_old(im1, jp1, km2 ) ) &
& + a0 (wy) * ( ap1z * s_old(im1, j0 , kp1 ) &
& + a0z * s_old(im1, j0 , k0 ) &
& + am1z * s_old(im1, j0 , km1 ) &
& + am2z * s_old(im1, j0 , km2 ) ) &
& + am1(wy) * ( ap1z * s_old(im1, jm1, kp1 ) &
& + a0z * s_old(im1, jm1, k0 ) &
& + am1z * s_old(im1, jm1, km1 ) &
& + am2z * s_old(im1, jm1, km2 ) ) &
& + am2(wy) * ( ap1z * s_old(im1, jm2, kp1 ) &
& + a0z * s_old(im1, jm2, k0 ) &
& + am1z * s_old(im1, jm2, km1 ) &
& + am2z * s_old(im1, jm2, km2 ) ) )

s_new(i,j,k) = s_new(i,j,k) + &
& am2(wx) * ( ap1(wy) * ( ap1z * s_old(im2, jp1, kp1 ) &
& + a0z * s_old(im2, jp1, k0 ) &
& + am1z * s_old(im2, jp1, km1 ) &
& + am2z * s_old(im2, jp1, km2 ) ) &
& + a0 (wy) * ( ap1z * s_old(im2, j0 , kp1 ) &
& + a0z * s_old(im2, j0 , k0 ) &
& + am1z * s_old(im2, j0 , km1 ) &
& + am2z * s_old(im2, j0 , km2 ) ) &
& + am1(wy) * ( ap1z * s_old(im2, jm1, kp1 ) &
& + a0z * s_old(im2, jm1, k0 ) &
& + am1z * s_old(im2, jm1, km1 ) &
& + am2z * s_old(im2, jm1, km2 ) ) &
& + am2(wy) * ( ap1z * s_old(im2, jm2, kp1 ) &
& + a0z * s_old(im2, jm2, k0 ) &
& + am1z * s_old(im2, jm2, km1 ) &
& + am2z * s_old(im2, jm2, km2 ) ) )

! Tendency ds/dt:
! sten(i,j,k) = ( s_new(i,j,k) - s_old(i,j,k) ) / 2.0 / dt

… since the code is rather highly optimized…

FTNOPTS = -Mpreprocess -O3 -Mvect=sse -Mvect=noassoc
-Kieee -Mscalarsse -Mcache_align
-Mflushz -Mlre -Mprefetch -Mbyteswapio -Mipa=fast
-I/opt/xt-mpt/default/mpich2-64/P2/include

…it is not possible to locate the exact line of the error. Also, since the code takes 30 minutes to execute on ~800 processors, running it without optimization and with debugging options in a debugger is not really a great option. Strangely, the segfault only occurs sporadically and not reproducibly. I’ve tried playing around with stack sizes (in the yod call), but I don’t have enough experience to do this knowledgeably, and my attempts were unsuccessful.
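A cheap way to narrow this down without giving up the optimized build might be an explicit index guard in front of the interpolation. The sketch below is only an illustration: the helper `in_bounds`, the module name and the array extents are hypothetical, and how `im2`, `im1`, `i0` and `ip1` are actually computed in the model is not shown in this post.

```fortran
! Sketch only: in_bounds and the array extents are hypothetical, not the model's code.
module stencil_guard
  implicit none
contains
  ! True when every index in idx lies within [lo, hi].
  pure logical function in_bounds(idx, lo, hi)
    integer, intent(in) :: idx(:), lo, hi
    in_bounds = all(idx >= lo) .and. all(idx <= hi)
  end function in_bounds
end module stencil_guard

program demo_guard
  use stencil_guard
  implicit none
  real(kind=8) :: s_old(0:9, 0:9, 0:9)   ! hypothetical extents
  integer :: im2, im1, i0, ip1
  logical :: ok

  s_old = 0.0d0

  ! A well-formed stencil in the first dimension.
  im2 = 3; im1 = 4; i0 = 5; ip1 = 6
  ok = in_bounds((/ im2, im1, i0, ip1 /), lbound(s_old, 1), ubound(s_old, 1))
  print *, 'valid stencil in bounds:  ', ok

  ! An index close to huge(0), the magnitude seen in r13 in the dump.
  i0 = huge(0) - 373
  im2 = i0 - 2; im1 = i0 - 1; ip1 = i0 + 1
  ok = in_bounds((/ im2, im1, i0, ip1 /), lbound(s_old, 1), ubound(s_old, 1))
  print *, 'corrupt stencil in bounds:', ok
end program demo_guard
```

In the real routine the same test would be repeated for the j and k indices, and a failing point would write i, j, k and the MPI rank before aborting. That pins down the offending grid point even at -O3. The compiler's own bounds checking (-Mbounds with PGI) does this globally, at a higher runtime cost.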

I would be glad for any suggestions on how to go about this problem. If you need any additional information, don’t hesitate to ask.

Thanks and kind regards from sunny Switzerland,
Oliver

pgf90 -V
pgf90 7.0-3 64-bit target on x86-64 Linux

I forgot to mention one more thing. Whenever the code segfaults, it does so at exactly the same location, in the tricubic interpolation routine, somewhere within the lines of code I posted in the original post. This indicates that there is really something fishy with the code and that the issue is not related to individual nodes crashing or misbehaving.

Thanks,
Oli

Hi again,

I just realized that one of the input files was changing as a function of time. Now I’ve isolated a run which reproducibly segfaults, and I have a core file. I’ve inspected the core file (of the optimized code) and can extract the following info…

Disassembly around the instruction where the segfault occurs (0x003d3547, the mulsd instruction; to my knowledge this is an SSE2 multiplication of two double-precision floats):

0x003d3534: 0x44
0x003d3535: 0x0f
0x003d3536: 0x13
0x003d3537: 0xa4
0x003d3538: 0x24
0x003d3539: 0xe8
0x003d353a: 0x00
0x003d353b: 0x00
0x003d353c: 0x00
0x003d353d: 0x66 movlpd %xmm8,208(%rsp)
0x003d353e: 0x44
0x003d353f: 0x0f
0x003d3540: 0x13
0x003d3541: 0x84
0x003d3542: 0x24
0x003d3543: 0xd0
0x003d3544: 0x00
0x003d3545: 0x00
0x003d3546: 0x00
0x003d3547: 0xf2 mulsd (%r10,%r13,8),%xmm4
0x003d3548: 0x43
0x003d3549: 0x0f
0x003d354a: 0x59
0x003d354b: 0x24
0x003d354c: 0xea
0x003d354d: 0x41 movl %edi,%r13d

Registers for the frame:

%rax: 0x0000001b (27)
%rdx: 0x00000010 (16)
%rcx: 0x0000007c (124)
%rbx: 0x80000002 (2147483650)
%rsi: 0x000001cb (459)
%rdi: 0x7ffffe35 (2147483189)
%rbp: 0x7fffffff (2147483647)
%rsp: 0x7fbf7bc0 (2143255488)
%r8: 0x7ffffcca (2147482826)
%r9: 0x000001e7 (487)
%r10: 0x0fec3640 (267138624)
%r11: 0x80000000 (2147483648)
%r12: 0x00000010 (16)
%r13: 0x7ffffe95 (2147483285)
%r14: 0x00000243 (579)
%r15: 0x0fec3640 (267138624)
%ra: 0x0000001f (31)
%ss: 0x00000017 (23)
%ds: 0x4000089665 (274878469733)
%es: 0x4000089620 (274878469664)
%fs: 0x00000000 (0)
%gs: 0x00000000 (0)
%eflags: 0x00010a06 (IOPL=0,PF+IF+OF+RF)
%rip: 0x003d3547 (numeric_utilities_interpol_sl_tricubic_+0x747)
%fs_base: 0x40005ff768 (274884196200)
%gs_base: 0x00000001 (1)
%temp: 0xf0f0f0f0f00ff0f0 (-1085102592585895696)

Floating point registers:

%st0: 0
%st1: 0
%st2: 0
%st3: 0
%st4: 0
%st5: -0
%st6: 0.994934887168946
%st7: 1.28245142253167e+15
%fpcr: 0x000e (RC=RN+PC=SGL+EM=(OM+ZM+DM))
%fpsr: 0x0000 (TOP=0+EF=())
%fptag: 0x0003 (0:E,1:V,2:V,3:V,4:V,5:V,6:V,7:V)
%fpop: 0x0000
%fpi: 0x00000040005ff730
%fpd: 0x0000000000000000
%mxcsr: 0x005ff740 (RC=RZ+FZ+DAZ+EM=(PM+OM+ZM+DM)+EF=())
%mxcsr_mask: 0x0000004f (RC=RN+DAZ+EM=()+EF=(OE+ZE+DE+IE))
%xmm0_l: 0x00000040005ff938
%xmm0_h: 0x0000000000000000
%xmm1_l: 0x000000400008bd30
%xmm1_h: 0x00000040000834c0
%xmm2_l: 0x0000004000037e5f
%xmm2_h: 0x0000000000000000
%xmm3_l: 0x0000004000021fe9
%xmm3_h: 0x000000000000003d
%xmm4_l: 0x00000040005ff928
%xmm4_h: 0x000000400008353d
%xmm5_l: 0x0000004000083500
%xmm5_h: 0x0000004000083500
%xmm6_l: 0x0000004000029862
%xmm6_h: 0x0000000000000000
%xmm7_l: 0x0000000a00000000
%xmm7_h: 0x0000004000082f40
%xmm8_l: 0x0000000500000005
%xmm8_h: 0x00000040005ff850
%xmm9_l: 0x0000000000000000
%xmm9_h: 0x0000004300000401
%xmm10_l: 0xffffffffffffffff
%xmm10_h: 0x0000000000000000
%xmm11_l: 0x00000040005ffa08
%xmm11_h: 0x000000400008347a
%xmm12_l: 0x00000040005ff9a8
%xmm12_h: 0x0000000000000000
%xmm13_l: 0x00000040000891ba
%xmm13_h: 0x00000000000891ba
%xmm14_l: 0x0000000030789862
%xmm14_h: 0x0000000a00000000
%xmm15_l: 0x0000000100000000
%xmm15_h: 0x0000000000000000

Hi Oli,

From your register dump, r13 looks suspicious to me; it is rather large. Compiling with -gopt might help you narrow down the line number.
Can you reduce the optimization flags, for example by dropping -Mprefetch and -Mipa=fast?
Did you try 7.1-5? If there is a bug in 7.0-3, we may have fixed it in the latest release.

Hongyon
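
The r13 observation can be checked against the first dump in this thread. The faulting instruction, mulsd (%r10,%r13,8),%xmm4, reads 8 bytes at r10 + 8*r13. The small program below (a sketch; the program and function names are made up, the constants are copied from the first dump) recomputes that address and compares it with the reported fault_address and with the heap region from the PCB.

```fortran
program check_fault_address
  use iso_fortran_env, only: int64
  implicit none
  ! Values copied from the first register dump and PCB in this thread.
  integer(int64), parameter :: r10        = int(z'000000000FE80210', int64)
  integer(int64), parameter :: r13        = int(z'000000007FFFFE8A', int64)
  integer(int64), parameter :: fault_addr = int(z'000000040FE7F660', int64)
  integer(int64), parameter :: heap_base  = int(z'0000000009E00000', int64)
  integer(int64), parameter :: heap_len   = int(z'0000000032600000', int64)
  integer(int64) :: ea

  ! Effective address of the memory operand (%r10,%r13,8): base + index * 8.
  ea = r10 + 8_int64 * r13

  print '(a, z16.16)', 'effective address = 0x', ea
  print '(a, z16.16)', 'fault_address     = 0x', fault_addr
  print '(a, l1)', 'addresses match:  ', ea == fault_addr
  print '(a, l1)', 'r10 inside heap:  ', in_region(r10, heap_base, heap_len)
  print '(a, l1)', 'ea inside heap:   ', in_region(ea, heap_base, heap_len)

contains

  ! True when addr lies in the half-open range [base, base + length).
  pure logical function in_region(addr, base, length)
    integer(int64), intent(in) :: addr, base, length
    in_region = (addr >= base) .and. (addr < base + length)
  end function in_region

end program check_fault_address
```

With these values the computed address equals the reported fault_address, the base in r10 lies inside the heap, and the computed address lies far beyond it. So the base pointer looks sane and the index in r13 (0x7ffffe8a, which is 373 below the largest 32-bit signed integer) is what pushes the load out of range. That points to an array index rather than a stack-size problem, although the dump alone does not show how the index was produced.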