Error with 12.3, not with 12.2. Bug?

Dear PGI support,

My company recently bought a PGI license after some internal evaluations. I am happy with the compiler and the overall suite (especially OpenACC!). I ran some tests myself by compiling my application with PGI 12.2 and had no problems. However, the version we installed after getting the license is 12.3, and with it I get an error at run time during an I/O operation. GDB reports this:


[fspiga@gemini1 PW-AUSURF112]$ gdb …/espresso/bin/pw.x core.44246
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright © 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type “show copying”
and “show warranty” for details.
This GDB was configured as “x86_64-redhat-linux-gnu”.
For bug reporting instructions, please see:

Reading symbols from /ichec/home/staff/fspiga/QE/espresso/bin/pw.x…done.
[New Thread 44246]
Missing separate debuginfo for
Try: yum --disablerepo=’’ --enablerepo=’-debuginfo’ install /usr/lib/debug/.build-id/15/aeeb89cdee58e81ee8e0ccc5f7c79dac280dcf
Reading symbols from /lib64/libpthread.so.0…(no debugging symbols found)…done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/librt.so.1…(no debugging symbols found)…done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libm.so.6…(no debugging symbols found)…done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libc.so.6…(no debugging symbols found)…done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2…(no debugging symbols found)…done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `…/espresso/bin/pw.x -input ausurf_gamma.in’.
Program terminated with signal 11, Segmentation fault.
#0 0x000000351347a7cd in realloc () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6_2.5.x86_64
(gdb) debuginfo-install glibc-2.12-1.47.el6_2.5.x86_64
Undefined command: “debuginfo-install”. Try “help”.
(gdb) bt
#0 0x000000351347a7cd in realloc () from /lib64/libc.so.6
#1 0x0000000001d8571a in fr_init ()
#2 0x0000000001d81919 in pgf90io_fmtr_init2003 ()
#3 0x0000000000825b26 in iotk_getline_x (unit=4,
line=’ ’ , ‘\000’ , ‘?\004\000\000\000\000\000\000???’, ‘\000’ , ‘?\017\000\000\000\000\000\000<PP_INFO>’, ’ ’ , ‘\000’ , ‘?=\016??\177\000\000?\a?\001’, ‘\000’ , ’ E\016??\177\000\000\004J\020??\177\000\000?"n\002\000\000\000\0000?\016??\177\000\000???\001’, ‘\000’ , ’ E\016??\177\000\000TA\016??\177\000\000?“n\002’, ‘\000’ , ‘tE\016??\177\000\0000C\016??\177\000\000i??’…, length=2305, ierr=0) at ./iotk_scan.F90:1016
#4 0x0000000000824562 in iotk_scan_tag_x (unit=4, direction=1, control=0, tag=’ ’ , binary=.FALSE., stream=.FALSE., ierr=0) at ./iotk_scan.F90:670
#5 0x0000000000825466 in iotk_scan_x (unit=4, direction=1, control=2, name=‘PP_INFO\000’, ’ ’ , ‘\000’,
attr=’\000’ , ‘v\222?\0225’, ‘\000’ , '?R\017??\177\000\000\000\000\000\000\000\000\000\000v\222?\0225\000\000\000\006\000\000\000\000\000\000\000\020S\017??\177\000\000fII”\000\000\000\000\020S\017??\177\000\000\006\000\000\000\000\000\000\000H?\022\026?*\000\000o<\224|\000\000\000\000\207\233?\0225’, ‘\000’ , ‘<\eD\0235\000\000\000/\000\000\0005\000\000\00003@\0235’, ‘\000’ , ‘\220T\017??\177\000\000H<@\0235\000\000\000??@\0235’, ‘\000’ , ‘\210\021?\0225\000\000\000H?U\0235\000\000\000\b?’…, binary=.FALSE., stream=.FALSE., found=.FALSE., ierr=0) at ./iotk_scan.F90:898
#6 0x0000000000822ff5 in iotk_scan_end_x (unit=4, name=‘PP_INFO\000’, ’ ’ , dummy=Cannot access memory at address 0x0
) at ./iotk_scan.F90:331
#7 0x000000000081cb6c in iotk_close_read_x (unit=4, dummy=Cannot access memory at address 0x0
) at ./iotk_files.F90:832
#8 0x000000000065b385 in read_upf_v2_module::read_upf_v2 (u=4, upf=Asked for position 0 of stack, stack only has 0 elements on it.
) at ./read_upf_v2.F90:56
#9 0x00000000006930f7 in upf_module::read_upf (upf=Asked for position 0 of stack, stack only has 0 elements on it.
) at ./upf.F90:64
#10 0x000000000064cf0a in read_pseudo_mod::readpp (input_dft=‘none’, ’ ’ , printout=Cannot access memory at address 0x0
) at ./read_pseudo.F90:150
#11 0x0000000000435018 in iosys () at ./input.F90:1267
#12 0x000000000040331a in pwscf () at ./pwscf.F90:53
#13 0x00000000004031f4 in main ()
#14 0x000000351341ecdd in __libc_start_main () from /lib64/libc.so.6
#15 0x00000000004030e9 in _start ()

In detail, frame #3:

(gdb) frame 3
#3 0x0000000000825b26 in iotk_getline_x (unit=4,
line=’ ’ , ‘\000’ , ‘?\004\000\000\000\000\000\000???’, ‘\000’ , ‘?\017\000\000\000\000\000\000<PP_INFO>’, ’ ’ , ‘\000’ , ‘?=\016??\177\000\000?\a?\001’, ‘\000’ , ’ E\016??\177\000\000\004J\020??\177\000\000?“n\002\000\000\000\0000?\016??\177\000\000???\001’, ‘\000’ , ’ E\016??\177\000\000TA\016??\177\000\000?“n\002’, ‘\000’ , ‘tE\016??\177\000\0000C\016??\177\000\000i??’…, length=2305, ierr=0) at ./iotk_scan.F90:1016
1016 read(unit,”(a)”,iostat=iostat,eor=1,size=buflen,advance=“no”) buffer
(gdb) list
1011 logical :: eor
1012 pos = 0
1013 ierrl=0
1014 do
1015 eor = .true.
1016 read(unit,"(a)",iostat=iostat,eor=1,size=buflen,advance=“no”) buffer
1017 3 continue
1018 eor = .false.
1019 if(iostat/=0) then
1020 call iotk_error_issue(ierrl,“iotk_getline”,“iotk_scan.f90”,964)

My first attempt to understand the problem was

(gdb) print unit
$2 = 4

and the PGI Fortran Reference Guide at page 336 reports

Logical units 5 (stdin) and 6 (stdout) are line buffered. Logical unit 0 (stderr) is unbuffered. Disk files are fully buffered.

so perhaps the 4 should be a 5? I am trying to figure out where the unit number “4” comes from.

Using other compilers (like Intel) or, as I said, PGI compiler versions below 12.3, this problem does not appear.

Do you have any suggestion to solve it?

Hi fspiga,

Using other compilers (like Intel) or, as I said, PGI compiler versions below 12.3, this problem does not appear.

It’s very possible that it’s a compiler error in 12.3, but it could also be a problem with the program. I can’t really tell which from the GDB output.

Do you have any suggestion to solve it?

First, I’d run the code using the PGI debugger, PGDBG. GDB doesn’t understand Fortran so some of the information presented may be misleading.

Also, it looks like you’re running PWSCF? Do you know the version? Which workload are you running? If I can recreate the problem here, I’ll be able to determine if the problem is with the program or with the compiler.

- Mat

Hi mkcolg,


Yes, it is PWscf (repository version). I am using a very short input test (AUSURF54).

In the code I also tried replacing “unit=4” with “unit=5” or “unit=1234”, but the problem persists. I am going to use PGDBG as you suggested to track the error in more detail.

many thanks in advance for your reply.

Cheers,
F.

Hi F,

I downloaded espresso 4.3.2 from http://qe-forge.org/frs/?group_id=10 along with the corresponding examples. I then built PW and ran it against the examples. The only failures were due to missing data files.

However, I don’t see an input file called “AUSURF54”. Can you point me to this input?

Also, which repository version are you using? I see both a GPU enabled branch (espresso-PRACE) and the main trunk.

- Mat

Hi Mat,

The input file is here:
http://www.fislab.disco.unimib.it/~filippo/PW-AUSURF54.tar.gz

I am using the code in the repository. You can download it by doing
$ svn checkout svn://scm.qe-forge.org/scmrepos/svn/q-e/trunk/espresso

I tried PGDBG. I am not an expert with this debugger, but I think it reports the same error at the same level of detail as GDB. Here is the output:

pgdbg> debug …/espresso/bin/pw.x -input ausurf_gamma.in
Loaded: /ichec/home/staff/fspiga/QE/PW-AUSURF54/…/espresso/bin/pw.x
MAIN_
pgdbg> run
libnuma.so.1 loaded by ld-linux-x86-64.so.2.
libpthread.so.0 loaded by ld-linux-x86-64.so.2.
librt.so.1 loaded by ld-linux-x86-64.so.2.
libm.so.6 loaded by ld-linux-x86-64.so.2.
libc.so.6 loaded by ld-linux-x86-64.so.2.

Program PWSCF v.4.99 starts on 14Apr2012 at 17:26:39

This program is part of the open-source Quantum ESPRESSO suite
for quantum simulation of materials; please cite
"P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
URL > http://www.quantum-espresso.org> ",
in publications or presentations arising from this work. More details at
http://www.quantum-espresso.org/quote.php

Serial multi-threaded version, running on 8 processor cores

Current dimensions of program PWSCF are:
Max number of different atomic species (ntypx) = 10
Max number of k-points (npk) = 40000
Max angular momentum in pseudopotentials (lmaxx) = 3
Reading input from ausurf_gamma.in
Warning: card &IONS ignored
Warning: card ION_DYNAMICS = ‘NONE’ ignored
Warning: card / ignored
Warning: card &CELL ignored
Warning: card CELL_DYNAMICS = ‘NONE’ ignored
Warning: card / ignored
Message from routine iosys:
minimal I/O required, wf_collect reset to FALSE
Signalled SIGSEGV at 0x351347A7CD, function __GI___libc_realloc, file interp.c
0x351347A7CD: 48 8B 47 F8 movq -8(%rdi),%rax

pgdbg> stacktrace

STACK TRACE:
#12 pwscf line: “pwscf.F90”@44 address: 0x403483
#11 iosys line: “input.F90”@1255 address: 0x42E72C
#10 readpp line: “read_pseudo.F90”@148 address: 0x5E9143
input_dft = 0x1E3CAF0, ERROR: Cannot read value at address 0x0.
printout =
#9 read_upf line: “upf.F90”@62 address: 0x62969E
upf = 0x356D2B0, grid = 0x356CFF0, ierr = 0, unit = 4, filename = 0x0
#8 read_upf_v2 line: “read_upf_v2.F90”@54 address: 0x5F62E1
u = 4, upf = 0x356D2B0, grid = 0x356CFF0, ierr = 0
#7 iotk_close_read_x line: “iotk_files.F90”@832 address: 0x798A50
unit = 4, dummy = 0x0, ierr = 0
#6 iotk_scan_end_x line: “iotk_scan.F90”@331 address: 0x79ECBB
unit = 4, name = 0x7FFFFE060FA0, dummy = 0x0, ierr = 0
#5 iotk_scan_x line: “iotk_scan.F90”@897 address: 0x7A1455
unit = 4, direction = 1, control = 2, name = 0x7FFFFE050F00, attr = 0x7FFFFE051000, binary = .FALSE., stream = .FALSE., found = .FALSE., ierr = 0
#4 iotk_scan_tag_x line: “iotk_scan.F90”@669 address: 0x7A02EE
unit = 4, direction = 1, control = 538976288, tag = 0x7FFFFE040E60, binary = .FALSE., stream = .FALSE., ierr = 0
#3 iotk_getline_x line: “iotk_scan.F90”@1014 address: 0x7A1B9B
unit = 4, line = 0x7FFFFE03FC80, length = 0, ierr = 0
#2 pgf90io_fmtr_init2003 address: 0x1CFEE39
*** Stack frames number 2 and higher may be incorrect ***
#1 fr_init file: fmtread.c address: 0x1D02C3A
***FP, local variables, and args, for frame numbers 1 and higher may be incorrect ***
=> #0 __GI___libc_realloc file: interp.c address: 0x351347A7CD

(I ran the debugger in text mode, since the system seems to be missing a library/program called “xrefresh”.)


I am working on the GPU port of the code in that branch. The branch is 100% compatible with version 4.3.2 but not with the current trunk (they are not aligned and there are some differences; we are going to merge them as soon as a new version is finalized). With the code in the GPU branch, the same problem appears. Older PGI compilers work well; 12.3 has the same issue.

Many thanks in advance for your support!

Cheers,
F.

I have a short follow-up. I tried with PGI 12.4 and the error still persists.

I can write to the User Support providing a link to the code, instructions to compile and an example. It is not necessary to have MPI, the serial code shows the same behavior (and crash).

F.

Hi fspiga,

My fault, I meant to get back to you on this but didn’t. When I tried to do an svn checkout, the server refuses my connection.

$ svn checkout svn://scm.qe-forge.org/scmrepos/svn/q-e/trunk
svn: Can’t connect to host ‘scm.qe-forge.org’: Connection refused

I noticed that espresso 5.0 is available. Does the error occur there?

- Mat

The anonymous access should work, this sounds weird. I am going to check tomorrow.

Anyhow, yes, the problem also happens with 5.0. I ran a test this morning. You can download the file “espresso-5.0.tar.gz” here: http://qe-forge.org/frs/?group_id=10&release_id=116

About the test case, this:
http://www.fislab.disco.unimib.it/~filippo/PW-AUSURF54.tar.gz
is enough to reproduce the problem.

Many thanks in advance!
F.

Hi fspiga,

I downloaded espresso-5.0 and was able to recreate the segv in iotk_scan.f90. The error occurs with or without optimization in a call to a non-advancing read. I did notice that if I add “-D__IOTK_WORKAROUND1” and use the alternate advancing read, then the program runs correctly.
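One way to apply that define without editing the iotk sources is to append it to the preprocessor flags used by the build. The sketch below assumes those flags live on a `DFLAGS = ...` line of a `make.sys`-style file. That layout and the helper name `add_define` are assumptions for illustration, not something confirmed in this thread, and the demo runs on a throwaway file rather than a real build tree.

```shell
#!/bin/sh
# Hypothetical helper: append a preprocessor define to the DFLAGS line
# of a make.sys-style file (the file layout is an assumption).
add_define() {
    file=$1
    define=$2
    # Skip the edit if the define already appears anywhere in the file.
    if grep -qF -- "$define" "$file"; then
        return 0
    fi
    # Rewrite the DFLAGS line with the define appended, then replace the file.
    sed "s/^DFLAGS[[:space:]]*=.*/& $define/" "$file" > "$file.new" \
        && mv "$file.new" "$file"
}

# Demo on a temporary file that stands in for the real configuration.
tmp=$(mktemp)
printf 'DFLAGS = -D__PGI -D__FFTW\nFFLAGS = -fast\n' > "$tmp"
add_define "$tmp" -D__IOTK_WORKAROUND1
result=$(grep '^DFLAGS' "$tmp")
echo "$result"
rm -f "$tmp"
```

After a change like this, the iotk library has to be rebuilt from clean so that iotk_scan.f90 is recompiled with the alternate advancing read.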

I’m still not sure if it’s a problem with our I/O run time library or a program error. I did note that the size returned from the read was 2046 while the buffer size is only 1024. Though, increasing the buffer size did not help. I’ll continue looking.

- Mat

I was not aware of this flag (-D__IOTK_WORKAROUND1), but I am going to investigate too.

Many thanks in advance for your help!

Dear Mat,
The workaround works on my workstation and on the Linux cluster, but it does not solve another problem, and honestly it is not clear to me whether that one is related directly to PGI or to the CRAY environment.

In fact, the compiler crashes at this point:

make[2]: Entering directory `/ufs/home/users/xxxxxx/espresso-PRACE/S3DE/iotk/src'
ftn -fast -Mcache_align -r8 -Mpreprocess -mp -D__PGI -D__FFTW -D__CUDA -D__GPU_NVIDIA_20 -D__PHIGEMM -D__CUDA_QE_TIMING -D__OPENMP -D__IOTK_WORKAROUND1 -I../include -I/home/users/xxxxxx/espresso-PRACE/phiGEMM/include -I/opt/nvidia/cuda/4.0.17a/include -c iotk_print_kinds.f90
pgf90-Fatal-/opt/pgi/12.3.0/linux86-64/12.3/bin/pgf901 TERMINATED by signal 11
Arguments to /opt/pgi/12.3.0/linux86-64/12.3/bin/pgf901
/opt/pgi/12.3.0/linux86-64/12.3/bin/pgf901 iotk_print_kinds.f90 -opt 2 -terse 1 -inform warn -nohpf -nostatic -x 19 0x400000 -quad -x 59 4 -x 59 4 -x 15 2 -x 49 0x400004 -x 51 0x20 -x 57 0x4c -x 58 0x10000 -x 124 0x1000 -x 57 0xfb0000 -x 58 0x78031040 -x 70 0x6c00 -x 47 0x400000 -x 48 4608 -x 49 0x100 -x 120 0x200 -stdinc /opt/pgi/12.3.0/linux86-64/12.3/include:/usr/local/include:/usr/lib64/gcc/x86_64-suse-linux/4.3/include:/usr/lib64/gcc/x86_64-suse-linux/4.3/include:/usr/include -def unix -def __unix -def __unix__ -def linux -def __linux -def __linux__ -def __NO_MATH_INLINES -def __x86_64__ -def __LONG_MAX__=9223372036854775807L -def '__SIZE_TYPE__=unsigned long int' -def '__PTRDIFF_TYPE__=long int' -def __THROW= -def __extension__= -def __amd64__ -def __SSE__ -def __MMX__ -def __SSE2__ -def __SSE3__ -def __SSE4A__ -def __ABM__ -idir ../include -idir /home/users/xxxxxx/espresso-PRACE/phiGEMM/include -idir /opt/nvidia/cuda/4.0.17a/include -idir /opt/nvidia/cuda/4.0.17a/include -idir /opt/cray/udreg/2.3.1-1.0400.4264.3.1.gem/include -idir /opt/cray/ugni/2.3-1.0400.4374.4.88.gem/include -idir /opt/cray/dmapp/3.2.1-1.0400.4255.2.159.gem/include -idir /opt/cray/gni-headers/2.1-1.0400.4351.3.1.gem/include -idir /opt/cray/xpmem/0.1-2.0400.31280.3.1.gem/include -idir /opt/cray/pmi/3.0.1-1.0000.8917.33.1.gem/include -idir /opt/acml/4.4.0/pgi64_fma4_mp/include -idir /usr/include/alps -def __PGI -def __FFTW -def __CUDA -def __GPU_NVIDIA_20 -def __PHIGEMM -def __CUDA_QE_TIMING -def __OPENMP -def __IOTK_WORKAROUND1 -def __CRAYXE -def __CRAYXT_COMPUTE_LINUX_TARGET -def __TARGET_LINUX__ -freeform -preprocess -vect 48 -y 54 1 -mp -x 69 0x200 -x 69 0x400 -x 53 2 -quad -x 119 0x10000000 -quad -x 119 0x10000000 -x 124 0x8 -x 124 0x80000 -mp -x 69 0x200 -x 69 0x400 -modexport /tmp/pgf90aHSdanmMFMVc.cmod -modindex /tmp/pgf90aHSda1T3AEc3.cmdx -output /tmp/pgf90aHSdaLpKZHL8.ilm
make[2]: *** [iotk_print_kinds.o] Error 127
make[2]: Leaving directory `/ufs/home/users/xxxxxx/espresso-PRACE/S3DE/iotk/src'
make[1]: *** [libiotk] Error 2
make[1]: Leaving directory `/ufs/home/users/xxxxxx/espresso-PRACE/extlibs’
make: *** [libiotk] Error 2

The CRAY system is a XK6. These modules are loaded:

Currently Loaded Modulefiles:

  1. modules/3.2.6.6
  2. nodestat/2.2-1.0400.31264.2.5.gem
  3. sdb/1.0-1.0400.32124.7.19.gem
  4. MySQL/5.0.64-1.0000.5053.22.1
  5. lustre-cray_gem_s/1.8.6_2.6.32.45_0.3.2_1.0400.6453.5.1-1.0400.32127.1.90
  6. udreg/2.3.1-1.0400.4264.3.1.gem
  7. ugni/2.3-1.0400.4374.4.88.gem
  8. gni-headers/2.1-1.0400.4351.3.1.gem
  9. dmapp/3.2.1-1.0400.4255.2.159.gem
  10. xpmem/0.1-2.0400.31280.3.1.gem
  11. hss-llm/6.0.0
  12. Base-opts/1.0.2-1.0400.31284.2.2.gem
  13. xtpe-network-gemini
  14. PrgEnv-pgi/4.0.46
  15. xt-mpich2/5.5.0.6
  16. atp/1.4.4
  17. xt-asyncpe/5.11.13
  18. pmi/3.0.1-1.0000.8917.33.1.gem
  19. xt-totalview/8.10.0
  20. totalview-support/1.1.3
  21. pgi/12.3.0
  22. pbs/10.4.0.101257
  23. xtpe-interlagos
  24. cuda/4.0.17a
  25. acml/4.4.0

I have already notified CRAY User Support about this issue, but I suspect they will forward it to PGI because it looks like a compiler issue. Do you have any idea how to tweak the FTN wrapper to change or remove the options that might be breaking the compilation?

Many thanks in advance again!

Dear Mat,

This is just a follow-up so that you can close this issue internally. I discovered that the problem was caused by one of the components of the CRAY environment.

After unloading “atp” and “hss-llm” PGI does not complain anymore and the code compiles without problems!
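For anyone hitting the same compiler crash, the fix amounts to something like the following before running make. This is only a sketch: it assumes the standard Environment Modules `module` command, the module names are the two from the list above, and the `modules_checked` variable is just an illustrative marker.

```shell
# Sketch: drop the two Cray modules named above before rebuilding.
# `module` comes from Environment Modules and is only defined on systems that use it.
if type module >/dev/null 2>&1; then
    module unload atp hss-llm   # the two modules that triggered the pgf901 crash here
    module list                 # check that neither is still loaded
    modules_checked=yes
else
    echo "module command not found; nothing to unload on this machine"
    modules_checked=no
fi
```

The rest of the environment (PrgEnv-pgi, pgi/12.3.0, cuda, acml) was left as listed.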

Cheers,