mpif90 seg faulting with default llvm 19.3

Hi,
I just installed PGI 19.3.
I set my bin and lib paths to point to the mpi library in

/pgi/linux86-64/2019/mpi/openmpi-3.1.3/

When I run mpif90 I get:

/usr/local/pgi/linux86-64/2019/mpi/openmpi-3.1.3/bin/mpif90: line 15: 12699 Segmentation fault      (core dumped) $MY_DIR/.bin/$EXE "$@"

If I instead set all my paths to the non-llvm directories:

 pgi/linux86-64-nollvm/2019/mpi/openmpi-3.1.3

then it works fine.
I noticed that the linux86-64 directory contains a lot of links to the linux86-64-llvm, so it seems to be an llvm+mpi issue?

Also, for the case that works, I am running into a strange compile flag problem. If I compile my OpenACC code with -acc it compiles fine.
But if I try to specify the cc or cuda version (or both), the compiler keeps saying my “ta” keywords are unknown:

mpif90 -O3 -ta=telsa:cc60 -Minfo=accel pgi_1710_bug_reproduce.f -o pgi_1710_bug_reproduce_acc1903_nollvm
pgfortran-Error-Switch -ta with unknown keyword telsa:cc60

This is strange since the compile options are identical to how I usually compile my OpenACC codes.

Thanks!

Hi Ron,

> I noticed that the linux86-64 directory contains a lot of links to the linux86-64-llvm, so it seems to be an llvm+mpi issue?

What’s your LD_LIBRARY_PATH env variable set to? Could you be picking up the wrong runtime libraries?

> pgfortran-Error-Switch -ta with unknown keyword telsa:cc60

Typo in your flag name. It’s “tesla”, not “telsa”.

-Mat

Hi,

My paths are set as follows:

export PATH=/usr/local/pgi/linux86-64/2019/bin:${PATH}
export LD_LIBRARY_PATH="/usr/local/pgi/linux86-64/2019/lib:${LD_LIBRARY_PATH}"

export PATH=/usr/local/pgi/linux86-64/2019/mpi/openmpi-3.1.3/bin:${PATH}
export LD_LIBRARY_PATH="/usr/local/pgi/linux86-64/2019/mpi/openmpi-3.1.3/lib:${LD_LIBRARY_PATH}"

export CUDA_HOME=/usr/local/pgi/linux86-64/2019/cuda/10.1
export LD_LIBRARY_PATH=/usr/local/pgi/linux86-64/2019/cuda/10.1/lib64:${LD_LIBRARY_PATH}

For the no-llvm mode, I simply replaced all “linux86-64” with “linux86-64-nollvm” above.

Sorry about the typo! Guess I should read what I am typing :)

Hi Ron,

Do you have a reproducer you could send me or any details on where the segv is occurring? Is it in the OpenMPI driver or the compiler?

Thanks,
Mat

This problem is still there in 19.4 as follows:

export PATH=/usr/local/pgi/linux86-64/2019/bin:${PATH}
export LD_LIBRARY_PATH="/usr/local/pgi/linux86-64/2019/lib:${LD_LIBRARY_PATH}"

# OpenMPI (PGI included) bin and run-time path:

export PATH=/usr/local/pgi/linux86-64/2019/mpi/openmpi-3.1.3/bin:${PATH}
export LD_LIBRARY_PATH="/usr/local/pgi/linux86-64/2019/mpi/openmpi-3.1.3/lib:${LD_LIBRARY_PATH}"

# Set CUDA path

export CUDA_HOME=/usr/local/pgi/linux86-64/2019/cuda/10.1
export LD_LIBRARY_PATH=/usr/local/pgi/linux86-64/2019/cuda/10.1/lib64:${LD_LIBRARY_PATH}



~ $ mpif90
/usr/local/pgi/linux86-64/2019/mpi/openmpi-3.1.3/bin/mpif90: line 15: 14268 Segmentation fault      (core dumped) $MY_DIR/.bin/$EXE "$@"

Hi Ron,

Unfortunately without a reproducing example, I can’t really tell what’s going wrong.

It’s possible that the error is occurring in the OpenMPI driver “opal_wrapper” which is what “$MY_DIR/.bin/$EXE” is. Though, I can’t be sure. If so, we’d need to coordinate with the OpenMPI folks since it’s not something we could fix.

Does the error occur if you use a different MPI implementation such as MVAPICH or MPICH?

-Mat

Hi,

I do not know how to make a “reproducer”.
This is an issue independent of any code.

I simply run the installation script for PGI 19.4 (saying yes to all questions) and set my environment variables as shown in my latest post, and then try to run “mpif90” which then gives the seg fault.

I am doing this on Linux Mint 19.1 (Ubuntu 18.04).

- Ron

Hmm, I wonder if it’s something as simple as you having the wrong runtime libraries in your LD_LIBRARY_PATH?

For example, if you are using the LLVM mpirun, but have your LD_LIBRARY_PATH set to use the non-LLVM runtime.

What’s LD_LIBRARY_PATH set to?
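
A quick way to eyeball this is to print LD_LIBRARY_PATH one entry per line and tag each entry by the PGI tree its name suggests. This is only a sketch: the helper name classify_ld_path is made up for illustration, and the "llvm" tag assumes the 19.x layout in which linux86-64 is a set of links to linux86-64-llvm.

```shell
#!/bin/sh
# classify_ld_path is a hypothetical helper, not a PGI tool.
# It splits a colon-separated library path and tags each entry
# by the PGI tree that its directory name suggests.
classify_ld_path() {
    printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r dir; do
        [ -n "$dir" ] || continue            # skip empty entries
        case "$dir" in
            */linux86-64-nollvm/*)              echo "nollvm  $dir" ;;
            */linux86-64-llvm/*|*/linux86-64/*) echo "llvm    $dir" ;;  # assumes 19.x links
            *)                                  echo "other   $dir" ;;
        esac
    done
}

classify_ld_path "${LD_LIBRARY_PATH:-}"
```

If both "llvm" and "nollvm" lines show up, the two runtimes are being mixed. Entries tagged "other" are worth a look too, since they can shadow libraries the PGI tools expect.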

-Mat

Hi,

See my post a couple of posts back in this thread.

How I set my LD_LIBRARY_PATH is shown there.

- Ron

Apologies that this is a bit frustrating, but since I can’t reproduce the issue here nor have we had any similar reports, I can only try to guess what the issue can be.

Since the directories under “/usr/local/pgi/linux86-64/” are just links, can you do a “ls -l” on this directory to make sure these are pointing to the LLVM compilers?

Alternatively, can you try setting the PATH and LD_LIBRARY_PATH to use “/usr/local/pgi/linux86-64-llvm/” to see if bypassing the links helps?
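
Besides “ls -l”, the final target of a link can be printed with “readlink -f”. A minimal sketch, where show_target is a made-up helper name and the path is the one used in this thread:

```shell
#!/bin/sh
# show_target is a hypothetical helper: it prints where a path finally
# resolves to, or reports that the path is missing or a dangling link.
show_target() {
    if [ -e "$1" ]; then
        printf '%s -> %s\n' "$1" "$(readlink -f "$1")"
    else
        printf '%s: missing or dangling link\n' "$1"
    fi
}

show_target /usr/local/pgi/linux86-64/2019
```

For a default install the resolved path would be expected to sit under linux86-64-llvm.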

Finally, let’s see which tool is actually causing the segv. Can you run your compile using:

sh -x /usr/local/pgi/linux86-64/2019/mpi/openmpi-3.1.3/bin/mpif90

I did a local installation of 19.4 here and see the following output:

/local/home/colgrove$ sh -x /local/home/colgrove/pgi/linux86-64/2019/mpi/openmpi-3.1.3/bin/mpif90
+ basename /local/home/colgrove/pgi/linux86-64/2019/mpi/openmpi-3.1.3/bin/mpif90
+ EXE=mpif90
+ readlink -f /local/home/colgrove/pgi/linux86-64/2019/mpi/openmpi-3.1.3/bin/mpif90
+ MY_PATH=/local/home/colgrove/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3/bin/env.sh
+ dirname /local/home/colgrove/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3/bin/env.sh
+ MY_DIR=/local/home/colgrove/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3/bin
+ readlink -f /local/home/colgrove/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3/bin/..
+ OMPI_ROOT=/local/home/colgrove/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3
+ export OPAL_PREFIX=/local/home/colgrove/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3
+ export MPILIBNAME=openmpi
+ export MPILIBVER=3.1.3
+ export MPIDIR=/local/home/colgrove/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3
+ /local/home/colgrove/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3/bin/.bin/mpif90
pgfortran-Warning-No files to process

-Mat

Hi,

My directory looks like this:

PREDSCI-GPU2: /usr/local/pgi/linux86-64 $ ls -l
total 40
drwxr-xr-x 25 sumseq sumseq 4096 Nov  6  2017 17.10
drwxr-xr-x 25 sumseq sumseq 4096 Nov  8  2017 17.9
drwxr-xr-x 26 sumseq sumseq 4096 May  1 15:31 18.1
drwxr-xr-x 26 sumseq sumseq 4096 May  1 15:31 18.10
drwxr-xr-x 26 sumseq sumseq 4096 May  1 15:31 18.3
drwxr-xr-x 26 sumseq sumseq 4096 May  1 15:31 18.4
drwxr-xr-x 26 sumseq sumseq 4096 May  1 15:31 18.7
lrwxrwxrwx  1 root   root     35 May  1 15:31 19.4 -> /usr/local/pgi/linux86-64-llvm/19.4
drwxr-xr-x  7 sumseq sumseq 4096 Nov  8  2017 2017
drwxr-xr-x  7 sumseq sumseq 4096 May  1 15:31 2018
lrwxrwxrwx  1 root   root     35 May  1 15:31 2019 -> /usr/local/pgi/linux86-64-llvm/2019
drwxr-xr-x  2 sumseq sumseq 4096 May  1 15:30 flexlm

Setting my ENV to point to the llvm directories directly does not fix the problem.

The command you requested yields:

PGI1904: /usr/local/pgi/linux86-64 $ sh -x /usr/local/pgi/linux86-64/2019/mpi/openmpi-3.1.3/bin/mpif90
+ basename /usr/local/pgi/linux86-64/2019/mpi/openmpi-3.1.3/bin/mpif90
+ EXE=mpif90
+ readlink -f /usr/local/pgi/linux86-64/2019/mpi/openmpi-3.1.3/bin/mpif90
+ MY_PATH=/usr/local/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3/bin/env.sh
+ dirname /usr/local/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3/bin/env.sh
+ MY_DIR=/usr/local/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3/bin
+ readlink -f /usr/local/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3/bin/..
+ OMPI_ROOT=/usr/local/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3
+ export OPAL_PREFIX=/usr/local/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3
+ export MPILIBNAME=openmpi
+ export MPILIBVER=3.1.3
+ export MPIDIR=/usr/local/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3
+ /usr/local/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3/bin/.bin/mpif90
Segmentation fault (core dumped)

This is quite strange. I also tried installing the compiler on another system we have here running Ubuntu 16.04 (my machine is Ubuntu 18.04) but get the same results.

I could try removing everything and starting from scratch… maybe there was a problem in the “update 2019 links” stage of the installation?

- Ron

Hi,

Quick followup:

I deleted my entire pgi directory and did a clean installation of 19.4.

I still get the seg fault for mpif90 when setting my ENV to point to linux86-64 (or linux86-64-llvm).

Is there a performance difference between the llvm and no-llvm compilers?

- Ron

Hi Ron,

All the links look fine. The only difference is that 19.4 looks to have been installed by root, though the link itself has full read, write, and execute permissions. Maybe the “.bin/mpif90” doesn’t have execute permissions? Though if that were the case, I’d expect a different error.

By chance, does the error occur if you run as root?

-Mat

Hi,

When I run using sudo it does not seg fault!

PGI1904: ~ $ mpif90
/usr/local/pgi/linux86-64/2019/mpi/openmpi-3.1.3/bin/mpif90: line 15: 13451 Segmentation fault      (core dumped) $MY_DIR/.bin/$EXE "$@"
PGI1904: ~ $ sudo mpif90
gfortran: fatal error: no input files
compilation terminated.

I have always installed PGI as sudo since the /usr/local/ directory is owned by root.

Is it better practice to install PGI not as root?

- Ron

Hi,

FYI I tried to install PGI NOT as root so the installation directory is owned by my user.

When I do this, mpif90 still seg faults unless I run it with sudo.

- Ron

Great! Not a solution but at least it’s a step forward.

First, let’s try the simplest possibility that it’s environmental, like a stack overflow. Can you try setting your user shell’s stack size to unlimited?
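
In a bash or sh session that would look roughly like this. It only affects the current shell, and raising the limit can fail if the hard limit is lower:

```shell
ulimit -s                      # print the current soft stack limit (kB or "unlimited")
# Try to lift the limit for this shell session only.
ulimit -s unlimited 2>/dev/null || echo "could not raise the stack limit"
ulimit -s                      # confirm the value, then rerun mpif90 from this same shell
```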

Next, do a “ldd” on “/usr/local/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3/bin/.bin/mpif90” and check the permissions of the dependent libraries.
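
As a rough sketch of that check, the loop below takes every library that ldd resolves to an absolute path and runs “ls -lL” on it, so the permissions shown are those of the real file rather than of a link. The helper name show_lib_perms is made up for illustration.

```shell
#!/bin/sh
# show_lib_perms is a hypothetical helper: it lists the shared libraries
# a binary resolves to, with the permissions of each underlying file.
show_lib_perms() {
    ldd "$1" | awk '$2 == "=>" && $3 ~ /^\// { print $3 }' |
    while IFS= read -r lib; do
        ls -lL "$lib"                    # -L follows symlinks to the real file
    done
    # Also report any library the loader could not find at all.
    ldd "$1" | grep 'not found' || true
}

bin=/usr/local/pgi/linux86-64-llvm/2019/mpi/openmpi-3.1.3/bin/.bin/mpif90
if [ -e "$bin" ]; then
    show_lib_perms "$bin"
fi
```

The paths in the output also show which directory each library is loaded from, which helps spot a copy picked up from somewhere unexpected in LD_LIBRARY_PATH.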

FYI, Ron emailed me directly and we were eventually able to find the issue.

His “.bin/mpif90” was pulling in a libz.so (via a setting in LD_LIBRARY_PATH) that had been built with the non-LLVM compilers, which isn’t compatible with the LLVM shared objects. He rebuilt libz.so with the LLVM compilers and it ran correctly. It worked under root since root was using the system libz.so.
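
For anyone hitting something similar, here is a small sketch for finding which LD_LIBRARY_PATH directories carry their own copy of a library such as libz.so. The helper name find_in_ld_path is made up for illustration.

```shell
#!/bin/sh
# find_in_ld_path is a hypothetical helper: it prints every file whose
# name starts with the given library name in each directory of a
# colon-separated path (LD_LIBRARY_PATH by default).
find_in_ld_path() {
    name=$1
    printf '%s\n' "${2:-${LD_LIBRARY_PATH:-}}" | tr ':' '\n' |
    while IFS= read -r dir; do
        [ -n "$dir" ] || continue        # skip empty entries
        for f in "$dir"/"$name"*; do
            if [ -e "$f" ]; then
                echo "$f"
            fi
        done
    done
}

find_in_ld_path libz.so
```

Any hit that comes before the system library directories is a candidate for the copy the loader picks up. Running “ldd” on the binary and looking at the libz line shows which one is actually used.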

-Mat

Also,

The LLVM PGI compiler would not work until I manually ran the “makelocalrc” script in the PGI bin directory to generate a localrc file.
For some reason, the installation of PGI did not do that step on my system.

Once that step was done, the LLVM compilers were able to build libz, and then mpif90 worked.

- Ron