OpenMPI 4.0.1 compilation fails with PGI 19.4 compiler

Hi,

We are trying to build OpenMPI 4.0.1 with the newest PGI 19.4 compiler on a DGX2, but the compilation fails partway through.

Our path and library path are set to:

PATH:            /opt/pgi/linux86-64/19.4/bin
LD_LIBRARY_PATH: /opt/pgi/linux86-64/19.4/lib

We run the configuration script with:

./configure CPP=cpp CC=pgcc CXX=pgc++ F77=pgf77 FC=pgf90 --prefix=/opt/openmpi --with-cuda=/opt/cuda/9.2 --without-verbs

After configuration finishes, we run the build, but it fails with the following errors:

PGC-S-0039-Use of undeclared variable argv (oob_tcp_component.c: 1359)
PGC-S-0054-Subscript operator ([]) applied to non-array (oob_tcp_component.c: 1359)
PGC-W-0046-Non-integral array subscript is cast to int (oob_tcp_component.c: 1359)
PGC-S-0039-Use of undeclared variable save (oob_tcp_component.c: 1359)
.
.
.
.
PGC-F-0008-Error limit exceeded (oob_tcp_component.c: 1371)
PGC/x86-64 Linux 19.4-0: compilation aborted

The file where it starts to fail is:

orte/mca/oob/tcp/oob_tcp_component.c

Previously we compiled OpenMPI 4.0.1 with the PGI 19.1 compiler without problems, but with 19.4 the build fails.

Hi,

Very sorry you have encountered this issue with the PGI 19.4 release. I have logged this issue in our system as TPR 27140.

In the meantime, I can suggest a couple of workarounds:

  • Revert to the 19.3 or 19.1 releases when compiling Open MPI.
  • Try using CC=pgcc18 instead of CC=pgcc when calling Open MPI’s configure script. pgcc18 uses a newer C frontend which will eventually become the default C frontend for PGI compilers.
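
As a sketch of the second workaround, the configure line from the original post can be reused with only the C compiler changed. The variable name CONFIGURE_ARGS is just for illustration, and the install prefix and CUDA path are the ones from the original post, so adjust them for your system:

```shell
# Configure arguments from the original post, with CC switched from pgcc to pgcc18.
# Assumption: pgcc18 is found on PATH in the same PGI 19.4 bin directory as pgcc.
CONFIGURE_ARGS="CPP=cpp CC=pgcc18 CXX=pgc++ F77=pgf77 FC=pgf90 --prefix=/opt/openmpi --with-cuda=/opt/cuda/9.2 --without-verbs"

if [ -x ./configure ]; then
    # Run from the top of the Open MPI source tree.
    # $CONFIGURE_ARGS is deliberately unquoted so it splits into separate arguments.
    ./configure $CONFIGURE_ARGS && make
else
    echo "would run: ./configure $CONFIGURE_ARGS"
fi
```

The guard only keeps the snippet harmless when it is run outside the source tree; inside the tree it is equivalent to calling ./configure directly with CC=pgcc18.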

Regards,

+chris

Thank you for your answer! I tried CC=pgcc18 and it worked!

Strangely enough, after initially reproducing this problem, I am now unable to reproduce it again for our developers. I followed the instructions as written, although Open MPI’s ./configure script rejects the F77 and F90 variables. Can you give me some more information about the system you were testing on? Which OS and OS version were you using?

Thanks,

+chris

Hello cparrott,

I have the same problem, but using pgcc18 didn’t solve it.

My OS details:
LSB Version: core-10.2019031300ubuntu1-noarch:security-10.2019031300ubuntu1-noarch
Distributor ID: Ubuntu
Description: Ubuntu 19.04
Release: 19.04
Codename: disco

Last lines with the compilation error:
make[2]: Entering directory ‘/home/ea461/Downloads/openmpi/openmpi-4.0.1/opal’
CCLD libopen-pal.la
make[2]: Leaving directory ‘/home/ea461/Downloads/openmpi/openmpi-4.0.1/opal’
Making all in mca/common/cuda
make[2]: Entering directory ‘/home/ea461/Downloads/openmpi/openmpi-4.0.1/opal/mca/common/cuda’
CC common_cuda.lo
PGC-S-0039-Use of undeclared variable OPAL_HWLOC201_hwloc_OBJ_CACHE (/usr/include/hwloc/helper.h: 490)
PGC-S-0060-online_cpuset is not a member of this struct or union (/usr/include/hwloc/helper.h: 840)
PGC-W-0095-Type cast required for this conversion (/usr/include/hwloc/helper.h: 840)
PGC-W-0155-Pointer value created from a nonlong integral type (/usr/include/hwloc/helper.h: 840)
PGC-S-0060-allowed_cpuset is not a member of this struct or union (/usr/include/hwloc/helper.h: 857)
PGC-W-0095-Type cast required for this conversion (/usr/include/hwloc/helper.h: 857)
PGC-W-0155-Pointer value created from a nonlong integral type (/usr/include/hwloc/helper.h: 857)
PGC-S-0060-allowed_nodeset is not a member of this struct or union (/usr/include/hwloc/helper.h: 909)
PGC-W-0095-Type cast required for this conversion (/usr/include/hwloc/helper.h: 909)
PGC-W-0155-Pointer value created from a nonlong integral type (/usr/include/hwloc/helper.h: 909)
PGC-S-0060-distances_count is not a member of this struct or union (/usr/include/hwloc/helper.h: 1069)
PGC-S-0060-distances is not a member of this struct or union (/usr/include/hwloc/helper.h: 1070)
PGC-S-0054-Subscript operator ([]) applied to non-array (/usr/include/hwloc/helper.h: 1070)
PGC-S-0055-Illegal operand of indirection operator (*) (/usr/include/hwloc/helper.h: 1070)
PGC-S-0059-Struct or union required on left of . or -> (/usr/include/hwloc/helper.h: 1070)
PGC-S-0060-distances is not a member of this struct or union (/usr/include/hwloc/helper.h: 1071)
PGC-S-0054-Subscript operator ([]) applied to non-array (/usr/include/hwloc/helper.h: 1071)
PGC-W-0095-Type cast required for this conversion (/usr/include/hwloc/helper.h: 1071)
PGC-W-0155-Pointer value created from a nonlong integral type (/usr/include/hwloc/helper.h: 1071)
PGC-S-0060-distances_count is not a member of this struct or union (/usr/include/hwloc/helper.h: 1123)
PGC-S-0060-distances is not a member of this struct or union (/usr/include/hwloc/helper.h: 1124)
PGC-S-0054-Subscript operator ([]) applied to non-array (/usr/include/hwloc/helper.h: 1124)
PGC-S-0055-Illegal operand of indirection operator (*) (/usr/include/hwloc/helper.h: 1124)
PGC-S-0059-Struct or union required on left of . or -> (/usr/include/hwloc/helper.h: 1124)
PGC-S-0060-distances is not a member of this struct or union (/usr/include/hwloc/helper.h: 1125)
PGC-S-0054-Subscript operator ([]) applied to non-array (/usr/include/hwloc/helper.h: 1125)
PGC-S-0055-Illegal operand of indirection operator (*) (/usr/include/hwloc/helper.h: 1125)
PGC-S-0059-Struct or union required on left of . or -> (/usr/include/hwloc/helper.h: 1125)
PGC-S-0060-distances is not a member of this struct or union (/usr/include/hwloc/helper.h: 1128)
PGC-S-0054-Subscript operator ([]) applied to non-array (/usr/include/hwloc/helper.h: 1128)
PGC-W-0095-Type cast required for this conversion (/usr/include/hwloc/helper.h: 1128)
PGC-W-0155-Pointer value created from a nonlong integral type (/usr/include/hwloc/helper.h: 1128)
PGC-S-0060-latency is not a member of this struct or union (/usr/include/hwloc/helper.h: 1162)
PGC-S-0060-latency is not a member of this struct or union (/usr/include/hwloc/helper.h: 1163)
PGC-W-0095-Type cast required for this conversion (/usr/include/hwloc/helper.h: 1163)
PGC-W-0155-Pointer value created from a nonlong integral type (/usr/include/hwloc/helper.h: 1163)
PGC-S-0060-nbobjs is not a member of this struct or union (/usr/include/hwloc/helper.h: 1164)
PGC-F-0008-Error limit exceeded (/usr/include/hwloc/helper.h: 1164)
PGC/x86-64 Linux 19.4-0: compilation aborted
make[2]: *** [Makefile:1939: common_cuda.lo] Error 1
make[2]: Leaving directory ‘/home/ea461/Downloads/openmpi/openmpi-4.0.1/opal/mca/common/cuda’
make[1]: *** [Makefile:2375: all-recursive] Error 1
make[1]: Leaving directory ‘/home/ea461/Downloads/openmpi/openmpi-4.0.1/opal’
make: *** [Makefile:1893: all-recursive] Error 1

It seems the problem was solved by installing CUDA from the NVIDIA website using cuda_10.1.168_418.67_linux.run instead of installing it via apt.
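
One plausible explanation, not confirmed anywhere in this thread, is that the apt packages put the CUDA headers in /usr/include next to the system hwloc headers, so the CUDA include path makes the compiler pick up /usr/include/hwloc/helper.h instead of the hwloc bundled with Open MPI. The sketch below checks whether a candidate CUDA include directory also holds hwloc.h. The helper name check_include_dir is hypothetical, and the two directories in the loop are assumptions about typical apt and runfile locations:

```shell
# Hypothetical helper: report whether a directory holds both cuda.h and hwloc.h.
# Returns 1 when both are present (a possible header conflict), 0 otherwise.
check_include_dir() {
    dir=$1
    if [ -f "$dir/cuda.h" ] && [ -f "$dir/hwloc.h" ]; then
        echo "conflict: $dir contains both cuda.h and hwloc.h"
        return 1
    fi
    echo "no conflict found in: $dir"
    return 0
}

# Example directories (assumptions): an apt-style and a runfile-style location.
for d in /usr/include /usr/local/cuda-10.1/include; do
    check_include_dir "$d" || true
done
```

If the directory passed to --with-cuda reports a conflict, pointing --with-cuda at a CUDA install in its own prefix (as the runfile installer provides) would be the thing to try.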