Executing MPICH on different hardware - Illegal instruction

Hi,

I have a cluster in which all machines have identical hardware. This cluster has PGI 14.1 (Fortran and C) and MPICH v3.0.4 installed.

The CPU info of the master and slave machines is:

Master: Intel® Core™ i7-4771 CPU @ 3.50GHz
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm

Slaves: Intel® Core™ i7-4770 CPU @ 3.40GHz
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm


Now I am adding more slave machines to the cluster. However, these new slave machines do not have the same hardware; they are older than the current machines in the cluster.

The CPU info of the older slave machines is:

Slaves: Intel® Xeon® CPU E5520 @ 2.27GHz
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm ida dtherm tpr_shadow vnmi flexpriority ept vpid

When I try to run mpiexec on these older slave machines, the message "Illegal instruction" appears and the processes are not executed.

I think the problem is the difference between the CPU flags of the machines.
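Comparing the lists above, the older Xeons are missing avx, avx2, and fma, among others. The difference can also be listed directly by diffing the cpuinfo of a new node against an old one (the hostnames below are just placeholders):

diff <(ssh new-slave "grep -m1 flags /proc/cpuinfo | tr ' ' '\n' | sort") \
     <(ssh old-slave "grep -m1 flags /proc/cpuinfo | tr ' ' '\n' | sort")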

Is there a configuration that makes PGI and MPICH work together on slave machines with different hardware?

On the master machine I installed PGI with the default options.
MPICH was compiled with:
export CC=pgcc
export FC=pgf90
export F77=pgf77
export CXX=pgcpp
export CPP=cpp
export LD=ld
export CFLAGS='-fast'
export F77FLAGS='-fast'
export FFLAGS='-fast'
export CXXFLAGS='-fast'
./configure --prefix=/share/apps/mpich3.0.4 --enable-static --with-device=ch3:nemesis --with-pm=hydra --enable-shared --enable-debuginfo

Thanks,
Pedro Ivo Diógenis

Hi Pedro,

Try adding “-tp px” to each of the FLAGS variables you list.

PGI by default generates target code for the system you are currently compiling on. So, if you happen to be building MPICH on a Sandy Bridge system, then you will get an MPICH that uses instructions introduced with the Sandy Bridge architecture. Obviously, this will not run when you move to a pre-Sandy Bridge system.

The “-tp px” flag tells the PGI compiler to override this default, and generate target code for a generic x86 processor. This should allow the code to run everywhere. In this case, I believe this is exactly what you want to do.
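If you want to confirm what actually ended up in a build, one rough check (assuming binutils is available; the library path below matches your --prefix, but the library name may differ between MPICH versions) is to disassemble the installed library and count AVX/FMA mnemonics, which an E5520 cannot execute:

objdump -d /share/apps/mpich3.0.4/lib/libmpich.so | grep -c -E 'vmovap|vfmadd'

A non-zero count means the library contains instructions the older Xeons do not support.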

Hope this helps.

Best regards,

+chris

Hi,

Thanks for the answer.

I made the changes to the FLAGS variables, adding '-tp px':

export CC=pgcc
export FC=pgf90
export F77=pgf77
export CXX=pgcpp
export CPP=cpp
export LD=ld
export CFLAGS='-fast -tp px'
export F77FLAGS='-fast -tp px'
export FFLAGS='-fast -tp px'
export CXXFLAGS='-fast -tp px'


Unfortunately, it did not resolve my problem.

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 132
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Illegal instruction (signal 4)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions


Best regards,
Pedro Ivo Diógenis

Hi Chris,

I added the FCFLAGS variable and my processes ran.

export CC=pgcc
export FC=pgf90
export F77=pgf77
export CXX=pgcpp
export CPP=cpp
export LD=ld
export CFLAGS='-fast -tp px'
export F77FLAGS='-fast -tp px'
export FFLAGS='-fast -tp px'
export CXXFLAGS='-fast -tp px'
export FCFLAGS='-fast -tp px'


Now I need to run more complex jobs. The WRF model does not compile at the optimization level you indicated; with this modification to the MPICH flags, WRF no longer compiles with -O3.

I reduced the optimization level to -O2 in configure.wrf and am now trying to compile the model.
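For reference, the change was along these lines; FCOPTIM is the optimization variable in my generated configure.wrf, but the exact variable name may differ depending on the configuration selected:

# in configure.wrf
# before: FCOPTIM = -O3 ...
FCOPTIM = -O2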

I work with mathematical modelling (CMAQ and WRF).

Best regards,
Pedro Ivo Diógenis

Hi Chris,

The command “pgfortran -V” shows:
Master --> pgfortran 14.1-0 64-bit target on x86-64 Linux -tp haswell
Slaves --> pgfortran 14.1-0 64-bit target on x86-64 Linux -tp haswell
Old slaves --> pgfortran 14.1-0 64-bit target on x86-64 Linux -tp nehalem

I tried many combinations of MPICH and PGI flags, but I was unable to run my processes with MPICH.

I tried the following flags:
-tp px
-tp core2-64
-tp x64,sandybridge-64 -fast
-tp x64

With these -tp flags, mpiexec only runs simple processes. When I try to run more complex processes, it fails with "Illegal instruction (signal 4)".

WRF now compiles, but it does not run; it also fails with "Illegal instruction (signal 4)".

Unfortunately, nothing works for me.

Do you have any idea?

Best regards,
Pedro Ivo Diógenis

Hi Pedro,

Which flags are you passing to compile WRF? Note that the flags passed to compile WRF are independent of the flags used to compile MPICH.

Bottom line: if you are running on a mix of Haswell and Nehalem systems, you will need to pass -tp nehalem when building both MPICH and WRF. If you only pass this flag when building MPICH, then you will still get Haswell instructions when compiling WRF unless you pass this flag during the WRF build as well.
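Concretely, that means something like the following when rebuilding MPICH (reusing the exports from your earlier post), plus the same -tp option appended to the optimization flags in configure.wrf. Treat this as a sketch; the configure.wrf variable names depend on the configuration you selected:

export CFLAGS='-fast -tp nehalem'
export FFLAGS='-fast -tp nehalem'
export F77FLAGS='-fast -tp nehalem'
export FCFLAGS='-fast -tp nehalem'
export CXXFLAGS='-fast -tp nehalem'

# in configure.wrf, for example:
FCOPTIM = -O2 -tp nehalem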

Alternatively, you might try logging into one of the Nehalem systems (if possible) and building both MPICH and WRF there. The resulting builds should run fine on the Haswell systems.

Hope this helps,

+chris