21.7 Illegal instruction (core dumped)

Andy_May · July 23, 2021, 9:51am

When running any program compiled with 21.7 compilers I see:

/host/test> nvfortran test.f90
/host/test> ./a.out 
Illegal instruction (core dumped)
/host/test> cat test.f90
program main
end

There were no such problems with the 21.5 compilers. The compiler is installed in a container and run on a reasonably old machine. My guess is that some instruction (AVX?) is now required that previously was not, but I can’t see anything in the release notes. Any ideas? I can probably switch the container to a different machine if necessary, but it would be good to know which instruction is now required, and whether it’s available on the free runners on GitLab etc.

Many thanks,

Andy

MatColgrove · July 23, 2021, 5:36pm

Hi Andy,

Yes, typically this type of issue occurs when a binary built for a newer processor is run on an older processor that lacks support for a newer instruction set.

The compiler auto-detects the processor on which the binary is built and compiles the code accordingly. So I’m not sure whether the problem is that the processor is one we no longer support, so the compiler is not detecting it; that the compiler is somehow misdetecting the processor because of the container; or that something in our runtime is not guarded, so a new instruction is being used even on an older processor.

Can you provide the output from the command “nvcpuid” so we can see what the compiler is detecting as the processor? Is the info correct? If not, what’s the actual processor being used?

What happens if you manually set the target processor flag (-tp) as shown in the nvcpuid output, or if you use “-tp px” (target generic x86)?
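
A minimal sketch of pulling that flag out of the nvcpuid output, assuming a POSIX shell and that nvcpuid prints a "default target" line in the form shown elsewhere in this thread. The helper name default_target is hypothetical and not part of the HPC SDK.

```shell
# default_target: read nvcpuid output on stdin and print the value of its
# "default target" line (for example "-tp nehalem").
# Hypothetical helper; prints nothing if no such line is present.
default_target() {
    sed -n 's/^default target[[:space:]]*:[[:space:]]*//p'
}
```

It could then be used as: nvfortran $(nvcpuid | default_target) test.f90 (left unquoted on purpose so that "-tp" and the target name are passed as two arguments).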

-Mat

Andy_May · July 23, 2021, 7:29pm

Thanks for your reply. The output of nvcpuid from the 21.7 install is:

> nvcpuid
vendor id       : GenuineIntel
model name      : Intel(R) Xeon(R) CPU           E5540  @ 2.53GHz
cpu family      : 6
model           : 26
name            : Nehalem 45nm
stepping        : 5
processors      : 16
threads         : 2
clflush size    : 8
L2 cache size   : 256KB
L3 cache size   : 8192KB
flags           : acpi apic cflush cmov cplds cx8 cx16 de dtes ferr fpu fxsr
flags           : ht lm mca mce mmx monitor msr mtrr nx pae pat pdcm pge
flags           : popcnt pse pseg36 selfsnoop speedstep sep sse sse2 sse3
flags           : ssse3 sse4.1 sse4.2 syscall tm tm2 tsc vme xtpr
default target  : -tp nehalem

This is identical to the output in the working 21.5 container. Setting the -tp option doesn’t help, I’m afraid:

/host/test> nvfortran test.f90
/host/test> ./a.out 
Illegal instruction (core dumped)
/host/test> nvfortran -tp px test.f90
/host/test> ./a.out 
Illegal instruction (core dumped)
/host/test> nvfortran -tp nehalem test.f90
/host/test> ./a.out 
Illegal instruction (core dumped)

All of those cases run fine with 21.5.

Andy

MatColgrove · July 26, 2021, 7:39pm

Hi Andy,

I talked with engineering, and it looks like they discontinued support for non-AVX x86_64 processors, so you won’t be able to use 21.7 with this system. They did miss documenting this and putting the appropriate checks in the compiler drivers, for which we apologize; we will get this corrected.
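
Until such a check exists in the drivers, a machine or CI runner can be tested by hand. The following is a minimal sketch, assuming an x86 Linux host where /proc/cpuinfo lists CPU features on lines beginning with "flags"; the function name has_avx is hypothetical.

```shell
# has_avx [FILE]: succeed (exit status 0) if any line starting with "flags"
# in FILE lists "avx" as a separate word; FILE defaults to /proc/cpuinfo.
# Hypothetical helper. Only an exact "avx" counts; "avx2" or "avx512f"
# on their own do not.
has_avx() {
    awk '/^flags/ { for (i = 1; i <= NF; i++) if ($i == "avx") found = 1 }
         END { exit !found }' "${1:-/proc/cpuinfo}"
}
```

For example: has_avx || echo "CPU lacks AVX" >&2. The nvcpuid output shown in this thread also uses lines beginning with "flags", so a saved copy of that output can be passed as FILE.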

-Mat

Andy_May · July 29, 2021, 8:19pm

Mat,

Thanks very much for looking into this for me. Now that I know failing on non-AVX processors is expected behaviour, I’ll look at moving the container to a different machine.

Andy

Paul_H_Hargrove · August 2, 2021, 2:01am

For the record, there seems to be a similar issue with at least the math builtins for C and C++. So, I’ll describe the issue here with the expectation that it might help another user find this issue in a search.

I have seen the exact same issue with Fortran on a Nehalem system. In addition to that, we see that a near-trivial (silly in this example) use of math libs from C or C++ fails on (only) such older systems as follows:

$ cat badmath.c
#include <math.h>
int main(int argc) { return (int)sin((double)argc); }
$ nvc -lm badmath.c && ./a.out
Error during math dispatch processing...
__nvmath_abort:Math dispatch table is either misconfigured or corrupted.

I can (so far) work around this using -Mnobuiltin. However, based on Mat’s statement that CPUs of this age are no longer supported, I am not going to assume that is a complete fix for the issue.

guscorrea · September 27, 2022, 6:26pm

Same issue reported by Andy_May here: compiles with nvfortran, then core dump at runtime. Very simple “Hello, World” code in Fortran.

The processor is Westmere:

vendor id : GenuineIntel
model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
cpu family : 6
model : 44
name : Westmere 32nm
stepping : 2
cores : 12328
sockets : 2
processors : 16
threads : 2
clflush size : 8
L1i cache size : 32KB
L1d cache size : 32KB
L2 cache size : 256KB
L3 cache size : 12288KB
flags : acpi aes apic cflush cmov cplds cx8 cx16 de dtes ferr fpu
flags : fxsr ht lm mca mce mmx monitor msr mtrr nx pae 1GBpages pat
flags : pdcm pge popcnt pse pseg36 rdtscp selfsnoop speedstep sep
flags : sse sse2 sse3 ssse3 sse4.1 sse4.2 syscall tm tm2 tsc vme
flags : xtpr
default target : -tp nehalem

I tried to compile with -tp=native but got the same core dump error.

man nvfortran shows that Sandy Bridge is the oldest architecture supported.
Too bad … Back compatibility was a hallmark of PGI compilers.
NVidia could do the same, everybody has old but still functional machines worth using.

Which version of the NVidia HPC SDK still supports Westmere?

Thank you,
Gus Correa

MatColgrove · September 27, 2022, 6:53pm

Hi Gus,

No, sorry, we require the CPU to provide AVX support. Per our release notes:

Programs generated by the HPC Compilers for x86_64 processors require a minimum of AVX instructions, which includes Sandy Bridge and newer CPUs from Intel, as well as Bulldozer and newer CPUs from AMD. POWER 8 and POWER 9 CPUs from the POWER architecture are supported.

-Mat

guscorrea · September 27, 2022, 7:12pm

Thank you Mat.

Could you please tell us what is the newest version of NVidia HPC SDK that still supports Westmere, Nehalem,
and other still useful museum relics?
(OK, I don’t have Pentium III’s anymore … but they were
the workhorse in our first HPC cluster.)

Actually, is there a Release Notes document for each version that
would have this information for different processors, what is supported, what is phased out?
That would be very helpful.
We all love - and use - our old machines!

Thank you,
Gus

MatColgrove · September 28, 2022, 4:11pm

Hi Gus,

If I remember correctly, support for Nehalem was dropped in the mid-2018 timeframe. Keep in mind that dropping support doesn’t mean that it won’t work, but rather that we no longer specifically test on this system, nor would we fix Nehalem-specific issues.

I believe the AVX change started in 21.9, so 21.7 should still target non-AVX systems. The main change was made so we could add AVX support to our runtime and math libraries, thus boosting the performance of both. Given there are very few non-AVX systems left in production, it seemed best to make this change to help the wider community, as opposed to not using AVX in order to support discontinued processors.

The best place to look for the supported systems is by running the compilers with the “-help -tp” flag. I can ask about adding the list to the release notes, but personally I’d be hesitant to add this. The list of target processors is more of a convenient way to set the baseline set of CPU instructions rather than targeting a specific architecture.

Consider a CPU vendor that has two versions of an architecture, revA and revB, that use the same instruction sets but may vary in other ways. From the compiler’s perspective, they are the same, so we just include revA under the -tp switch. If we only document revA, does this cause confusion about whether we support revB? Again, the compiler sees them as the same, so the answer is that we do support it, but I have had folks ask this. We could list the instruction sets, but this is a long list and most folks aren’t too familiar with all the different sets.

Sorry for the long-winded answer, but it illustrates the issues with trying to document these details, and why we simplified the release note to just say that a minimum of AVX support is required. For the vast majority of folks, the details don’t matter and only cause confusion.

-Mat