REAL*16 implementation?

sergio.j.larrondo · December 14, 2004, 4:43pm

Are there any plans or workarounds for providing REAL*16? The max is REAL*8, and we have some apps we’d like to port over from the Alpha, but this seems to be a limitation.

MatColgrove · December 14, 2004, 9:03pm

Hello,

At this time we don’t plan on supporting REAL*16. This is due to the lack of hardware support and the extreme performance penalty of software emulation. Of course, if we see more demand then we’ll reconsider.

Thanks,
Mat

aragons · December 21, 2004, 10:13pm

In 32 bit systems, we have had double precision for a long time. A native word in a 32 bit system is real*4. My basic question is this: why can’t the technology that was used to establish real*8 in a 32 bit system WITHOUT significant execution penalty be used to provide us real*16 in a 64 bit system?

If Cray did it, and DEC did it with the Alpha, why can’t PGI do it for Opteron? I think there are many of us in the pure number crunching community who would be quite interested in quad precision being done efficiently on a 64 bit system. There must be something I’m missing; please enlighten me.

Thanks.

MatColgrove · December 22, 2004, 5:02pm

In 32-bits, there is double precision hardware support. The x87 chip performs 80-bit floating point calculations and SSE performs 64-bit. As you suggest, the ideal situation would be for the hardware vendors to also support quad precision so there wouldn’t be a severe performance penalty. Alas, this support is unavailable, thus requiring software emulation for true REAL*16 support. (Note that some implementations of REAL*16 are really REAL*10 and use the x87 chip.)

When we ask customers what they want to focus our efforts on, the overwhelming choice is high performance. Of the few people who ask for REAL*16, most decide that they only want this support if performance can be maintained. Yes, Cray and DEC have created higher performance quad precision packages for their own architectures. However, PGI is independent of any general computing chip manufacturer and seeks to provide equally high performance, no matter who the vendor. As a matter of convenience, we ship AMD’s tuned math library ACML with our product, but we also work with Intel’s MKL.

There are several free libraries available on the web which will emulate quad precision using our compilers. From your favorite search engine, a search for “quad precision fortran library” will yield several solutions.

Good Luck,
Mat
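
To give a feel for what software emulation involves, here is a minimal sketch of "double-double" arithmetic, one common way to get extra precision out of ordinary 64-bit hardware. It is written in Python for brevity rather than Fortran, the function names are illustrative, and it is not taken from any particular library mentioned in this thread. Note also that double-double gives roughly 106 significand bits with the exponent range of a double, so it is not the same thing as a true IEEE 128-bit REAL*16.

```python
from fractions import Fraction

def two_sum(a, b):
    """Error-free addition: s = fl(a + b) and a + b == s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def split(a):
    """Veltkamp split of a double into a high and a low half."""
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    lo = a - hi
    return hi, lo

def two_prod(a, b):
    """Dekker's error-free product: a * b == p + e (barring overflow/underflow)."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

def dd_add(x, y):
    """Add two double-double values, each stored as a (hi, lo) pair of doubles."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)  # renormalize the pair

def dd_mul(x, y):
    """Multiply two double-double values, each stored as a (hi, lo) pair."""
    p, e = two_prod(x[0], y[0])
    e += x[0] * y[1] + x[1] * y[0]
    return two_sum(p, e)  # renormalize the pair

def dd_to_fraction(x):
    """Exact rational value of a (hi, lo) pair, handy for checking results."""
    return Fraction(x[0]) + Fraction(x[1])

# In plain double precision the 2**-80 term is lost entirely;
# in double-double it survives in the low word.
plain = 1.0 + 2.0**-80
extended = dd_add((1.0, 0.0), (2.0**-80, 0.0))
print(plain, extended)
```

Each `dd_add` here costs over a dozen ordinary floating point operations and `dd_mul` costs more, which is the kind of overhead the performance concern above refers to.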

Johnix · March 11, 2005, 7:30am

Hi,

I was just reading the AMD64 Architecture Programmer’s Manual. It sounds like the 128-bit media and scientific instructions have better performance than x87 instructions, and it suggests that replacing x87 code with 128-bit media code is the first choice for improving performance.

“Code written with 128-bit media floating-point instructions can operate in parallel on four times as many single-precision floating-point operands as can x87 floating-point code. This achieves potentially four times the computational work of x87 instructions that use single-precision operands. Also, the higher density of 128-bit media floating-point operands may make it possible to remove local temporary variables that would otherwise be needed in x87 floating-point code. 128-bit media code is easier to write than x87 floating-point code, because the XMM register file is flat rather than stack-oriented, and, in 64-bit mode there are twice the number of XMM registers as x87 registers.”

I am not sure whether I understand the idea. But if that is true, quad precision is achieved naturally, with no penalty at all.

Thanks for comments.

MatColgrove · March 11, 2005, 4:13pm

Hi Johnix,

The AMD XMM floating point registers and instructions are 128-bit. However, the floating point data types are still only 32 and 64 bits wide (Section 4.4.6 of the AMD Architecture Guide). This means you can perform up to four simultaneous single-precision (32-bit) floating point calculations or two double-precision (64-bit) floating point calculations when using vector or “packed” instructions. Using the “-Mvect=sse” optimization, which is part of the aggregate flag “-fastsse”, tells the compiler to generate these vector instructions.

Hope this helps,
Mat
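
The lane layout described above can be modeled in a few lines. This is only a Python sketch of the idea, not real SSE code: `XMM_BYTES`, `lanes`, and `packed_add` are made-up names, and a 16-byte string stands in for a 128-bit XMM register. The point is that the 128 bits are carved into independent 32-bit or 64-bit values, so a packed add yields four single-precision or two double-precision results, never one 128-bit result.

```python
import struct

XMM_BYTES = 16  # a 128-bit register modeled as 16 raw bytes

def lanes(fmt_char):
    """Number of values of the given type that fit in 128 bits ('f' = 32-bit, 'd' = 64-bit)."""
    return XMM_BYTES // struct.calcsize("<" + fmt_char)

def packed_add(a, b, fmt_char):
    """Model a packed add: split both registers into lanes, add lane by lane, repack."""
    fmt = "<%d%s" % (lanes(fmt_char), fmt_char)
    xs = struct.unpack(fmt, a)
    ys = struct.unpack(fmt, b)
    return struct.pack(fmt, *(x + y for x, y in zip(xs, ys)))

# Four single-precision lanes in one register.
reg_a = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
reg_b = struct.pack("<4f", 0.5, 0.5, 0.5, 0.5)
singles = struct.unpack("<4f", packed_add(reg_a, reg_b, "f"))

# Two double-precision lanes in the same amount of space.
reg_c = struct.pack("<2d", 1.0, 2.0)
reg_d = struct.pack("<2d", 0.25, 0.25)
doubles = struct.unpack("<2d", packed_add(reg_c, reg_d, "d"))

print(lanes("f"), singles)
print(lanes("d"), doubles)
```

Each lane still has only single or double precision, which is why the wider registers raise throughput but do not provide REAL*16.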