There are actually several differences between “-fast” and “-fastsse” that can result in different answers when running the same code. First off, -fast and -fastsse are each shorthand for a set of optimization flags that generally gives the best performance. -fast is “-O2 -Munroll=c:1 -Mnoframe -Mlre” and -fastsse is -fast plus “-Mscalarsse -Mvect=sse -Mcache_align -Mflushz”.
The biggest difference is the “-Mscalarsse -Mvect=sse” flags, which tell the compiler to generate SSE code, while -fast generates x87 code. SSE is generally faster: it can perform multiple floating point calculations per clock cycle, while x87 performs only one 80-bit calculation per cycle and is harder to generate optimized code for.
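To make "multiple floating point calculations per clock cycle" a bit more concrete, here is a minimal sketch that adds two arrays two doubles at a time using SSE2 intrinsics. This is hand-written illustration code, not what the compiler emits for -Mvect=sse; the function name add_arrays is made up for this example, and the __SSE2__ check is an assumption about which predefined macro your compiler sets (if it is not defined, the plain scalar loop does all the work).

```c
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_add_pd, ... */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(const double *a, const double *b, double *out, size_t n)
{
    size_t i = 0;

#if defined(__SSE2__)
    /* One 128-bit SSE register holds two 64-bit doubles, so each
       _mm_add_pd performs two additions with a single instruction. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);   /* unaligned load of a[i], a[i+1] */
        __m128d vb = _mm_loadu_pd(b + i);   /* unaligned load of b[i], b[i+1] */
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }
#endif

    /* Scalar loop: handles the leftover element when n is odd, or the
       whole array when SSE2 intrinsics are not available. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

For a simple element-wise add like this, the SSE and scalar paths give identical results; the differences in answers come from the precision and ordering effects, not from the packed add itself.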
One reason you're seeing precision differences is that, for double precision floating point values, SSE uses a 64-bit register while x87 uses an 80-bit register. Although values are rounded to 64 bits when stored to memory, a good compiler will try to keep values in the x87 registers, so intermediate results carry the extra bits. The more calculations are done, the more impact those extra bits have. Also, SSE code may use different algorithms, which can give slightly different results.
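Both effects can be reproduced in a few lines of C. The sketch below is only an illustration: the function names are invented for this example, and it assumes that long double maps to the 80-bit x87 format, which is true for many x86 compilers but not all. sum_sequential mimics 64-bit accumulation, sum_extended mimics keeping the running total in an 80-bit register, and sum_two_lanes mimics the kind of reordering a vectorized loop might do.

```c
#include <stddef.h>

/* Running total kept in a 64-bit double, as with SSE scalar code. */
double sum_sequential(const double *v, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += v[i];
    return total;
}

/* Running total kept in long double and rounded to double only at the
   end. ASSUMPTION: long double is the 80-bit x87 format; on platforms
   where long double is the same as double this behaves exactly like
   sum_sequential. */
double sum_extended(const double *v, size_t n)
{
    long double total = 0.0L;
    for (size_t i = 0; i < n; ++i)
        total += v[i];
    return (double)total;
}

/* Two independent partial sums (even and odd elements), combined at
   the end. A vectorized reduction may add the values in an order like
   this instead of strictly left to right. */
double sum_two_lanes(const double *v, size_t n)
{
    double lane0 = 0.0, lane1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        lane0 += v[i];
        lane1 += v[i + 1];
    }
    if (i < n)              /* leftover element when n is odd */
        lane0 += v[i];
    return lane0 + lane1;
}
```

As a usage example, take the four values {1e16, 1.0, -1e16, 1.0}. With strict 64-bit double arithmetic, sum_sequential returns 1.0, because 1e16 + 1.0 rounds back to 1e16 and the first 1.0 is lost. sum_two_lanes returns 2.0, since the two large values cancel in one lane and the two small values add up in the other. Where long double really is 80 bits wide, sum_extended also returns 2.0, because the extra bits are enough to hold 1e16 + 1.0 exactly.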
The FAQ section has a more detailed guide on precision issues on x86 systems that you might want to read (see /support/execute.htm#precision).