GPU vs CPU (different answers..why?)

I am trying to compare the performance of the CPU vs. the GPU for a sparse matrix-vector multiplication code, but the answers from the two computations differ. When I run the GPU code in device-emulation mode, the answers match the CPU.
If anyone knows how to overcome this problem, please let me know.

(P.S. I have heard that by default the FPU in the CPU works at 80 bits, so all operations are performed internally at 80 bits and the answer is then rounded back to 64 bits, whereas GPU operations are performed at 64 bits throughout. Is this right? If so, how can I overcome this problem? Is it possible to perform operations on the CPU at 64 bits instead of 80 bits, if what I just mentioned is true?)

btw, I am using a 64-bit AMD machine with Fedora 10 and a Tesla C1060 GPU.

Yup… many x87/IA-32-class processors internally use 80-bit FP registers. So my first guess is that you may have to live with the different answers you’ve been observing :(
(not 100% sure though)

If you compile in 64-bit mode (I mean x86_64/AMD64/EM64T/Intel64…), then you are not affected by the 80-bit-related issues (your compiler uses “clean” SSE2 floating-point instructions instead of x87).

Since you mention 64-bit operations, I assume you only use double-precision arithmetic on the GPU?

If so, the only differences are:

  • Dependent muls and adds can be turned into fused multiply-add operations on the GPU (not to be confused with single-precision MADs!), which can make the GPU code more accurate.

  • The implementation of transcendental functions (exp, sin, pow…) is different and usually slightly less accurate on the GPU.

By the way, there is also a related nvcc bug in the wild: read [topic=“103726”]Topic 103726[/topic] and [topic=“105700”]Topic 105700[/topic].

So try compiling with -G or -Xopencc -O0 to see if this makes a difference.

hey thanks guys.

yes I am using double precision.

I’ll try your advice of using -G or -Xopencc -O0. I’ll keep posting my results.