I am comparing a large vector dot product implemented with SSE against a CUDA version, but when I run it in a loop, the SSE version prints this result:
numeric
-1#…
numeric
-1#…
.
.
.
.
.
.
The “numeric” value should be different on every iteration, but it is always the same, and the result overflows on the odd iterations. Why? This is my SSE code:
[FONT=Courier]
float sdot(const float3* v0,const float3* v1,int n)
{
   float lmem;
   __asm{
       MOV     esi,DWORD PTR [v0]
       MOV     edi,DWORD PTR [v1]
       MOV     ecx,n
       PXOR    xmm0,xmm0
 CALC:
       MOVAPS xmm1,XMMWORD PTR [esi]
       MOVAPS xmm2,XMMWORD PTR [edi]
       MULPS xmm1,xmm2
       ADDPS xmm0,xmm1
       ADD     esi,16
       ADD     edi,16
       LOOP    CALC
       HADDPS xmm0,xmm0
       HADDPS xmm0,xmm0
       LEA    edi,DWORD PTR [lmem]
       MOVSS DWORD PTR [edi],xmm0
   }
   return lmem;
}
[/FONT]
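
For comparison, a plain scalar version of what the routine above is meant to compute might look like the sketch below. It assumes float3 is three packed floats (12 bytes per element, as with CUDA's vector type of the same name); the struct definition and the name sdot_ref are illustrative only and are not part of the code above.

```c
/* Assumed layout: three packed floats, 12 bytes per element. */
typedef struct { float x, y, z; } float3;

/* Hypothetical scalar reference: sums dot(v0[i], v1[i]) over n elements. */
float sdot_ref(const float3 *v0, const float3 *v1, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        /* Each step advances by one float3, i.e. 12 bytes, not 16. */
        sum += v0[i].x * v1[i].x
             + v0[i].y * v1[i].y
             + v0[i].z * v1[i].z;
    }
    return sum;
}
```

Running this on the same input as the SSE routine gives a value to check the assembly output against, up to floating-point rounding from the different summation order.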