Same algorithm different results

I have a problem that I can’t wrap my mind around. I’m using cutil_math.h in my project.

I have a function that looks like this and works:

inline __device__ __host__
float3 matrix4x4Mulf3( const float* m, const float3 &a )
{
  float x = a.x * m[0] + a.y * m[1] + a.z * m[2] + m[3];
  float y = a.x * m[4] + a.y * m[5] + a.z * m[6] + m[7];
  float z = a.x * m[8] + a.y * m[9] + a.z * m[10] + m[11];
  return make_float3(x,y,z);
}

and another implementation that looks like this and doesn’t work:

inline __device__ __host__
float3 matrix4x4Mulf3( const float* m, const float3 &a )
{
  float4 s = make_float4(a,1);
  float x = dot( s, make_float4( m[0], m[1], m[2], m[3] ) );
  float y = dot( s, make_float4( m[4], m[5], m[6], m[7] ) );
  float z = dot( s, make_float4( m[8], m[9], m[10], m[11] ) );
  return make_float3( s );
}

What’s also interesting is that the non-working version seems to require dramatically fewer registers. The kernel’s register count goes from 31 down to 10 when using the second implementation; however, the results are wrong.

Hi,

Is the problem in the last statement of the second example? I mean:

 return make_float3(x,y,z);

instead of

 return make_float3( s );

Regards

Ilghiz
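For reference, here is a minimal host-only sketch of the second version with the return statement fixed, checked against the first version. It assumes stand-in definitions of float3, float4, make_float3, make_float4 and dot so that it builds without the CUDA toolkit; in the real project these would come from cutil_math.h, and the functions would keep their __device__ __host__ qualifiers. The names matrix4x4Mulf3_scalar and matrix4x4Mulf3_dot are made up here only to keep the two versions apart.

```cpp
#include <cassert>

// Stand-in types and helpers (assumption: cutil_math.h provides equivalents).
struct float3 { float x, y, z; };
struct float4 { float x, y, z, w; };

inline float3 make_float3(float x, float y, float z) { return float3{x, y, z}; }
inline float4 make_float4(float x, float y, float z, float w) { return float4{x, y, z, w}; }
inline float4 make_float4(const float3 &a, float w) { return float4{a.x, a.y, a.z, w}; }
inline float dot(const float4 &a, const float4 &b)
{
  return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// First version from the question: plain scalar arithmetic.
inline float3 matrix4x4Mulf3_scalar(const float *m, const float3 &a)
{
  float x = a.x * m[0] + a.y * m[1] + a.z * m[2] + m[3];
  float y = a.x * m[4] + a.y * m[5] + a.z * m[6] + m[7];
  float z = a.x * m[8] + a.y * m[9] + a.z * m[10] + m[11];
  return make_float3(x, y, z);
}

// Second version with the fix: return the computed x, y, z instead of s.
inline float3 matrix4x4Mulf3_dot(const float *m, const float3 &a)
{
  float4 s = make_float4(a, 1.0f);
  float x = dot(s, make_float4(m[0], m[1], m[2], m[3]));
  float y = dot(s, make_float4(m[4], m[5], m[6], m[7]));
  float z = dot(s, make_float4(m[8], m[9], m[10], m[11]));
  return make_float3(x, y, z);
}
```

Calling both functions with the same matrix and vector should then give matching components, which is a quick sanity check to run on the host before looking at register counts again.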

EDIT:
I really should sleep more… you’re right, I’m returning the wrong value.
With the return fixed, there is almost no difference in speed or register usage between the two implementations.

Since you do not use the x, y, z variables, the compiler optimizes the code and removes all the computation involving them. That is the reason for the register reduction.
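To illustrate the point about unused variables, here is a small host-only sketch of the buggy version. It assumes stand-in types and helpers in place of cutil_math.h, and the name matrix4x4Mulf3_buggy is made up for this example. Because the return value depends only on s, the function hands back the input vector unchanged no matter what the matrix contains, so a compiler is free to drop the three dot products.

```cpp
#include <cassert>

// Stand-in types and helpers (assumption: cutil_math.h provides equivalents).
struct float3 { float x, y, z; };
struct float4 { float x, y, z, w; };

inline float4 make_float4(float x, float y, float z, float w) { return float4{x, y, z, w}; }
inline float4 make_float4(const float3 &a, float w) { return float4{a.x, a.y, a.z, w}; }
inline float3 make_float3(const float4 &a) { return float3{a.x, a.y, a.z}; }
inline float dot(const float4 &a, const float4 &b)
{
  return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// Buggy version from the question: x, y, z are computed but never used.
inline float3 matrix4x4Mulf3_buggy(const float *m, const float3 &a)
{
  float4 s = make_float4(a, 1.0f);
  float x = dot(s, make_float4(m[0], m[1], m[2], m[3]));
  float y = dot(s, make_float4(m[4], m[5], m[6], m[7]));
  float z = dot(s, make_float4(m[8], m[9], m[10], m[11]));
  (void)x; (void)y; (void)z;   // only here to silence unused-variable warnings
  return make_float3(s);       // returns a.x, a.y, a.z; m has no effect
}
```

Feeding it two different matrices and getting the same output both times is an easy way to spot this kind of mistake.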