I have a problem that I can’t wrap my mind around. I’m using cutil_math.h in my project.
The following function works:
inline __device__ __host__
float3 matrix4x4Mulf3( const float* m, const float3 &a )
{
 float x = a.x * m[0] + a.y * m[1] + a.z * m[2] + m[3];
 float y = a.x * m[4] + a.y * m[5] + a.z * m[6] + m[7];
 float z = a.x * m[8] + a.y * m[9] + a.z * m[10] + m[11];
 return make_float3(x,y,z);
}
and another implementation that looks like this and doesn’t work:
inline __device__ __host__
float3 matrix4x4Mulf3( const float* m, const float3 &a )
{
 float4 s = make_float4(a,1);
 float x = dot( s, make_float4( m[0], m[1], m[2], m[3] ) );
 float y = dot( s, make_float4( m[4], m[5], m[6], m[7] ) );
 float z = dot( s, make_float4( m[8], m[9], m[10], m[11] ) );
 return make_float3( s );
}
It’s also interesting that the non-working version seems to require far fewer registers: the kernel’s register count drops from 31 to 10 with the second implementation. The results, however, are wrong.
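
To narrow this down, here is a minimal host-side sketch that runs both versions on the same input and compares the results. It is plain C++ rather than CUDA so it can be checked without a GPU. The float3/float4 structs and the make_float3, make_float4 and dot helpers are stand-ins that I assume behave like the ones in cutil_math.h. The names mulScalar, mulDot and maxAbsDiff are made up for this test, and the two function bodies are copied from the code above.

```cpp
#include <cmath>

// Host-only stand-ins for the CUDA vector types and the cutil_math.h helpers.
// Assumption: these mirror the real ones closely enough for this comparison.
struct float3 { float x, y, z; };
struct float4 { float x, y, z, w; };

inline float3 make_float3(float x, float y, float z) { return {x, y, z}; }
inline float3 make_float3(const float4 &a) { return {a.x, a.y, a.z}; }  // drops w
inline float4 make_float4(float x, float y, float z, float w) { return {x, y, z, w}; }
inline float4 make_float4(const float3 &a, float w) { return {a.x, a.y, a.z, w}; }

inline float dot(const float4 &a, const float4 &b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

// First version from the post (the one that works).
inline float3 mulScalar(const float *m, const float3 &a)
{
    float x = a.x * m[0] + a.y * m[1] + a.z * m[2]  + m[3];
    float y = a.x * m[4] + a.y * m[5] + a.z * m[6]  + m[7];
    float z = a.x * m[8] + a.y * m[9] + a.z * m[10] + m[11];
    return make_float3(x, y, z);
}

// Second version from the post (the one that gives wrong results),
// body copied as posted.
inline float3 mulDot(const float *m, const float3 &a)
{
    float4 s = make_float4(a, 1);
    float x = dot(s, make_float4(m[0], m[1], m[2],  m[3]));
    float y = dot(s, make_float4(m[4], m[5], m[6],  m[7]));
    float z = dot(s, make_float4(m[8], m[9], m[10], m[11]));
    return make_float3(s);
}

// Largest per-component absolute difference between two results.
inline float maxAbsDiff(const float3 &p, const float3 &q)
{
    return std::fmax(std::fabs(p.x - q.x),
                     std::fmax(std::fabs(p.y - q.y), std::fabs(p.z - q.z)));
}
```

With an identity rotation plus a translation of (10, 20, 30) and the input point (1, 2, 3), the first version should give (11, 22, 33). Comparing that against what the second version returns for the same input, and against the input point itself, might show where the two diverge.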