Hey,

I am benchmarking my algorithms and came across an optimizer problem.

The following algorithm calculates the square-sum of a vector, that is: sum(v[i]^2, for all i).

Since other parts of my program are more complex than this algorithm, I use an array of float4 as the data representation (I know it’s slower than plain f32). In my square-sum I use only the (x,y,z) components of each float4.
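For reference, this is the result I’m after: for each element, square and sum only x, y and z. A minimal host-side sketch (plain C++, with a stand-in struct of my own since CUDA’s float4 lives in the device headers):

```
#include <cstddef>

// Host-side stand-in for CUDA's float4 (same member layout).
struct float4h { float x, y, z, w; };

// Reference square-sum: accumulates x^2 + y^2 + z^2 over the whole
// array; the .w component is deliberately ignored.
float square_sum_ref(const float4h *v, std::size_t n)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += v[i].x * v[i].x + v[i].y * v[i].y + v[i].z * v[i].z;
    return sum;
}
```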

My code looks something like this:

// first a sub-function…

```
__device__ inline float
square_float4(const float4 &v)
{
    return v.x * v.x + v.y * v.y + v.z * v.z;
}
```

// …then the kernel

```
__global__ void
kernel_SquareSum(float4 *iVec, int vLength, int Iterations)
{
    // ...indexing stuff and shared memory definitions...
    int v_idx = threadIdx.x;
    float sum = 0.0f;
    while (Iterations > 0)
    {
        if (v_idx < vLength)
        {
            float4 f4 = iVec[v_idx];
            v_idx += blockDim.x;
            sum += square_float4(f4);
        }
        Iterations--;
    }
    // ...write sum to a shared memory block...
    // ...perform a sum reduction and store the result somewhere...
}
```

Since square_float4 doesn’t use the (.w) component of the float4, the compiler splits the vectorized global.v4 read into one global.v2 load (x,y) plus one global.f32 load (z). The result is that my algorithm breaks the coalescing rules.

If I rewrite my square_float4 to include the (.w) component:

```
__device__ inline float
square4_float4(const float4 &v)
{
    return v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w;
}
```

the compiler cannot optimize away the (.w) load, and the coalescing is preserved. However, this wastes one f32 multiply for each square4_float4 call.

Is there a good way to force the compiler not to perform this optimization?

And if not, how can I trick the compiler into believing that I’m using the (.w) value, without performing an arithmetic op on it?
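One idea I’ve been toying with (just a sketch of my own, untested — not something I found in the documentation) is to create a fake dependency on .w that costs no multiply: a branch that can never be taken for finite inputs, but that the compiler presumably cannot prove dead, so it has to keep the .w load alive:

```
__device__ inline float
square_float4_keepw(const float4 &v)
{
    float s = v.x * v.x + v.y * v.y + v.z * v.z;
    // s is a sum of squares, so s < 0.0f can never hold (it is also
    // false for NaN). If the compiler cannot prove that, it must keep
    // v.w live, and hopefully the full v4 load with it — no extra mul.
    if (s < 0.0f)
        s = v.w;
    return s;
}
```

I have no idea whether this survives the optimizer in practice, so I’d appreciate confirmation or a cleaner trick.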

ps. I have many similar situations in my code, therefore I would be glad to find an easy solution to it. ds