 # Is it possible to avoid optimization? Compiler breaks memory coalescing.

Hey,

I am benchmarking my algorithms and came across an optimizer problem.

The following algorithm is calculating the square-sum of a vector, that is: sum(v[i]^2, for all i).

Since other parts of my program are more complex than this algorithm, I use an array of float4 as the data representation (I know it’s slower than f32). In my square-sum I use only the (x,y,z)-values from each float4.

My code looks something like this:

// first a sub-function…

```
__device__ inline float
square_float4(float4 &v)
{
    return v.x*v.x + v.y*v.y + v.z*v.z;
}
```

// …then the kernel

```
__global__ void
kernel_SquareSum(float4* iVec, int vLength, int Iterations)
{
    // ..indexing stuff and shared memory definitions..

    float sum = 0.0f;

    while (Iterations > 0)
    {
        if (v_idx < vLength)
        {
            float4 f4 = iVec[v_idx];
            v_idx += blockDim.x;
            sum += square_float4(f4);
        }
        Iterations--;
    }

    // ..send sum to a shared memory block..
    // ..perform a sum reduction and save the result somewhere..
}
```
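For context, the two elided steps at the end are a standard shared-memory tree reduction. A sketch of how they might look (illustrative, not from my actual code; assumes `blockDim.x == BLOCK_SIZE` is a power of two and `oResult` is a hypothetical per-block output array):

```
// Illustrative shared-memory sum reduction (assumed names).
__shared__ float s_sum[BLOCK_SIZE];
s_sum[threadIdx.x] = sum;
__syncthreads();

// Halve the number of active threads each step.
for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
{
    if (threadIdx.x < stride)
        s_sum[threadIdx.x] += s_sum[threadIdx.x + stride];
    __syncthreads();
}

// Thread 0 writes this block's partial result.
if (threadIdx.x == 0)
    oResult[blockIdx.x] = s_sum[0];
```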

Since square_float4 doesn’t use the (.w)-value of the float4, the compiler splits the 128-bit global.v4 read into one 64-bit global.v2 read (x,y) plus one 32-bit global.f32 read (z). The result is that my algorithm breaks the coalescing rules.

If I rewrite my square_float4 to include the (.w)-value:

```
__device__ inline float
square4_float4(float4 &v)
{
    return v.x*v.x + v.y*v.y + v.z*v.z + v.w*v.w;
}
```

the compiler cannot optimize away the (.w)-value and hence does not break the coalescing. However, this means I waste one f32 multiply (and one add) per call to square4_float4.

Is there a good way to force the compiler to not perform such optimizations?

And if not, how can I trick the compiler to believe that I’m using the (.w)-value, without performing an arithmetic-op on it?

PS: I have many similar situations in my code, therefore I would be glad to find an easy solution to it. ds

Look at slide 25 from this presentation http://www.gpgpu.org/sc2007/SC07_CUDA_3_Libraries.pdf

You can use volatile to force the vector load
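Presumably something along these lines (a sketch; I’m assuming that qualifying the local copy as volatile is enough to stop the compiler from discarding the unused .w lane — I haven’t checked the generated PTX):

```
__device__ inline float
square_float4_volatile(const float4 &v)
{
    // volatile forces the full 128-bit value to be kept alive,
    // so the global.v4 load should not be split even though .w is unused.
    volatile float4 f = v;
    return f.x*f.x + f.y*f.y + f.z*f.z;
}
```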

Will do,

Did you also know that you can read float4’s full speed if you just bind a texture to the device memory and read with tex1Dfetch?
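For reference, the texture path looks roughly like this with the texture-reference API (the names here are my own, not from the thread):

```
// Hypothetical names; sketch of reading float4 through a texture.
texture<float4, 1, cudaReadModeElementType> texVec;  // file-scope texture reference

// Host side, once, after allocating the device array d_iVec:
//   cudaBindTexture(0, texVec, d_iVec, vLength * sizeof(float4));

// In the kernel, replace the direct global load
//   float4 f4 = iVec[v_idx];
// with a fetch through the texture unit:
//   float4 f4 = tex1Dfetch(texVec, v_idx);
```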

Thanks for the tip,

However, I try to reserve the texture cache for data that I cannot read with coalesced accesses. Hence, I avoid texture reads where they are not necessary, since I don’t want to pollute the texture cache.

But it is definitely worth investigating where/if it is possible to get a speed-up by using coalesced texture reads.

Does anybody know when/if the issue with reading float4 at full speed will be fixed?

I believe I read somewhere that the bug was closed in the database, so next release probably. Or maybe that was another bug, I’ll try to find it.

Normally, I would agree with you. The extra cost of using the texture unit means it should not be used when a coalesced read is possible. BUT, in the case of float4 (and I assume int4, never tested it) you can get significantly higher throughput (29.45 GiB/s vs 69.47 GiB/s) using the texture unit. It is worth it in this case.

See my performance tests of the various ways of reading device memory here. http://forums.nvidia.com/index.php?showtop…41&#entry290441

Right,

I will give it a shot.