Is it possible to avoid optimization? Compiler breaks memory coalescing.


I am benchmarking my algorithms and came across an optimizer problem.

The following algorithm is calculating the square-sum of a vector, that is: sum(v[i]^2, for all i).

Since other parts of my program is more complex than this algorithm, I use an array of float4 as data representation (I know it’s slower than f32). In my square-sum I use only the (x,y,z)-values from each float4.

My code looks something like this:

// first a sub-function…

__device__ inline float

square_float4(float4 &v)


  return v.x*v.x + v.y*v.y + v.z*v.z;


// …then the kernel

kernel_SquareSum(float4* iVec, int vLength, int Iterations)


..indexing stuff and shared memory definitions..

v_idx = threadIdx.x;

float sum = 0.0f; 

while (Iterations > 0)


  if (v_idx < vLength)


    float4 f4 = iVec[v_idx];

    v_idx += blockDim.x;

    sum += square_float4(f4);




..send sum to a shared memory block..

..perform a sum reduction and save the result somewhere..


Since the square_float4 doesn’t use the (.w)-value of the float4, the compiler breaks the global.v4 read into one global.v2 (x,y) + one global.f32 (z). The result is that my algorithm breaks the coalescing rules.

if I rewrite my square_float4 to include the (.w)-value:

square4_float4(float4 &v)


  return v.x*v.x + v.y*v.y + v.z*v.z + v.w*v.w;


the compiler cannot optimize away the (.w)-value and hence not break the coalescing. However, this means that I waste one f32-mul for each square4_float4.

Is there a good way to force the compiler to not perform such optimizations?

And if not, how can I trick the compiler to believe that I’m using the (.w)-value, without performing an arithmetic-op on it?

ps. I have many similar situations in my code, threrefore I would be glad to find an easy solution to it. ds

Look at slide 25 from this presentation

You can use volatile to force the vector load

Will do,

thanks for the quick reply!

Did you also know that you can read float4’s full speed if you just bind a texture to the device memory and read with tex1Dfetch?

Thanks for the tip,

however, I try to keep data that I cannot use coalesced reads on in the texture cache. Hence, I am trying to avoid texture reads when it is not necessary, since I don’t want to pollute the texture cache.

But it is definitive worth investigating where/if it is possible to get a speed-up by using coalesced texture reads.

Does anybody know when/if the issue with reading float4 at full speed will be fixed?

I believe reading somewhere that the bug was closed in the database, so next release probably. Or maybe that was another bug, I’ll try to find it.

Normally, I would agree with you. The extra costs of using the texture unit mean it should not be used when a coalesced read is possible. BUT, in the case of float4 (and I assume int4, never tested it) you can get significantly increased throughputs (29.45 GiB/s vs 69.47 GiB/s) using the texture unit. It is worth it in this case.

See my performance tests of the various ways of reading device memory here.…41&#entry290441


I will give it a shot.