Cannot coalesce global memory reads using builtin vector types

I'm using CUDA Toolkit 3.0 on an SM 1.1 GPU.

I’ve written simple kernels that read global memory into a local register. The profiler tells me int/float work fine, but float2, uchar4, int2, etc. don’t coalesce.


[codebox]__global__ void test( const float2* data )
{
    // each thread reads one 8-byte float2 from global memory
    float2 read = data[threadIdx.x];
}[/codebox]


The block size is 32x1 and the profiler reports 64 uncoalesced reads, which makes me think it’s doing two 4-byte reads with 8-byte alignment for every thread. If I force-cast it to longlong1, it reports 32 uncoalesced reads.

I have the same problem with char4: it won’t coalesce, but force-casting it to int fixes that.
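
A minimal sketch of the cast workaround described above, written for uchar4 (the same idea applies to char4). This is an illustration under assumptions, not the original test code: the kernel name `readUchar4AsWord`, the output buffer, and the host harness are all hypothetical. The result is stored so the compiler has a reason to keep the load.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: fetch each uchar4 as a single 32-bit word,
// then unpack the four bytes in registers.
__global__ void readUchar4AsWord(const uchar4* data, unsigned int* out)
{
    // uchar4 is 4 bytes with 4-byte alignment, so the buffer can be viewed as 32-bit words
    const unsigned int* words = reinterpret_cast<const unsigned int*>(data);
    unsigned int w = words[threadIdx.x];

    // the GPU is little-endian: .x is the lowest byte of the word
    unsigned int x = w & 0xFFu;
    unsigned int y = (w >> 8) & 0xFFu;
    unsigned int z = (w >> 16) & 0xFFu;
    unsigned int a = (w >> 24) & 0xFFu;

    // store something derived from the load so it is not optimized away
    out[threadIdx.x] = x + y + z + a;
}

int main()
{
    const int n = 32;                      // one 32x1 block, as in the post
    uchar4 host[n];
    for (int i = 0; i < n; ++i)
        host[i] = make_uchar4(i, i + 1, i + 2, i + 3);

    uchar4* in = 0;
    unsigned int* out = 0;
    cudaMalloc((void**)&in, sizeof(host));
    cudaMalloc((void**)&out, n * sizeof(unsigned int));
    cudaMemcpy(in, host, sizeof(host), cudaMemcpyHostToDevice);

    readUchar4AsWord<<<1, n>>>(in, out);

    unsigned int result[n];
    cudaMemcpy(result, out, sizeof(result), cudaMemcpyDeviceToHost);

    // each element should be the sum of its four bytes: 4*i + 6
    int ok = 1;
    for (int i = 0; i < n; ++i)
        if (result[i] != 4u * i + 6u) ok = 0;
    printf(ok ? "unpack ok\n" : "unpack mismatch\n");

    cudaFree(in);
    cudaFree(out);
    return ok ? 0 : 1;
}
```

The harness only checks that the unpacked values are correct; whether the read actually coalesces still has to be confirmed in the profiler.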

There have been a few threads posted on this board that show others having the same problem in older toolkits with 1.1 cards. But there hasn’t been any definitive answer that I have found regarding a workaround or a reason why this is happening.

This should coalesce for float2, assuming the compiler isn’t splitting the float2 load into two float loads. I would suggest trying this with the 3.1 toolkit and, if the result is still the same, filing a bug.
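
One way to check the load-splitting theory is to look at the generated PTX. The sketch below is a hypothetical test case (the kernel name `copyFloat2`, the file name in the comment, and the host harness are assumptions, not code from this thread). It uses both components of the float2 so the compiler has no reason to fetch only half of it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical test kernel: copy one float2 per thread.
// The store keeps the load alive and uses both .x and .y.
__global__ void copyFloat2(const float2* data, float2* out)
{
    float2 read = data[threadIdx.x];
    out[threadIdx.x] = read;
}

// To inspect the load, compile to PTX with a toolkit that still supports sm_11,
// for example (file name is an assumption):
//   nvcc -ptx -arch=sm_11 float2_test.cu
// A single vector load shows up as ld.global.v2.f32;
// a split load shows up as two separate ld.global.f32 instructions.

int main()
{
    const int n = 32;                      // one 32x1 block, as in the post
    float2 host[n];
    for (int i = 0; i < n; ++i)
        host[i] = make_float2(float(i), float(-i));

    float2* in = 0;
    float2* out = 0;
    cudaMalloc((void**)&in, sizeof(host));
    cudaMalloc((void**)&out, sizeof(host));
    cudaMemcpy(in, host, sizeof(host), cudaMemcpyHostToDevice);

    copyFloat2<<<1, n>>>(in, out);

    float2 result[n];
    cudaMemcpy(result, out, sizeof(result), cudaMemcpyDeviceToHost);

    // verify the copy so the harness reports something meaningful
    int ok = 1;
    for (int i = 0; i < n; ++i)
        if (result[i].x != host[i].x || result[i].y != host[i].y) ok = 0;
    printf(ok ? "copy ok\n" : "copy mismatch\n");

    cudaFree(in);
    cudaFree(out);
    return ok ? 0 : 1;
}
```

If the PTX already shows a vector load and the profiler still counts the reads as uncoalesced, that points away from the compiler and would be worth including in the bug report.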