I'm using Toolkit 3.0 on an SM 1.1 GPU.
I’ve written simple kernels that read global memory into a local register. The profiler tells me that int and float reads work fine, but float2, uchar4, int2, etc. don’t coalesce.
For example:
[codebox]__global__ void test( const float2* data )
{
    // Each thread reads one 8-byte float2 element
    float2 read = data[threadIdx.x];
}[/codebox]
The block size is 32x1 and the profiler reports 64 uncoalesced reads, which makes me think it’s doing two 4-byte reads (with 8-byte alignment) for every thread. If I force-cast the pointer to longlong1, it reports 32 uncoalesced reads.
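To make the cast concrete, here is a minimal sketch of what I mean. The kernel name testCast and the out parameter are hypothetical, added only for illustration; the write to out is there so the compiler has a reason to keep the load.

[codebox]// Sketch only: testCast and out are illustrative names, not from my real code.
__global__ void testCast( const float2* data, float2* out )
{
    // View the float2 array as 8-byte longlong1 elements so each thread
    // asks for a single 64-bit value.
    longlong1 raw = ( (const longlong1*)data )[threadIdx.x];

    // Split the 64 bits back into two floats (x is the low 32 bits on
    // a little-endian layout).
    float2 read;
    read.x = __int_as_float( (int)( raw.x & 0xffffffff ) );
    read.y = __int_as_float( (int)( raw.x >> 32 ) );

    // Store the result so the load is not optimized away.
    out[threadIdx.x] = read;
}[/codebox]

I launch it the same way as the original, e.g. testCast<<<1, 32>>>( d_in, d_out ), where d_in and d_out are assumed to be device buffers of at least 32 float2 elements.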
I have the same problem with char4: it won’t coalesce, but force-casting the pointer to int fixes that.
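A sketch of the 4-byte version, shown with uchar4 (the same idea applies to char4). Again, testChar4 and out are hypothetical names used only for this example.

[codebox]// Sketch only: testChar4 and out are illustrative names.
__global__ void testChar4( const uchar4* data, uchar4* out )
{
    // Load the four packed bytes as one 32-bit word.
    unsigned int raw = ( (const unsigned int*)data )[threadIdx.x];

    // Unpack by shifting (x is the lowest byte on a little-endian layout).
    uchar4 read = make_uchar4( (unsigned char)( raw & 0xff ),
                               (unsigned char)( ( raw >> 8 ) & 0xff ),
                               (unsigned char)( ( raw >> 16 ) & 0xff ),
                               (unsigned char)( raw >> 24 ) );

    // Store the result so the load is not optimized away.
    out[threadIdx.x] = read;
}[/codebox]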
A few threads on this board show others having the same problem with older toolkits on 1.1 cards, but I haven’t found a definitive answer on either a workaround or the reason this happens.