Cannot coalesce global memory reads using builtin vector types

Using Toolkit 3.0 on a SM1.1 GPU.

I’ve written simple kernels to read global mem into a local register and the profiler tells me int/float work fine, but float2, uchar4, int2 etc don’t coalesce.


[codebox]global void test( const float2* data )


float2 read = data[threadIdx.x];


Block size is 32x1 and the profiler reports there are 64 uncoalesced reads. Which makes me think it’s doing 2x4byte reads with 8byte alignment for every thread. If I force cast it to longlong1 then it reports 32 uncoalsced reads.

I have the same problem with char4. It won’t coalesce, but force casting it to integer fixes that.

There have been a few threads posted on this board that show others having the same problem in older toolkits with 1.1 cards. But there hasn’t been any definitive answer that I have found regarding a workaround or a reason why this is happening.

For coalescing on CC 1.1, the data must be aligned to multiples of 16 * sizeof(float2), which is 16*8=128 bytes. You should see

128 byte memory transactions indicated in Visual Profiler (which also happens to be the fastest transaction).

Usually memory blocks allocated with cudaMalloc fulfill this requirement (I think this will be multiples of 256 bytes, not entirely sure).

If you were using a local variable (e.g int index = threadIdx.x), take care not to declare this variable as volatile, as it would break coalescing. Using the volatile keyword can help to get register count down - but used in vector loads it breaks coalescing.

If these tips don’t help, have you tried downgrading to CUDA 2.3 ?

I tried the memory allocation with cudaMallocPitch but it made no difference. ( possibly because I have blockDim.y = 1 )

Variable is not declared as volatile. Cuda 2.3 gives me the same issue, reading a float2 with 32 threads still reports 64 uncoalesced reads.

added PTX output

[codebox] .entry _Z12coalescetestPK6float2 (

	.param .u32 __cudaparm__Z12coalescetestPK6float2_data)


.reg .u32 %r<10>;

.reg .f32 %f<4>;

.local .align 8 .b8 __cuda_read_0[8];

.loc	18	15	0



.loc	18	17	0

ld.param.u32 	%r1, [__cudaparm__Z12coalescetestPK6float2_data];

cvt.u32.u16 	%r2, %tid.x;

mul.lo.u32 	%r3, %r2, 8;

add.u32 	%r4, %r1, %r3; 	%f1, [%r4+0];

st.local.f32 	[__cuda_read_0+0], %f1;

ld.param.u32 	%r5, [__cudaparm__Z12coalescetestPK6float2_data];

cvt.u32.u16 	%r6, %tid.x;

mul.lo.u32 	%r7, %r6, 8;

add.u32 	%r8, %r5, %r7; 	%f2, [%r8+4];

st.local.f32 	[__cuda_read_0+4], %f2;


.loc	18	18	0



} // _Z12coalescetestPK6float2[/codebox]

try dropping the const keyword !?

Yep tried that too. I’ve found that I can type cast it to longlong1 for native 8byte and that seems to work. My first attempt at doing this had a bad offset so it was uncoalesced, tried with zero offset and it works.

Try declaring your pointer as restricted (eg, float2 restrict *x). Your PTX output has some nonsensical local memory usage in it that reminds me of lmem usage I saw when dealing with float4s in shared memory - using a restricted pointer solved it. There are some alias analysis bugs in nvcc; I wonder if this is another.

No change :(