Cannot coalesce global memory reads using builtin vector types

war_head · July 10, 2010, 2:59pm

Using Toolkit 3.0 on an SM 1.1 GPU.

I’ve written simple kernels to read global memory into a local register, and the profiler tells me int/float work fine, but float2, uchar4, int2, etc. don’t coalesce.

For example:

[codebox]__global__ void test( const float2* data )
{
    float2 read = data[threadIdx.x];
}[/codebox]

Block size is 32x1 and the profiler reports 64 uncoalesced reads, which makes me think it’s doing two 4-byte reads with 8-byte alignment for every thread. If I force-cast it to longlong1, it reports 32 uncoalesced reads.

I have the same problem with char4. It won’t coalesce, but force casting it to integer fixes that.

There have been a few threads posted on this board that show others having the same problem in older toolkits with 1.1 cards. But there hasn’t been any definitive answer that I have found regarding a workaround or a reason why this is happening.
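A minimal sketch of the cast workaround mentioned above, assuming the buffer is at least 8-byte aligned. The names `unpackFloat2` and `readAsInt64` and the `out` buffer are hypothetical and not from my real code; the output write is only there so the load has a visible result. Whether it actually coalesces on a given SM 1.1 card still has to be checked in the profiler.

```cuda
#include <cuda_runtime.h>
#include <string.h>

// Hypothetical helper: reinterpret one 64-bit word as a float2.
// This is a bit-for-bit copy, not a numeric conversion.
__host__ __device__ inline float2 unpackFloat2(long long raw)
{
    float2 v;
    memcpy(&v, &raw, sizeof(v));
    return v;
}

// Hypothetical kernel: written so each thread can issue one 8-byte load
// rather than two 4-byte loads.
// Assumes `data` is at least 8-byte aligned and that both buffers hold
// at least blockDim.x elements.
__global__ void readAsInt64(const float2* data, float2* out)
{
    const unsigned int i = threadIdx.x;
    const long long raw = reinterpret_cast<const long long*>(data)[i];
    out[i] = unpackFloat2(raw);  // the store keeps the load from being dropped
}
```

Launched as `readAsInt64<<<1, 32>>>(d_in, d_out)` this matches the 32x1 block from the test above. The same idea would apply to char4 by reading through a `const int*` instead.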

cbuchner1 · July 10, 2010, 6:11pm

For coalescing on CC 1.1, the data must be aligned to multiples of 16 * sizeof(float2), which is 16*8=128 bytes. You should see 128-byte memory transactions indicated in Visual Profiler (which also happens to be the fastest transaction).

Usually memory blocks allocated with cudaMalloc fulfill this requirement (I think this will be multiples of 256 bytes, not entirely sure).
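If in doubt, the alignment can be checked on the host. This is only a sketch: the helper names `isAligned` and `cudaMallocIs128ByteAligned` are made up for illustration, and the 128-byte boundary is the 16 * sizeof(float2) figure from above.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p sits on a multiple of `boundary` bytes.
bool isAligned(const void* p, std::size_t boundary)
{
    return reinterpret_cast<std::uintptr_t>(p) % boundary == 0;
}

// Hypothetical check: allocate a small float2 buffer and test its base address
// against the 16 * sizeof(float2) = 128 byte boundary.
// Returns false if the allocation fails or the pointer is not aligned.
bool cudaMallocIs128ByteAligned()
{
    float2* d = 0;
    if (cudaMalloc(&d, 32 * sizeof(float2)) != cudaSuccess) {
        return false;
    }
    const bool ok = isAligned(d, 16 * sizeof(float2));
    cudaFree(d);
    return ok;
}
```

Note that passing `d + 1` to a kernel moves the base address by 8 bytes, so an offset pointer can fail this check even when the allocation itself passes.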

If you are using a local variable as the index (e.g. int index = threadIdx.x), take care not to declare it as volatile, as that would break coalescing. The volatile keyword can help get the register count down, but when used in vector loads it breaks coalescing.

If these tips don’t help, have you tried downgrading to CUDA 2.3?

war_head · July 12, 2010, 8:14am

I tried allocating the memory with cudaMallocPitch but it made no difference (possibly because I have blockDim.y = 1).

The variable is not declared as volatile. CUDA 2.3 gives me the same issue: reading a float2 with 32 threads still reports 64 uncoalesced reads.

Added PTX output:

[codebox] .entry _Z12coalescetestPK6float2 (
	.param .u32 __cudaparm__Z12coalescetestPK6float2_data)
{
	.reg .u32 %r<10>;
	.reg .f32 %f<4>;
	.local .align 8 .b8 __cuda_read_0[8];
	.loc	18	15	0
$LBB1__Z12coalescetestPK6float2:
$Lt_0_258:
	.loc	18	17	0
	ld.param.u32 	%r1, [__cudaparm__Z12coalescetestPK6float2_data];
	cvt.u32.u16 	%r2, %tid.x;
	mul.lo.u32 	%r3, %r2, 8;
	add.u32 	%r4, %r1, %r3;
	ld.global.f32 	%f1, [%r4+0];
	st.local.f32 	[__cuda_read_0+0], %f1;
	ld.param.u32 	%r5, [__cudaparm__Z12coalescetestPK6float2_data];
	cvt.u32.u16 	%r6, %tid.x;
	mul.lo.u32 	%r7, %r6, 8;
	add.u32 	%r8, %r5, %r7;
	ld.global.f32 	%f2, [%r8+4];
	st.local.f32 	[__cuda_read_0+4], %f2;
$Lt_0_514:
	.loc	18	18	0
	exit;
$LDWend__Z12coalescetestPK6float2:
} // _Z12coalescetestPK6float2[/codebox]
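The PTX above contains two separate ld.global.f32 instructions (at offsets +0 and +4) and a .local copy of the value, rather than one 8-byte vector load. One possible reading, which is an assumption and not something confirmed here, is that this comes from an unoptimized or debug build, or from the loaded value never being used. A sketch of a variant to compare against, where the value is stored so the load stays live; `copyFloat2` and `copyMatches` are hypothetical names:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical test kernel: the store gives the load an observable effect.
__global__ void copyFloat2(const float2* in, float2* out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        const float2 v = in[i];  // candidate for a single 8-byte vector load
        out[i] = v;
    }
}

// Hypothetical host driver: runs the kernel on n > 0 elements and reports
// whether the output equals the input. Returns false on any CUDA error.
bool copyMatches(int n)
{
    std::vector<float2> hIn(n), hOut(n);
    for (int i = 0; i < n; ++i) {
        hIn[i] = make_float2(float(i), -float(i));
    }

    const size_t bytes = n * sizeof(float2);
    float2* dIn = 0;
    float2* dOut = 0;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }

    cudaMemcpy(dIn, &hIn[0], bytes, cudaMemcpyHostToDevice);
    const int threads = 32;  // 32x1 block, as in the original test
    copyFloat2<<<(n + threads - 1) / threads, threads>>>(dIn, dOut, n);
    const cudaError_t err = cudaMemcpy(&hOut[0], dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i) {
        if (hOut[i].x != hIn[i].x || hOut[i].y != hIn[i].y) return false;
    }
    return true;
}
```

Compiling this with `nvcc -ptx` and looking for `ld.global.v2.f32` versus a pair of `ld.global.f32` lines shows which form the compiler picked for the read.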

cbuchner1 · July 12, 2010, 1:11pm

Try dropping the const keyword?

war_head · July 12, 2010, 1:16pm

Yep, tried that too. I’ve found that I can typecast it to longlong1 for a native 8-byte read, and that seems to work. My first attempt at doing this had a bad offset, so it was uncoalesced; I tried again with zero offset and it works.

ihaque · July 13, 2010, 5:26pm

Try declaring your pointer as restricted (e.g. float2 * __restrict__ x). Your PTX output has some nonsensical local memory usage in it that reminds me of the lmem usage I saw when dealing with float4s in shared memory; using a restricted pointer solved it. There are some alias analysis bugs in nvcc; I wonder if this is another.

war_head · July 14, 2010, 7:54am

No change :(
