The gbl32/64/128 memory coalescing mentioned in the optimization PDF docs… are those bits or bytes?
For instance, which of these gives more bandwidth and is better?
This:

__kernel void myKernel32Bits ( __global float *outData ) /* coalescing = 32 bits */
{
    outData[get_global_id(0)] = 1.0f;
}

__kernel void myKernel64Bits ( __global float2 *outData ) /* coalescing = 64 bits */
{
    outData[get_global_id(0)] = (float2)(1.0f, 2.0f);
}

__kernel void myKernel128Bits ( __global float4 *outData ) /* coalescing = 128 bits */
{
    outData[get_global_id(0)] = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
}

(Is the memory well aligned/coalesced here?)
Or is this better?
__kernel void myKernel32Bytes ( __global float8 *outData ) /* coalescing = 32 bytes, note BYTES not BITS */
{
    outData[get_global_id(0)] = (float8)(1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f);
}
__kernel void myKernel64Bytes ( __global float16 *outData ) /* coalescing = 64 bytes, note BYTES not BITS */
{
    outData[get_global_id(0)] = (float16)(127.0f);
}
As far as I know, you need to use 32- or 64-bit reads/writes per work-item for best performance; the two should perform about the same.
Basically, the hardware always reads/writes memory in big chunks (about 64-128 bytes), while a single work-item can only access a few bytes at a time (probably 8 bytes, for a long or double). The idea behind coalescing is to combine the concurrent reads/writes of several work-items into one big read/write of such a 64-128 byte chunk.
That means accessing big vectors within a single work-item is (usually) not a good idea if you’re using __global memory.
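For illustration, here is a minimal sketch of the access pattern coalescing rewards versus one it punishes (the kernel names and the stride parameter are hypothetical, only meant to contrast the two patterns):

/* Coalesced: consecutive work-items touch consecutive floats, so the
   hardware can merge the accesses of a warp/wavefront into one
   64-128 byte transaction. */
__kernel void writeCoalesced ( __global float *outData )
{
    outData[get_global_id(0)] = 1.0f;
}

/* Strided: neighbouring work-items touch addresses far apart, so the
   accesses cannot be merged and each one needs its own transaction. */
__kernel void writeStrided ( __global float *outData, const int stride )
{
    outData[get_global_id(0) * stride] = 1.0f;
}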
OK, one last question: what do you think would be better in theory, reading a struct with 2x float4, one float8, or two separate float4 loads? Examples:
typedef struct tData
{
    float4 a, b;
} Data;

__kernel void Kernel1 ( const __global Data *data )
{
    const Data d = data[get_global_id(0)];
    /* ... use d.a and d.b ... */
}

__kernel void Kernel2 ( const __global float8 *data )
{
    const float8 d = data[get_global_id(0)];
    /* ... use d.lo and d.hi ... */
}

__kernel void Kernel3 ( const __global float4 *dataA, const __global float4 *dataB )
{
    const float4 a = dataA[get_global_id(0)];
    const float4 b = dataB[get_global_id(0)];
    /* ... use a and b ... */
}
I use 128 threads/block; not sure if it matters.
I do not know how they perform now, or how they may be improved in the future, but there are built-in vector data load and store functions (vloadN and vstoreN). It seems to me that a vendor could optimize such operations if they wanted to. They are documented in section 6.11.7 of the OpenCL specification.
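As a hedged sketch of what that could look like for the 2x float4 case above (the kernel name is made up; vload4 itself is the built-in, taking an element offset and a pointer to scalars):

/* Sketch only: reads the same 2x float4 per work-item as the kernels
   above, but from a flat float buffer via the built-in vector loads.
   vload4(offset, p) reads 4 floats starting at p + offset * 4. */
__kernel void KernelVload ( const __global float *data )
{
    const size_t i = get_global_id(0);
    const float4 a = vload4(2 * i,     data); /* first  float4 */
    const float4 b = vload4(2 * i + 1, data); /* second float4 */
    /* ... use a and b ... */
}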