gbl32/64/128 coalescing doubt

The gbl32/64/128 memory coalescing mentioned in the optimization PDF docs… are those bits or bytes?

For instance, which gives more bandwidth and is better?

This:

__kernel void myKernel32Bits ( __global float *outData ) /* coalescing = 32 bits */
{
   outData[get_global_id(0)] = 1.0f;
}

__kernel void myKernel64Bits ( __global float2 *outData ) /* coalescing = 64 bits */
{
   outData[get_global_id(0)] = (float2)(1.0f, 2.0f);
}

__kernel void myKernel128Bits ( __global float4 *outData ) /* coalescing = 128 bits */
{
   outData[get_global_id(0)] = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
}

(Is the memory well aligned/coalesced here?)

Or is this better?

__kernel void myKernel32Bytes ( __global float8 *outData ) /* coalescing = 32 bytes, note BYTES not BITS */
{
   outData[get_global_id(0)] = (float8)(1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f);
}

__kernel void myKernel64Bytes ( __global float16 *outData ) /* coalescing = 64 bytes (float16 = 16 x 4 bytes), note BYTES not BITS */
{
   outData[get_global_id(0)] = (float16)(127.0f);
}

As far as I know you need to use 32- or 64-bit reads/writes for best performance. Both should perform about the same.

Basically, the hardware always reads/writes memory in big chunks (about 64-128 bytes), but a single work-item can only access a few bytes (probably 8 bytes, for a long or double) at a time. The idea behind coalescing is to combine concurrent reads/writes of several work-items into one big read/write of such a 64-128 byte chunk.

That means accessing big vectors within a single work-item is (usually) not a good idea if you’re using __global memory.
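
To make that concrete, here is a sketch of my own (not from the original answer; the STRIDE value is made up purely for illustration). In the first kernel, consecutive work-items touch consecutive floats, so the hardware can merge a group of their accesses into one 64-128 byte transaction. In the second, each work-item walks its own private block, so neighbouring work-items are far apart in memory and the accesses cannot be combined:

__kernel void coalescedCopy ( __global const float *in, __global float *out )
{
   /* Work-item i touches element i: consecutive work-items hit
      consecutive 4-byte addresses, which the hardware can merge
      into a single wide transaction. */
   size_t gid = get_global_id(0);
   out[gid] = in[gid];
}

#define STRIDE 16 /* hypothetical block size, illustration only */

__kernel void stridedCopy ( __global const float *in, __global float *out )
{
   /* Work-item i processes its own block of STRIDE elements, so
      neighbouring work-items access addresses STRIDE * 4 = 64 bytes
      apart and each access tends to need its own transaction. */
   size_t base = get_global_id(0) * (size_t)STRIDE;
   for (int i = 0; i < STRIDE; ++i)
      out[base + i] = in[base + i];
}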

OK, a last question: what do you think would be better in theory? Reading a structure with two float4s, one float8, or two separate float4 loads? Examples:

typedef struct tData
{
	float4 a, b;
} Data;

__kernel void Kernel1 ( const __global Data *data )
{
	const Data d = data[get_global_id(0)];
	/* ... use d.a and d.b ... */
}

__kernel void Kernel2 ( const __global float8 *data )
{
	const float8 d = data[get_global_id(0)];
	/* ... use d.lo and d.hi ... */
}

__kernel void Kernel3 ( const __global float4 *dataA, const __global float4 *dataB )
{
	const float4 a = dataA[get_global_id(0)];
	const float4 b = dataB[get_global_id(0)];
	/* ... use a and b ... */
}

I use 128 threads/block (work-items per work-group), not sure if it matters.
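
For comparison, here is a sketch of my own (Kernel4 is not from the question): it keeps Kernel1's interleaved Data layout but issues the two 128-bit loads explicitly through a float4 pointer, which makes visible that consecutive work-items then load float4s that are 32 bytes apart, unlike Kernel3, where each of the two streams is read contiguously:

__kernel void Kernel4 ( const __global float4 *data ) /* same memory layout as Kernel1 */
{
	const size_t gid = get_global_id(0);
	/* Element 2*gid is d.a and element 2*gid+1 is d.b: the loads of
	   consecutive work-items are strided by 32 bytes, whereas in
	   Kernel3 each buffer is accessed contiguously. */
	const float4 a = data[2 * gid];
	const float4 b = data[2 * gid + 1];
	/* ... use a and b ... */
}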

I do not know how they perform now, or how they might be improved in the future, but there are built-in vector data load and store functions (vloadN and vstoreN). It seems to me that a vendor could optimize such operations if they wanted to. They are documented in section 6.11.7 of the OpenCL specification.
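
For example, a minimal sketch of what such a call looks like (the same 128-bit store as myKernel128Bits above, just expressed through the built-in):

__kernel void myKernelVstore4 ( __global float *outData )
{
	/* vstore4 takes an element offset into a scalar pointer; it also
	   works when outData is only float-aligned, unlike the float4
	   pointer in myKernel128Bits, which must be 16-byte aligned. */
	vstore4((float4)(1.0f, 2.0f, 3.0f, 4.0f), get_global_id(0), outData);
}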
