The gbl32/64/128 memory coalescing mentioned in the optimization PDF docs… are those bits or bytes?
For instance, which of these gives more bandwidth and is better?
This:

__kernel void myKernel32Bits ( __global float *outData ) /* coalescing = 32 bits */
{
    outData[get_global_id(0)] = 1.0f;
}

__kernel void myKernel64Bits ( __global float2 *outData ) /* coalescing = 64 bits */
{
    outData[get_global_id(0)] = (float2)(1.0f, 2.0f);
}

__kernel void myKernel128Bits ( __global float4 *outData ) /* coalescing = 128 bits */
{
    outData[get_global_id(0)] = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
}

(Is the memory well aligned/coalesced here?)
Or is this better?
__kernel void myKernel32Bytes ( __global float8 *outData ) /* coalescing = 32 bytes, note BYTES not BITS */
{
    outData[get_global_id(0)] = (float8)(1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f);
}
__kernel void myKernel64Bytes ( __global float16 *outData ) /* coalescing = 64 bytes, note BYTES not BITS */
{
    outData[get_global_id(0)] = (float16)(127.0f);
}
As far as I know, you need to use 32- or 64-bit reads/writes per work-item for best performance; the two should perform about the same.
Basically, the hardware always reads/writes memory in big chunks (about 64-128 bytes), while a single work-item can only access a few bytes at a time (probably 8 bytes, for a long or double). The idea behind coalescing is to combine the concurrent reads/writes of several work-items into one big read/write of such a 64-128 byte chunk.
That means accessing big vectors within a single work-item is (usually) not a good idea if you’re using __global memory.
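For illustration, here is a minimal sketch of the access pattern coalescing rewards versus one it punishes (the kernel names and the stride parameter are hypothetical, only meant to contrast the two patterns):

/* Coalesced: consecutive work-items touch consecutive floats, so the
   hardware can merge the accesses of a warp/wavefront into one
   64-128 byte transaction. */
__kernel void writeCoalesced ( __global float *outData )
{
    outData[get_global_id(0)] = 1.0f;
}

/* Strided: neighbouring work-items touch addresses far apart, so the
   accesses cannot be merged and each one needs its own transaction. */
__kernel void writeStrided ( __global float *outData, const int stride )
{
    outData[get_global_id(0) * stride] = 1.0f;
}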
OK, one last question: what do you think would be better in theory, reading a struct with 2x float4, one float8, or two separate float4 loads? Examples:
typedef struct tData
{
    float4 a, b;
} Data;

__kernel void Kernel1 ( const __global Data *data )
{
    const Data d = data[get_global_id(0)];
    /* ... use d.a and d.b ... */
}

__kernel void Kernel2 ( const __global float8 *data )
{
    const float8 d = data[get_global_id(0)];
    /* ... use d.lo and d.hi ... */
}

__kernel void Kernel3 ( const __global float4 *dataA, const __global float4 *dataB )
{
    const float4 a = dataA[get_global_id(0)];
    const float4 b = dataB[get_global_id(0)];
    /* ... use a and b ... */
}
I use 128 threads/block; not sure if it matters.
I do not know how they perform now, or how they may be improved in the future, but there are built-in vector data load and store functions (vloadN and vstoreN). It seems to me that a vendor could optimize such operations if they wanted to. They are documented in section 6.11.7 of the OpenCL specification.
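As a hedged sketch of what that could look like for the 2x float4 case above (the kernel name is made up; vload4 itself is the built-in, taking an element offset and a pointer to scalars):

/* Sketch only: reads the same 2x float4 per work-item as the kernels
   above, but from a flat float buffer via the built-in vector loads.
   vload4(offset, p) reads 4 floats starting at p + offset * 4. */
__kernel void KernelVload ( const __global float *data )
{
    const size_t i = get_global_id(0);
    const float4 a = vload4(2 * i,     data); /* first  float4 */
    const float4 b = vload4(2 * i + 1, data); /* second float4 */
    /* ... use a and b ... */
}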