One can explicitly transfer data between global memory and local/shared memory, but doing so sometimes makes the code messy. I wonder whether async_work_group_copy() is implemented efficiently enough to avoid this problem.
I tried the following code:
[codebox]__kernel __attribute__((reqd_work_group_size(32, 1, 1)))
void shuffle(__global int* data)
{
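    __local int x[32];
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    __local size_t oid;  /* global index of this work-group's first element */
    if (lid == 0)
        oid = gid;
    barrier(CLK_LOCAL_MEM_FENCE);
    /* Rotate the group's 32 values by one position through local memory. */
    x[(lid+1)&31] = data[gid];
    /* Every write to x must be visible before any work-item reads it. */
    barrier(CLK_LOCAL_MEM_FENCE);
    data[gid] = x[lid];
    /* async_work_group_copy(&data[oid], */
    /*                       x, */
    /*                       32, */
    /*                       0); */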
}
[/codebox]
It works unless I uncomment the async_work_group_copy() call. Does this mean that async_work_group_copy() is not implemented?
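For comparison, here is a minimal sketch of the usage pattern as I read it in the OpenCL 1.0 specification: every work-item in the group calls async_work_group_copy() with identical arguments, the returned event is kept, and wait_group_events() is called before the kernel ends. The kernel name rotate_copy and the variable base are made up for illustration, and computing base from get_group_id() assumes the NDRange has no global offset. I have not confirmed that this behaves any differently on my implementation.
[codebox]__kernel __attribute__((reqd_work_group_size(32, 1, 1)))
void rotate_copy(__global int* data)  /* hypothetical kernel name */
{
    __local int x[32];
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    /* First element of this group's slice; the same value in every
       work-item of the group (assumes no global offset). */
    size_t base = get_group_id(0) * get_local_size(0);

    /* Rotate the group's 32 values by one position through local memory. */
    x[(lid + 1) & 31] = data[gid];
    /* All of x must be written before the group-wide copy reads it. */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Copy local -> global; 0 means no earlier event is being reused. */
    event_t ev = async_work_group_copy(&data[base], x, 32, 0);
    /* The copy is only guaranteed to be complete after this wait. */
    wait_group_events(1, &ev);
}
[/codebox]
Unlike my code above, this version has a barrier before the copy, waits on the returned event, and does not also write data[gid] directly.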