One can explicitly transfer data between global memory and local/shared memory, but doing so can make the code messy. I wonder whether async_work_group_copy() is implemented efficiently enough to avoid this problem.

I tried the following code:

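[codebox]__kernel __attribute__((reqd_work_group_size(32, 1, 1)))
void shuffle(__global int* data)
{
    __local int x[32];
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    __local size_t oid;

    /* work-item 0 records the first global index of this work-group */
    if (lid == 0) {
        oid = gid;
    }

    /* rotate the 32 values by one position through local memory */
    x[(lid+1)&31] = data[gid];
    barrier(CLK_LOCAL_MEM_FENCE); /* make x and oid visible to the whole group */
    data[gid] = x[lid];

    /* async_work_group_copy(&data[oid], */
    /*                       x, */
    /*                       32, */
    /*                       0); */
}[/codebox]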

It works unless I uncomment the async_work_group_copy() part. Does it mean that async_work_group_copy() is not implemented?
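
For reference, here is a minimal sketch of how I understand the OpenCL specification intends the call to be used: a barrier before the copy so that every local write has landed, the same arguments on every work-item, and wait_group_events() on the returned event. The kernel name shuffle_async and the variable base are hypothetical names chosen for illustration, and I have not confirmed that this behaves any differently on my implementation.

```c
/* Hypothetical variant of the kernel above: the rotated values are written
   back with a single work-group-wide copy instead of per-work-item stores. */
__kernel __attribute__((reqd_work_group_size(32, 1, 1)))
void shuffle_async(__global int* data)
{
    __local int x[32];
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    /* First global index handled by this work-group; the value is
       identical on all 32 work-items, so no __local variable is needed. */
    size_t base = gid - lid;

    x[(lid + 1) & 31] = data[gid];

    /* All local writes must be complete before the copy reads x. */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Every work-item in the group must reach this call with the same
       arguments; 0 means no existing event is being shared. */
    event_t ev = async_work_group_copy(&data[base], x, 32, 0);

    /* Block until the copy has finished before the kernel returns. */
    wait_group_events(1, &ev);
}
```

Using gid - lid for the destination offset avoids reading a __local variable that only work-item 0 has written, which would itself need a barrier before the copy.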