efficient global to local memory transfer

I’ve got a kernel running with a workspace of 8x8 work items per work group. Each work item moves a 16 byte data structure into local memory before running a computation. I would prefer that the local data structure is an array of 16 uchars, which is why the code below has the long array assignment.

I am wondering if anyone knows what the most efficient way to move 16 bytes from global to local memory (for each work item)?

The code I’m currently using looks like:

uchar16 buff = global_array[index];

 uchar llid = (uchar)(get_local_id(0) + get_local_size(0)*get_local_id(1));

 int rIndex = llid*16;

 local_array[rIndex+0]   = tbuff.s0; local_array[rIndex+1]   = tbuff.s1; local_array[rIndex+2]   = tbuff.s2; local_array[rIndex+3]   = tbuff.s3; 

 local_array[rIndex+4]   = tbuff.s4; local_array[rIndex+5]   = tbuff.s5; local_array[rIndex+6]   = tbuff.s6; local_array[rIndex+7]   = tbuff.s7; 

 local_array[rIndex+8]   = tbuff.s8; local_array[rIndex+9]   = tbuff.s9; local_array[rIndex+10] = tbuff.sa; local_array[rIndex+11] = tbuff.sb; 

 local_array[rIndex+12] = tbuff.sc; local_array[rIndex+13] = tbuff.sd; local_array[rIndex+14] = tbuff.se; local_array[rIndex+15] = tbuff.sf;

where index is some int calculated based on the global id of the work item.

Thanks for taking a look at it.

Have you found that declaring
local uchar16 local_array[MY_WORKGROUP_SZ];
or arg passing
local uchar16 *local_array

both fail? If one of these works then the below syntax will at least be more compact:
local_array[ … ] = buff;

Gotta say that going from registers to local memory is probably pretty fast no matter how it is done. Important to have some type of sync (barrier or mem_fence) before proceeding after the assignment.

Have you found that declaring
local uchar16 local_array[MY_WORKGROUP_SZ];
or arg passing
local uchar16 *local_array

both fail? If one of these works then the below syntax will at least be more compact:
local_array[ … ] = buff;

Gotta say that going from registers to local memory is probably pretty fast no matter how it is done. Important to have some type of sync (barrier or mem_fence) before proceeding after the assignment.

Looking at your reply once again (after replying) I see buff is just an intermediary. Going to local from global can be done 3 ways:

  • global to private to local

  • global directly to local

  • event_t evt = async_work_group_copy(local_array, &global_array[…], (size_t) nbytes, (event_t) 0);

    wait_group_events(1, &evt);

You show a buff & a tbuff. Is there some calc being done prior to local ?

Looking at your reply once again (after replying) I see buff is just an intermediary. Going to local from global can be done 3 ways:

  • global to private to local

  • global directly to local

  • event_t evt = async_work_group_copy(local_array, &global_array[…], (size_t) nbytes, (event_t) 0);

    wait_group_events(1, &evt);

You show a buff & a tbuff. Is there some calc being done prior to local ?

I’m passing in the local uchar *local_array (i wasn’t aware that you can declare local buffers inside the kernel).

One caveat is that each work item is loading their own structure into local memory (a different slot, hence the rIndex variable added to each local_array index). The information loaded into local mem isn’t shared between work items, but stored there in order to cut down on register usage.

Can you explain that compact syntax (or point me to a reference)? Thanks for your response.

I’m passing in the local uchar *local_array (i wasn’t aware that you can declare local buffers inside the kernel).

One caveat is that each work item is loading their own structure into local memory (a different slot, hence the rIndex variable added to each local_array index). The information loaded into local mem isn’t shared between work items, but stored there in order to cut down on register usage.

Can you explain that compact syntax (or point me to a reference)? Thanks for your response.

basically, the elements of any built-in vector type, which uchar16 is one of, can be assigned, +=, *=, etc with other instances of vector types. They do not even need to have the same length. The … in the example is not actual code. Probably should have been something like:
uchar16 buff = global_array[index];
local uchar16 local_array[MY_WORKGROUP_SZ];
local_array[rIndex] = buff;

See Section 6.1.7 of the OpenCL Spec for other examples. They seem to show only the different length examples. I know for a fact that there is a += to a float4 in my code. The only trick might be accessing later, but it should be:
local_array[index].s0

As far as using local statically, I do that as well. As with all declared arrays though they must be statically size at compile time. If this cannot be done, you have to use the arg version.

basically, the elements of any built-in vector type, which uchar16 is one of, can be assigned, +=, *=, etc with other instances of vector types. They do not even need to have the same length. The … in the example is not actual code. Probably should have been something like:
uchar16 buff = global_array[index];
local uchar16 local_array[MY_WORKGROUP_SZ];
local_array[rIndex] = buff;

See Section 6.1.7 of the OpenCL Spec for other examples. They seem to show only the different length examples. I know for a fact that there is a += to a float4 in my code. The only trick might be accessing later, but it should be:
local_array[index].s0

As far as using local statically, I do that as well. As with all declared arrays though they must be statically size at compile time. If this cannot be done, you have to use the arg version.

Thanks, I’ll take a closer look at that section of the spec. Part of my issue is that I would like random, but iterable access (which I don’t think you can get with the vector types) in this data structure. For example, at some point I’ll need to calculate the index [0-15] for this local_array and then index it as such: local_array[i]. I can’t (and don’t know how) to do that on the fly while using a uchar16: local_array.s5 doesn’t leave me with that ability.

Basically what I’m looking for is a sort of memcopy, or the ability to cast a global or local vector to a global or local array. I know there’s an async copy function, but it isn’t too helpful because each thread has their own workload to simultaneous copy, so I don’t get a speedup.

But as you said before, the copy from register to local_mem is fast, so I think my current solution will stand unless I redesign how I use the local_array.

Thanks, I’ll take a closer look at that section of the spec. Part of my issue is that I would like random, but iterable access (which I don’t think you can get with the vector types) in this data structure. For example, at some point I’ll need to calculate the index [0-15] for this local_array and then index it as such: local_array[i]. I can’t (and don’t know how) to do that on the fly while using a uchar16: local_array.s5 doesn’t leave me with that ability.

Basically what I’m looking for is a sort of memcopy, or the ability to cast a global or local vector to a global or local array. I know there’s an async copy function, but it isn’t too helpful because each thread has their own workload to simultaneous copy, so I don’t get a speedup.

But as you said before, the copy from register to local_mem is fast, so I think my current solution will stand unless I redesign how I use the local_array.

I understand. The capability to do local_array[i] was in the Nvidia implementation, and as far as I know still in the OSX implementation. In 1.0, the wording in 6.1.7 had the text “The numeric indices must be preceded by the letter s or S”. Nvidia removed the ability, in maybe 196.

Had accidentally coded one that way, so I got a new compile error. I can bounce between Win/OSX/Linux with 0 changes since Java is my host language & do not even use a vendor’s SDK. Ran again on OSX and it still worked. Submitted bug there just to see what would happen. Got go away response. Do not know if wording has been softened for 1.1. I agree it could be very useful in some cases, but not really pleased different vendors putting in unique stuff.

I understand. The capability to do local_array[i] was in the Nvidia implementation, and as far as I know still in the OSX implementation. In 1.0, the wording in 6.1.7 had the text “The numeric indices must be preceded by the letter s or S”. Nvidia removed the ability, in maybe 196.

Had accidentally coded one that way, so I got a new compile error. I can bounce between Win/OSX/Linux with 0 changes since Java is my host language & do not even use a vendor’s SDK. Ran again on OSX and it still worked. Submitted bug there just to see what would happen. Got go away response. Do not know if wording has been softened for 1.1. I agree it could be very useful in some cases, but not really pleased different vendors putting in unique stuff.

You can cast to a pointer of the underlying scalar type and index that as usual. Works fine for me, at least.

You can cast to a pointer of the underlying scalar type and index that as usual. Works fine for me, at least.