Here is my situation. I have, say, 1024 data elements per block and a block size of 256, so each thread processes 4 elements (all 1024 are first copied into local memory). At the end, each thread writes its 4 elements from local memory back to global memory. Besides that, I need to write some 16 other values to global memory. I’d like your suggestions on whether I could speed this up with async_work_group_copy (local->global): once the final 1024 elements are computed, there is another, independent computation for the additional 16 values, and it doesn’t matter which set (‘1024’ or ‘16’) gets transferred first. In pseudo-code it looks something like this:

1. Load 1024 elements per block, 4 per thread to local mem.

2. Do computations.

3. Store 1024 elements per block, 4 per thread back to global mem. // async_work_group_copy here???

4. Do some more computations.

5. Store 16 elements (of different type and semantics, not important) from local to global mem.
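To make the five steps concrete, here is a minimal sketch of how such a kernel might look, with the OpenCL C source carried as a string literal in host-side C. Everything named here (the kernel `sketch`, the buffers `in`, `out`, `extraOut`, the constants `ELEMS` and `EXTRA`, and the `+= 1` stand-in computation) is made up for illustration, and it assumes a 1-D launch with a local size of 256. Whether the copy in step 3 really overlaps with step 4 depends on the implementation.

```c
#include <string.h>

/* OpenCL C source for a hypothetical kernel; all names are illustrative. */
static const char *const kSketchKernelSrc =
"#define ELEMS 1024  /* elements per work-group */\n"
"#define EXTRA 16    /* additional values per work-group */\n"
"__kernel void sketch(__global const int *in,\n"
"                     __global int *out,\n"
"                     __global int2 *extraOut)\n"
"{\n"
"    __local int  data[ELEMS];\n"
"    __local int2 extra[EXTRA];\n"
"    const size_t lid  = get_local_id(0);\n"
"    const size_t lsz  = get_local_size(0);   /* assumed to be 256 */\n"
"    const size_t base = get_group_id(0) * ELEMS;\n"
"\n"
"    /* 1. Load 1024 elements, 4 per work-item. */\n"
"    for (size_t i = lid; i < ELEMS; i += lsz)\n"
"        data[i] = in[base + i];\n"
"    barrier(CLK_LOCAL_MEM_FENCE);\n"
"\n"
"    /* 2. Placeholder computation on this work-item's elements. */\n"
"    for (size_t i = lid; i < ELEMS; i += lsz)\n"
"        data[i] += 1;\n"
"    /* All results must be in local memory before the copy starts. */\n"
"    barrier(CLK_LOCAL_MEM_FENCE);\n"
"\n"
"    /* 3. Start the local-to-global copy; every work-item makes the same call. */\n"
"    event_t ev = async_work_group_copy(out + base, (const __local int *)data,\n"
"                                       (size_t)ELEMS, (event_t)0);\n"
"\n"
"    /* 4. Independent computation; it only reads data[]. */\n"
"    if (lid < EXTRA)\n"
"        extra[lid] = (int2)(data[lid], (int)lid);\n"
"    barrier(CLK_LOCAL_MEM_FENCE);\n"
"\n"
"    /* 5. Copy the 16 extra values, sharing the event of the first copy. */\n"
"    ev = async_work_group_copy(extraOut + get_group_id(0) * EXTRA,\n"
"                               (const __local int2 *)extra, (size_t)EXTRA, ev);\n"
"    wait_group_events(1, &ev);\n"
"}\n";
```

The barrier before step 3 matters: the copy reads the whole of `data[]`, so every work-item has to have finished writing its 4 elements first. Step 4 only reads `data[]`, so it should be safe to run while the copy is in flight.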

The documentation says that async_work_group_copy is executed by all work items (threads) - do they somehow split/share the range of the data to be transferred?
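As far as I know, the spec only requires that every work-item in the group reaches the call with the same arguments; how the range is divided is left to the implementation. One plausible scheme, and the one a hand-written cooperative copy usually follows, is a strided loop. The plain C below only simulates that idea on the host (it is not OpenCL and not what any particular driver does); `copy_slice` and `simulate_group_copy` are hypothetical names.

```c
#include <stddef.h>

/* Part of the copy done by one simulated work-item: elements
 * local_id, local_id + local_size, local_id + 2*local_size, ...
 * Returns how many elements this work-item copied. */
static size_t copy_slice(int *dst, const int *src, size_t n,
                         size_t local_id, size_t local_size)
{
    size_t copied = 0;
    for (size_t i = local_id; i < n; i += local_size) {
        dst[i] = src[i];
        ++copied;
    }
    return copied;
}

/* Run every simulated work-item in turn. On a device they would run
 * concurrently; the slices are disjoint, so the order does not matter. */
static void simulate_group_copy(int *dst, const int *src, size_t n,
                                size_t local_size)
{
    for (size_t id = 0; id < local_size; ++id)
        copy_slice(dst, src, n, id, local_size);
}
```

With 1024 elements and 256 work-items, each simulated work-item moves 4 elements; with 16 elements, only the first 16 work-items move anything and the rest do nothing.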

Might this be a compiler bug? 'Cuz the AMD compiler compiles it OK!

event_t localHistCopyEvent = async_work_group_copy((__global int2*)localHistCopy, (__local const int2*)localHist, 16, 0);

(the type-casting is there just to show you the types of the arguments)

Compiler error:

:705: error: no matching overload found for arguments of type 'int2 __attribute__((address_space(1)))*, int2 __attribute__((address_space(3)))const*, int, int'

  event_t localHistCopyEvent = async_work_group_copy((__global int2*)localHistCopy, (__local int2*)localHist, 16, 0);


The reference says it should support copies both from __global to __local and vice versa. Drivers 197.13. Now what? :o

EDIT: it accepts this code now:

event_t localHistCopyEvent = async_work_group_copy((__global int2*)localHistCopy, (const __local int2*)localHist, (size_t)16, (event_t)0);

Weird! :(

I am not sure whose documentation you are referring to, but I am guessing NVIDIA’s. I wrote one of my kernels so that the input is copied into local memory using either method, with a #define that determines which one is used. After a lot of averaged timings (1000 runs after warm-up), the difference was very small. I recently re-ran both tests on Fermi, with the same result.

Coding it both ways and using a compiler option to set the macro may allow for optimizing on platforms where this has some effect (see the Khronos board, where Cell is mentioned in this regard).
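A rough sketch of that setup, under a few assumptions: the macro name `USE_ASYNC_COPY`, the kernel `load_tile`, and the helper `build_options` are all invented for this example, and the kernel just copies the tile back out so the two variants are comparable. The kernel source is held as a string in host-side C, and the returned options string is what would go into the `options` argument of `clBuildProgram`.

```c
#include <string.h>

/* OpenCL C source with both load variants; USE_ASYNC_COPY is a
 * hypothetical macro chosen at build time. */
static const char *const kLoadKernelSrc =
"__kernel void load_tile(__global const int *in, __global int *out)\n"
"{\n"
"    __local int tile[1024];\n"
"    const size_t base = get_group_id(0) * 1024;\n"
"#ifdef USE_ASYNC_COPY\n"
"    /* Variant A: let the built-in perform the global-to-local copy. */\n"
"    event_t ev = async_work_group_copy(tile, in + base,\n"
"                                       (size_t)1024, (event_t)0);\n"
"    wait_group_events(1, &ev);\n"
"#else\n"
"    /* Variant B: each work-item copies its own strided share. */\n"
"    for (size_t i = get_local_id(0); i < 1024; i += get_local_size(0))\n"
"        tile[i] = in[base + i];\n"
"    barrier(CLK_LOCAL_MEM_FENCE);\n"
"#endif\n"
"    /* Placeholder use of the tile so both variants do the same work. */\n"
"    for (size_t i = get_local_id(0); i < 1024; i += get_local_size(0))\n"
"        out[base + i] = tile[i];\n"
"}\n";

/* Options string for the `options` argument of clBuildProgram.
 * "-D name" predefines the macro; an empty string leaves it undefined. */
static const char *build_options(int use_async_copy)
{
    return use_async_copy ? "-D USE_ASYNC_COPY" : "";
}
```

Building the same source twice, once with `build_options(1)` and once with `build_options(0)`, gives two programs that can be timed against each other on each platform.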