Simple global data copy using shared memory: why bother with shared memory when simply copying global data?

Hi, All:

I have seen some people use shared memory in a kernel for a simple copy of global device data.

Say we have *din and *dout in global device memory. Instead of using
dout[global_id] = din[global_id];

they copy din into shared memory and then assign the shared memory value to the output:
shared[local_id] = din[global_id];
dout[global_id] = shared[local_id];

As I see it, we have to read and write global memory either way, so why bother with shared memory?

Is there any performance gain from doing that?

Thanks

You are correct, there is no gain in this case (possibly even a slowdown).
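
For concreteness, here is a minimal sketch of the two variants written out as full CUDA kernels. The kernel names (copy_direct, copy_staged), the host helper copies_match, the float element type and the block size of 256 are my own assumptions for illustration, not taken from the original code. Both kernels issue exactly the same global load and global store per thread; the staged one only adds a shared memory write and read in between.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Direct copy: one global load and one global store per thread.
__global__ void copy_direct(const float* din, float* dout, int n)
{
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (global_id < n)
        dout[global_id] = din[global_id];
}

// Staged copy: the same global load and store, plus a round trip
// through shared memory that no other thread ever looks at.
__global__ void copy_staged(const float* din, float* dout, int n)
{
    extern __shared__ float shared[];  // blockDim.x floats, sized at launch
    int local_id  = threadIdx.x;
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (global_id < n) {
        shared[local_id] = din[global_id];
        // No __syncthreads() needed: each thread reads back only its own slot.
        dout[global_id] = shared[local_id];
    }
}

// Hypothetical host helper: runs both kernels on n > 0 elements and
// checks that each output equals the input.
bool copies_match(int n)
{
    std::vector<float> h_in(n), h_a(n, -1.0f), h_b(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = 0.5f * i;

    float *din = nullptr, *da = nullptr, *db = nullptr;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&din, bytes);
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMemcpy(din, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                          // assumed block size
    int blocks  = (n + threads - 1) / threads;
    copy_direct<<<blocks, threads>>>(din, da, n);
    copy_staged<<<blocks, threads, threads * sizeof(float)>>>(din, db, n);

    cudaMemcpy(h_a.data(), da, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b.data(), db, bytes, cudaMemcpyDeviceToHost);
    cudaFree(din);
    cudaFree(da);
    cudaFree(db);

    for (int i = 0; i < n; ++i)
        if (h_a[i] != h_in[i] || h_b[i] != h_in[i]) return false;
    return true;
}
```

Calling copies_match(1000) from main should return true on a working device. The two kernels can then be timed (for example with cudaEvent timers) to see whether the staging costs anything on a given GPU.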

What if we do a simple scaling, dout[global_id] = din[global_id]*global_id?
Do we gain anything from using shared memory? I guess still not.

The main uses of shared memory are:

(1) software controlled cache; obviously this only makes sense if there is data re-use
(2) passing (or sharing) data between threads in the same thread block (thus the name)
(3) on the fly re-layout of data to maximize global memory coalescing (for example, block-wise transpose)

None of these applies to scaling a vector in global memory, which is a simple streaming operation. So shared memory is not needed for that (and the overhead may hurt performance, as mfatica pointed out).
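
By contrast, here is a rough sketch of a case where shared memory does earn its keep, the block-wise transpose from use (3). The names transpose_tiled and transpose_is_correct, the 32x32 tile size and the row-major float layout are assumptions made for this example. Each block reads a tile with consecutive threads touching consecutive input addresses, then writes it out with consecutive threads touching consecutive output addresses; the swap of row and column happens inside shared memory, where strided access is cheap.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 32  // assumed tile size; launch with TILE x TILE threads per block

// Transpose a rows x cols row-major matrix din into a cols x rows
// row-major matrix dout.
__global__ void transpose_tiled(const float* din, float* dout, int rows, int cols)
{
    // The extra column of padding avoids shared memory bank conflicts
    // when the tile is read column-wise below.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // column in din
    int y = blockIdx.y * TILE + threadIdx.y;  // row in din
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = din[y * cols + x];  // coalesced read

    // Threads read slots written by other threads, so a barrier is required.
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;  // column in dout (a row index of din)
    y = blockIdx.x * TILE + threadIdx.y;  // row in dout (a column index of din)
    if (x < rows && y < cols)
        dout[y * rows + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}

// Hypothetical host helper: transposes a test matrix on the device and
// compares every element against the expected value.
bool transpose_is_correct(int rows, int cols)
{
    size_t count = (size_t)rows * cols;
    size_t bytes = count * sizeof(float);
    std::vector<float> h_in(count), h_out(count, -1.0f);
    for (size_t i = 0; i < count; ++i) h_in[i] = (float)i;

    float *din = nullptr, *dout = nullptr;
    cudaMalloc(&din, bytes);
    cudaMalloc(&dout, bytes);
    cudaMemcpy(din, h_in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transpose_tiled<<<grid, block>>>(din, dout, rows, cols);

    cudaMemcpy(h_out.data(), dout, bytes, cudaMemcpyDeviceToHost);
    cudaFree(din);
    cudaFree(dout);

    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            if (h_out[(size_t)c * rows + r] != h_in[(size_t)r * cols + c])
                return false;
    return true;
}
```

Unlike the plain copy, a transpose without the shared tile has to make either its reads or its writes strided in global memory, which is exactly the situation the re-layout is meant to avoid.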

Thanks a lot.