I have seen some people use shared memory in a kernel for a simple copy of data in global device memory.
Say we have *din and *dout in global device memory. Instead of using
dout[global_id] = din[global_id];
they copy din into shared memory and then assign the shared memory value to the output:
shared[local_id] = din[global_id];
dout[global_id] = shared[local_id];
In my view, we have to read and write global memory anyway, so why bother with shared memory?
Shared memory typically serves one of three purposes:
(1) software-controlled cache; obviously this only makes sense if there is data reuse
(2) passing (or sharing) data between threads in the same thread block (thus the name)
(3) on-the-fly re-layout of data to maximize global memory coalescing (for example, block-wise transpose)
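To make case (3) concrete, here is a minimal sketch of a block-wise transpose. The names transposeTiled, transposeMatches and TILE are my own, and the tile width of 32 is an assumption, not something from this thread. Each block reads a tile with row-wise (coalesced) accesses, parks it in shared memory, and writes it back out row-wise into the transposed position, so the strided access happens in shared memory rather than in global memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 32;  // assumed tile width; one thread per tile element

// Transposes a rows x cols row-major matrix `in` into a cols x rows matrix `out`.
__global__ void transposeTiled(const float *in, float *out, int rows, int cols)
{
    // One extra column of padding so the column-wise read below does not
    // hit the same shared memory bank in every thread of a warp.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // column in `in`
    int y = blockIdx.y * TILE + threadIdx.y;  // row in `in`
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];  // row-wise global read

    __syncthreads();  // the tile must be complete before other threads read it

    x = blockIdx.y * TILE + threadIdx.x;  // column in `out`
    y = blockIdx.x * TILE + threadIdx.y;  // row in `out`
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];  // row-wise global write
}

// Hypothetical host helper: runs the kernel and checks the result on the CPU.
bool transposeMatches(int rows, int cols)
{
    size_t count = static_cast<size_t>(rows) * cols;
    size_t bytes = count * sizeof(float);
    std::vector<float> hIn(count), hOut(count);
    for (size_t i = 0; i < count; ++i)
        hIn[i] = static_cast<float>(i);

    float *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }
    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(dIn, dOut, rows, cols);

    cudaError_t err = cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            if (hOut[static_cast<size_t>(c) * rows + r] != hIn[static_cast<size_t>(r) * cols + c])
                return false;
    return true;
}
```

Calling transposeMatches(100, 70) from main is one way to try it with a matrix whose sides are not multiples of the tile width. Note that this kernel also relies on case (2): every thread writes back an element that a different thread in the block loaded, which is why the __syncthreads() is required.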
None of these applies to scaling a vector in global memory, which is a simple streaming operation. So shared memory is not needed for that (and the overhead may hurt performance, as mfatica pointed out).
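For comparison, the two variants from the question can be written out as complete kernels. This is only a sketch: the names copyDirect, copyStaged, copyVariantsAgree and BLOCK are mine, and the block size of 256 is an assumption. Both kernels issue the same global load and the same global store per thread. The staged one just adds a round trip through shared memory that no other thread ever looks at, so there is no reuse, no sharing and no re-layout to gain from it.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 256;  // assumed block size; must match the launch configuration

// Straight streaming copy: one global load and one global store per thread.
__global__ void copyDirect(const float *din, float *dout, int n)
{
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (global_id < n)
        dout[global_id] = din[global_id];
}

// Same copy staged through shared memory. Each thread only touches its own
// slot, so no __syncthreads() is needed, and nothing is gained either.
__global__ void copyStaged(const float *din, float *dout, int n)
{
    __shared__ float shared[BLOCK];
    int local_id = threadIdx.x;
    int global_id = blockIdx.x * blockDim.x + local_id;
    if (global_id < n) {
        shared[local_id] = din[global_id];
        dout[global_id] = shared[local_id];
    }
}

// Hypothetical host helper: runs both kernels and checks that each output
// equals the input.
bool copyVariantsAgree(int n)
{
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> hIn(n), hDirect(n), hStaged(n);
    for (int i = 0; i < n; ++i)
        hIn[i] = 0.5f * static_cast<float>(i % 1000);

    float *dIn = nullptr, *dDirect = nullptr, *dStaged = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dDirect, bytes) != cudaSuccess) { cudaFree(dIn); return false; }
    if (cudaMalloc(&dStaged, bytes) != cudaSuccess) { cudaFree(dIn); cudaFree(dDirect); return false; }
    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    int grid = (n + BLOCK - 1) / BLOCK;
    copyDirect<<<grid, BLOCK>>>(dIn, dDirect, n);
    copyStaged<<<grid, BLOCK>>>(dIn, dStaged, n);

    cudaError_t e1 = cudaMemcpy(hDirect.data(), dDirect, bytes, cudaMemcpyDeviceToHost);
    cudaError_t e2 = cudaMemcpy(hStaged.data(), dStaged, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dDirect);
    cudaFree(dStaged);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (hDirect[i] != hIn[i] || hStaged[i] != hIn[i])
            return false;
    return true;
}
```

The helper only checks that the results are identical. Whether the staged version is measurably slower depends on the GPU and the compiler, so it is worth timing both (for example with cudaEvent timers) on your own hardware before drawing conclusions.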