Unless you're using device emulation, your method will not work at all: the GPU's memory is not directly addressable by the CPU. Dereferencing a device pointer on the host will most likely segfault and crash.
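For reference, here's a minimal sketch of the pattern that does work (the kernel and array names here are just placeholders): copy data to the device, launch the kernel, then copy results back with cudaMemcpy before reading them on the host.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder kernel: scale each element by s.
__global__ void scale(float *d_A, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_A[i] *= s;
}

int main()
{
    const int N = 10;
    float h_A[N];
    for (int i = 0; i < N; ++i) h_A[i] = (float)i;

    float *d_A;
    cudaMalloc((void **)&d_A, N * sizeof(float));
    cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<1, N>>>(d_A, 2.0f, N);

    // WRONG: printf("%f\n", d_A[0]);  // host dereference of a device pointer
    // RIGHT: copy the results back first, then read the host copy.
    cudaMemcpy(h_A, d_A, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", h_A[0]);

    cudaFree(d_A);
    return 0;
}
```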
I have another question about how to use shared memory. To simplify the case, say we have two arrays, float A[10] and float B[10]. Each element of B depends on the values of three consecutive elements of A, e.g. B[1] = a0*A[0] + a1*A[1] + a2*A[2], B[2] = a1*A[1] + a2*A[2] + a3*A[3], and so on. How do I allocate appropriate shared memory for array A?
If you look at the separable convolution example in the SDK, it addresses a similar problem. One solution is to add extra threads to your thread blocks: each thread is responsible for reading a single value into the shared memory array, so the block loads its tile of A plus the halo elements on either side. Then put a conditional around the summing step so the extra threads sit out for the rest of the kernel, as in the sketch below.
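Here's a minimal sketch of that approach for the 3-tap case above. It assumes the coefficients a0, a1, ... live in a device array a[]; the kernel name and TILE size are placeholders, not taken from the SDK example. Each block produces TILE outputs and is launched with TILE + 2 threads, so every thread loads exactly one element of A into shared memory.

```cpp
#define TILE 256

// Computes B[i] = a[i-1]*A[i-1] + a[i]*A[i] + a[i+1]*A[i+1]
// for the interior elements i = 1 .. n-2.
__global__ void stencil3(const float *A, const float *a, float *B, int n)
{
    __shared__ float sA[TILE + 2];   // TILE outputs + 2 halo elements

    // Global index of the A element this thread loads; the block's load
    // window starts one element before its first output, covering the halo.
    int load = blockIdx.x * TILE + threadIdx.x;
    if (load < n)
        sA[threadIdx.x] = A[load];

    __syncthreads();

    // Only the first TILE threads compute; the 2 extra threads sit out.
    int out = blockIdx.x * TILE + threadIdx.x + 1;   // index into B
    if (threadIdx.x < TILE && out < n - 1)
        B[out] = a[out - 1] * sA[threadIdx.x]
               + a[out]     * sA[threadIdx.x + 1]
               + a[out + 1] * sA[threadIdx.x + 2];
}
```

You'd launch it with one extra pair of threads per block, e.g. `stencil3<<<(n - 2 + TILE - 1) / TILE, TILE + 2>>>(d_A, d_a, d_B, n);`. The boundary elements B[0] and B[n-1] are left untouched, since they would need A[-1] and A[n].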