I have another question regarding how to use shared memory. To simplify the case, say we have two arrays, float A[] and float B[]. Each element of B depends on the values of three consecutive elements of A, e.g. B[0] = a0*A[0] + a1*A[1] + a2*A[2], B[1] = a1*A[1] + a2*A[2] + a3*A[3], … How do I allocate appropriate shared memory for array A?
The separable convolution example in the SDK addresses a similar problem. One solution is to add extra threads to each thread block, so that every thread reads exactly one value into the shared memory array, including the few extra elements needed past the end of the block's own outputs. Once the loads are done (and the block has synchronized with __syncthreads()), put a conditional around the summing step so the extra threads sit out the rest of the kernel.
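To make that concrete, here is a rough sketch of the idea, not the SDK's actual code. It assumes three fixed weights w0, w1, w2 and that A holds two more elements than B; the names stencil3, stencil3_host, TILE and HALO are made up for illustration. If your coefficients vary per element as in the question, you could multiply them in while loading into shared memory instead.

```cuda
#include <cuda_runtime.h>

#define TILE 256   // outputs per block (assumed value, tune for your GPU)
#define HALO 2     // extra inputs per block: B[i] reads A[i], A[i+1], A[i+2]

// Launch with TILE + HALO threads per block. Every thread loads one element
// of A into shared memory; only the first TILE threads write an output.
__global__ void stencil3(const float* A, float* B, int nOut,
                         float w0, float w1, float w2)
{
    __shared__ float tile[TILE + HALO];

    int t   = threadIdx.x;              // 0 .. TILE + HALO - 1
    int g   = blockIdx.x * TILE + t;    // element of A this thread loads
    int nIn = nOut + HALO;              // assumption: A has nOut + 2 elements

    tile[t] = (g < nIn) ? A[g] : 0.0f;  // guard the last, partial block
    __syncthreads();                    // all loads finish before any reads

    // The HALO extra threads sit out the summing step.
    if (t < TILE && g < nOut)
        B[g] = w0 * tile[t] + w1 * tile[t + 1] + w2 * tile[t + 2];
}

// Hypothetical host wrapper: copies data over, launches, copies back.
// Error checking is omitted to keep the sketch short.
void stencil3_host(const float* hA, float* hB, int nOut,
                   float w0, float w1, float w2)
{
    float *dA = nullptr, *dB = nullptr;
    cudaMalloc(&dA, (nOut + HALO) * sizeof(float));
    cudaMalloc(&dB, nOut * sizeof(float));
    cudaMemcpy(dA, hA, (nOut + HALO) * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (nOut + TILE - 1) / TILE;
    stencil3<<<blocks, TILE + HALO>>>(dA, dB, nOut, w0, w1, w2);

    cudaMemcpy(hB, dB, nOut * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
}
```

The shared array is sized TILE + HALO because neighbouring blocks overlap by two input elements. Those two values get loaded twice from global memory, which is the price of keeping the load step to one element per thread with no branching.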