I have a kernel whose signature and launch configuration are given below. How can I use shared memory effectively to gain further performance?
Right now, I am accessing all of the elements through global memory.
func<<<625, 256>>>(float* outptr, float* arguments)
The output array has 160,000 elements (625 blocks x 256 threads, so one element per thread).
In my application, each array element is processed by a single thread, but the mapping from thread index to element is effectively random (thread 1 might process element 1, thread 2 might process element 3, and so on). Because I am reading from global memory, this takes a huge amount of time.
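To make the access pattern concrete, here is a minimal sketch of the kind of kernel described above. It is not the real code: the index array `perm`, the extra `n` parameter, the `* 2.0f` body, and the helper `countMismatches` are all hypothetical placeholders, and CUDA error checking is left out for brevity. It only shows one thread touching one element through a scattered index in global memory.

```cuda
#include <cuda_runtime.h>
#include <numeric>
#include <random>
#include <algorithm>
#include <vector>

constexpr int kBlocks  = 625;
constexpr int kThreads = 256;
constexpr int kN       = kBlocks * kThreads;  // 160,000 elements

// Hypothetical stand-in for the real kernel: each thread handles exactly one
// element, chosen through a scattered index, with all accesses in global memory.
__global__ void func(float* outptr, const float* arguments,
                     const int* perm, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        int idx = perm[tid];                    // "random" element for this thread
        outptr[idx] = arguments[idx] * 2.0f;    // placeholder computation
    }
}

// Runs the sketch once and returns how many outputs differ from the
// value computed on the host (0 means the kernel did what was expected).
int countMismatches()
{
    std::vector<int> perm(kN);
    std::iota(perm.begin(), perm.end(), 0);
    std::mt19937 rng(42);
    std::shuffle(perm.begin(), perm.end(), rng);  // assumed random mapping

    std::vector<float> in(kN), out(kN, 0.0f);
    for (int i = 0; i < kN; ++i) in[i] = static_cast<float>(i);

    float* dIn = nullptr;
    float* dOut = nullptr;
    int* dPerm = nullptr;
    cudaMalloc(&dIn,   kN * sizeof(float));
    cudaMalloc(&dOut,  kN * sizeof(float));
    cudaMalloc(&dPerm, kN * sizeof(int));
    cudaMemcpy(dIn, in.data(), kN * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dPerm, perm.data(), kN * sizeof(int), cudaMemcpyHostToDevice);

    func<<<kBlocks, kThreads>>>(dOut, dIn, dPerm, kN);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(out.data(), dOut, kN * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dPerm);

    int bad = 0;
    for (int i = 0; i < kN; ++i)
        if (out[i] != 2.0f * in[i]) ++bad;
    return bad;
}
```

Calling `countMismatches()` from a `main` function and compiling with `nvcc` should be enough to reproduce the pattern. Since neighbouring threads in a warp land on unrelated addresses here, the global loads and stores are presumably not coalesced, which may be where the time goes.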
How can I keep my elements in shared memory to gain performance?