OK, maybe I am using the wrong phrase. Basically, my question is this: I have an array of data and I want to perform an operation in which every thread needs to access two elements of the same array, for example averaging array[thx] and array[thx+1].
How can I arrange the memory so that every thread in each block can compute the average? Obviously the last thread of a block can't access array[thx+1] within that block's portion of the data, and that is the problem. I know that for this case we can use overlapping, but what if what we are trying to compute is the following:
Here we won't be able to use overlapping, because all the values of each block need to be calculated first, and only then can the last two elements be passed on to the next block.
I was thinking that we should then call the kernel in a loop and use only one block. Is there a better way of doing it?
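For reference, here is a rough sketch of what I mean by overlapping in the simple averaging case. It is only a sketch under my own assumptions: the names pairAverage, pairAverageHost and BLOCK are made up for illustration, the block size of 256 is arbitrary, and error checking is left out. Each block loads its own elements into shared memory plus one extra element from the start of the next block's range, so the last thread still has a right-hand neighbour.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // threads per block (arbitrary choice for this sketch)

// out[i] = (in[i] + in[i+1]) / 2 for i = 0 .. n-2.
// Each block stages BLOCK elements plus one overlapping element in shared memory.
// Assumes the kernel is launched with blockDim.x == BLOCK.
__global__ void pairAverage(const float* in, float* out, int n)
{
    __shared__ float tile[BLOCK + 1];

    int thx = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + thx;

    // Every in-range thread loads its own element.
    if (gid < n) tile[thx] = in[gid];

    // The last thread of the block also loads the overlapping element,
    // i.e. the first element of the next block's range (if it exists).
    if (thx == blockDim.x - 1 && gid + 1 < n) tile[thx + 1] = in[gid + 1];

    __syncthreads();  // reached by all threads, so the tile is complete

    // The very last element of the array has no right neighbour, so skip it.
    if (gid + 1 < n) out[gid] = 0.5f * (tile[thx] + tile[thx + 1]);
}

// Hypothetical host-side helper: copies data over, launches the kernel,
// and returns the n-1 averages. No CUDA error checking in this sketch.
std::vector<float> pairAverageHost(const std::vector<float>& h_in)
{
    int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n > 0 ? n - 1 : 0);
    if (n < 2) return h_out;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, (n - 1) * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK - 1) / BLOCK;
    pairAverage<<<blocks, BLOCK>>>(d_in, d_out, n);

    // cudaMemcpy waits for the kernel to finish before copying back.
    cudaMemcpy(h_out.data(), d_out, (n - 1) * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

This works because every output only depends on the input array, so blocks never need each other's results. That is exactly what is missing in the second case, where a block needs values that the previous block has to compute first.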
I hope I was able to make my point :)