My kernel takes an array of integers as input. I bind a 1D texture to it so that I can use tex1Dfetch to speed things up. Each thread works on only one fragment of the whole array. For example, if the input array has 100 elements, the 1st thread handles elements [0…9], the 2nd handles [10…19], etc.
Each thread uses tex1Dfetch to read its data from the texture bound to the whole array. Since each thread knows its absolute index in the grid, it computes the offset for the fetch like this:
nOffset = nThreadIndex * nStride + nPosition
Where nStride (for the case described above) is 10 and nPosition goes from 0 to nStride - 1 in order to enumerate all the elements of the fragment of the whole array.
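To make the indexing concrete, here is a minimal host-side sketch of that offset computation in plain C++ (no GPU needed). The function names blockedOffset and offsetsAtStep are made up for illustration; only the formula itself comes from my kernel.

```cpp
#include <cassert>
#include <vector>

// Offset formula from the post: each thread owns a contiguous
// fragment of nStride elements.
int blockedOffset(int nThreadIndex, int nStride, int nPosition) {
    assert(nPosition >= 0 && nPosition < nStride);
    return nThreadIndex * nStride + nPosition;
}

// Hypothetical helper: the offsets that threads 0..nThreads-1 would
// fetch when they are all at the same nPosition.
std::vector<int> offsetsAtStep(int nThreads, int nStride, int nPosition) {
    std::vector<int> offsets;
    for (int t = 0; t < nThreads; ++t) {
        offsets.push_back(blockedOffset(t, nStride, nPosition));
    }
    return offsets;
}
```

With nStride = 10, offsetsAtStep(3, 10, 0) gives 0, 10, 20, so neighbouring threads that are at the same nPosition fetch elements that are nStride apart.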
The question is: in terms of spatial locality, which is the better way to organize the input array? As in the example above ([piece of data for the 1st thread], [piece of data for the 2nd thread], … where every piece contains nStride elements), or like this:
[element 0 of every thread's piece], [element 1 of every thread's piece], … which gives nStride groups, each holding one element per thread.
For example, say nStride == 3 and the input array has data for three threads (1, 1, 1 for the first one; 2, 2, 2 for the second; 3, 3, 3 for the third). Which is the better way to organize the input array:
111222333 (the first method) or 123123123 (the second method)?
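In case the two layouts are still unclear, here is a small host-side C++ sketch that builds both from the per-thread pieces. The names layoutBlocked, layoutInterleaved and interleavedOffset are hypothetical, and the interleaved offset formula is my assumption of how the second method would be indexed (with nThreads being the total number of threads).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// First method: each thread's piece is stored contiguously.
std::string layoutBlocked(const std::vector<std::string>& pieces) {
    std::string out;
    for (const std::string& p : pieces) {
        out += p;
    }
    return out;
}

// Second method: element 0 of every piece, then element 1 of every
// piece, and so on. All pieces must have the same length (nStride).
std::string layoutInterleaved(const std::vector<std::string>& pieces) {
    assert(!pieces.empty());
    const std::size_t nStride = pieces[0].size();
    std::string out;
    for (std::size_t k = 0; k < nStride; ++k) {
        for (const std::string& p : pieces) {
            assert(p.size() == nStride);
            out += p[k];
        }
    }
    return out;
}

// Assumed fetch offset for the second method.
int interleavedOffset(int nThreadIndex, int nThreads, int nPosition) {
    return nPosition * nThreads + nThreadIndex;
}
```

For the pieces "111", "222", "333" the first function produces 111222333 and the second 123123123. With the second layout, threads at the same nPosition read consecutive offsets (0, 1, 2 for nPosition 0), while with the first layout they read offsets nStride apart.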
Thanks in advance.