tex1Dfetch and spatial locality: how should the 'spatial' part be understood?

Hi All,

My kernel takes an array of integers as input. I bind a 1D texture to it in order to use tex1Dfetch to speed things up. Each thread works only with a fragment of the whole array. Say, if the input array consists of 100 elements, then the 1st thread will handle elements [0…9], the 2nd elements [10…19], etc.

Each thread uses tex1Dfetch to read the data from the texture bound to the whole array. As each thread knows its absolute index in the grid, it computes the offset for the fetch like this:

nOffset = nThreadIndex * nStride + nPosition

where nStride (for the case described above) is 10 and nPosition goes from 0 to nStride - 1 in order to enumerate all the elements of the thread's fragment of the array.
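For concreteness, here is a minimal sketch of what I mean (the kernel body is simplified to a plain sum; g_texInput, processBlocked and pOutput are just illustrative names, and I'm using the legacy texture-reference API):

```
texture<int, 1, cudaReadModeElementType> g_texInput;

__global__ void processBlocked(int *pOutput, int nStride)
{
    int nThreadIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int nResult = 0;

    // Each thread walks its own contiguous fragment of the bound array.
    for (int nPosition = 0; nPosition < nStride; ++nPosition)
    {
        int nOffset = nThreadIndex * nStride + nPosition;
        nResult += tex1Dfetch(g_texInput, nOffset);
    }

    pOutput[nThreadIndex] = nResult;
}
```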

The question is: in terms of 'spatial locality', how is it better to organize the input array? Like in the example above ([piece of data for the 1st thread], [piece of data for the 2nd thread], …, where each piece contains nStride elements), or like this:

[set of the 0th elements from all threads' pieces], [set of the 1st elements from all threads' pieces], …; there are nStride such sets, each containing one element per thread.

For example: say nStride == 3 and the input array holds data for three threads (1, 1, 1 for the first one; 2, 2, 2 for the second; and 3, 3, 3 for the third). How is it better to organize the input array?

111222333 (the first method described) or 123123123 (the second one)?
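To make the two layouts concrete, here is a tiny host-side sketch (all names are mine) that builds both orderings for this 3-thread, nStride == 3 example:

```
#include <stdio.h>

int main(void)
{
    const int nThreads = 3, nStride = 3;
    int blocked[9], interleaved[9];

    for (int t = 0; t < nThreads; ++t)
        for (int p = 0; p < nStride; ++p)
        {
            int value = t + 1;                       /* 1,1,1 / 2,2,2 / 3,3,3 */
            blocked[t * nStride + p]      = value;   /* -> 1 1 1 2 2 2 3 3 3  */
            interleaved[p * nThreads + t] = value;   /* -> 1 2 3 1 2 3 1 2 3  */
        }

    for (int i = 0; i < 9; ++i) printf("%d", blocked[i]);
    printf("\n");
    for (int i = 0; i < 9; ++i) printf("%d", interleaved[i]);
    printf("\n");
    return 0;
}
```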

Thanks in advance.

The best access pattern is to have good spatial locality within the accesses made simultaneously by a warp. So your 2nd method will be faster.
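Roughly speaking, with the interleaved layout your fetch loop would look something like this (only a sketch, reusing the illustrative names from the earlier snippet; nNumThreads is assumed to be the total number of threads):

```
texture<int, 1, cudaReadModeElementType> g_texInput;

__global__ void processInterleaved(int *pOutput, int nStride, int nNumThreads)
{
    int nThreadIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int nResult = 0;

    // At each nPosition, adjacent threads of a warp fetch adjacent elements,
    // so the accesses made at the same time are tightly packed.
    for (int nPosition = 0; nPosition < nStride; ++nPosition)
    {
        int nOffset = nPosition * nNumThreads + nThreadIndex;
        nResult += tex1Dfetch(g_texInput, nOffset);
    }

    pOutput[nThreadIndex] = nResult;
}
```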

Is there a reason you can't read this regularly spaced data with coalesced global loads, without textures? Or is this an oversimplification of your real problem?

Very significant oversimplification.

Each portion of data in this array describes an expression in reverse Polish notation. The portions may have different lengths, and each expression may be very complex, so the kernel goes through it many, many times. As each expression must be evaluated on many data sets, usually all the threads in a warp evaluate the same expression on different data.
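To give a rough idea of what the kernel does (very much a sketch; pnExprStart, pnExprLength, g_texExpr and the placeholder "evaluation" line are made up for illustration): since all threads of a warp walk the same expression, every tex1Dfetch in the token loop hits the same address, and only the per-thread data sets differ.

```
texture<int, 1, cudaReadModeElementType> g_texExpr;

__global__ void evalExpressions(const int   *pnExprStart,   // first token of each expression
                                const int   *pnExprLength,  // token count of each expression
                                const float *pfDataSets,    // one data set per thread (layout assumed)
                                float       *pfResults,
                                int          nExprId)       // expression evaluated by this launch
{
    int nThreadIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int nStart  = pnExprStart[nExprId];
    int nLength = pnExprLength[nExprId];
    float fAccum = 0.0f;

    for (int i = 0; i < nLength; ++i)
    {
        // Same address for every thread in the warp.
        int nToken = tex1Dfetch(g_texExpr, nStart + i);
        // The real kernel interprets nToken against this thread's data set;
        // the line below is only a placeholder.
        fAccum += (float)nToken * pfDataSets[nThreadIndex];
    }

    pfResults[nThreadIndex] = fAccum;
}
```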

My question is more theoretical than practical; due to the nature of the data portions, it is not easy to rearrange them in the 123123 way.

Well, you will get some benefit from the texture cache with the first layout too, as long as the stride between the elements read by neighbouring threads is not too large.

I've never actually gotten around to benchmarking texture read performance vs. the stride of the read, so I can't tell you how large is too large… I'd guess a stride of 4 would be OK, but 12 might be getting too large; that is only a guess, though.
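If you want to measure it yourself, something along these lines should do (a rough sketch only, using the legacy texture-reference API and CUDA events; all names are mine and error checking is omitted):

```
#include <cstdio>

texture<int, 1, cudaReadModeElementType> g_texBench;

// Each thread reads nElemsPerThread values; at any instant, neighbouring
// threads are nStride elements apart, which is the quantity in question.
__global__ void strideRead(int *pOut, int nStride, int nElemsPerThread)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int nSum = 0;
    for (int p = 0; p < nElemsPerThread; ++p)
        nSum += tex1Dfetch(g_texBench, t * nStride + p);
    pOut[t] = nSum;
}

int main()
{
    const int nBlocks = 128, nThreadsPerBlock = 256;
    const int nThreads = nBlocks * nThreadsPerBlock;
    const int nElemsPerThread = 16, nMaxStride = 16;
    const size_t nElems = (size_t)nThreads * nMaxStride + nElemsPerThread;

    int *d_in, *d_out;
    cudaMalloc(&d_in,  nElems   * sizeof(int));
    cudaMalloc(&d_out, nThreads * sizeof(int));
    cudaMemset(d_in, 0, nElems * sizeof(int));
    cudaBindTexture(0, g_texBench, d_in, nElems * sizeof(int));

    cudaEvent_t evStart, evStop;
    cudaEventCreate(&evStart);
    cudaEventCreate(&evStop);

    for (int nStride = 1; nStride <= nMaxStride; nStride *= 2)
    {
        cudaEventRecord(evStart);
        strideRead<<<nBlocks, nThreadsPerBlock>>>(d_out, nStride, nElemsPerThread);
        cudaEventRecord(evStop);
        cudaEventSynchronize(evStop);
        float fMs = 0.0f;
        cudaEventElapsedTime(&fMs, evStart, evStop);
        printf("stride %2d : %.3f ms\n", nStride, fMs);
    }

    cudaUnbindTexture(g_texBench);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```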