tex1Dfetch and spatial locality: how should the 'spatial' part be understood?

Hi All,

My kernel takes an array of integers as input. I bind a 1D texture to it in order to use tex1Dfetch to speed things up. Each thread works only with a fragment of the whole array. Say, if the input array consists of 100 elements, then the 1st thread will handle elements [0…9], the 2nd elements [10…19], etc.

Each thread uses tex1Dfetch to read the data from the texture bound to the whole array. As each thread knows its absolute index in the grid, it computes the offset for the fetch like this:

nOffset = nThreadIndex * nStride + nPosition

where nStride (for the case described above) is 10 and nPosition goes from 0 to nStride - 1 in order to enumerate all the elements of the thread's fragment of the array.
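For concreteness, here is a minimal sketch of what I mean (the kernel body is simplified to a plain sum; g_texInput, processBlocked and pOutput are just illustrative names, and I'm using the legacy texture-reference API):

```
texture<int, 1, cudaReadModeElementType> g_texInput;

__global__ void processBlocked(int *pOutput, int nStride)
{
    int nThreadIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int nResult = 0;

    // Each thread walks its own contiguous fragment of the bound array.
    for (int nPosition = 0; nPosition < nStride; ++nPosition)
    {
        int nOffset = nThreadIndex * nStride + nPosition;
        nResult += tex1Dfetch(g_texInput, nOffset);
    }

    pOutput[nThreadIndex] = nResult;
}
```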

The question is: in terms of 'spatial locality', how is it better to organize the input array? Like in the example above ([piece of data for the 1st thread], [piece of data for the 2nd thread], …, where each piece contains nStride elements), or like this:

[set of the 0th elements from all threads' pieces], [set of the 1st elements from all threads' pieces], …; there are nStride such sets, each containing one element per thread.

For example: say nStride == 3 and the input array holds data for three threads (1, 1, 1 for the first one; 2, 2, 2 for the second; and 3, 3, 3 for the third). How is it better to organize the input array?

111222333 (the first method described) or 123123123 (the second one)?
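To make the two layouts concrete, here is a tiny host-side sketch (all names are mine) that builds both orderings for this 3-thread, nStride == 3 example:

```
#include <stdio.h>

int main(void)
{
    const int nThreads = 3, nStride = 3;
    int blocked[9], interleaved[9];

    for (int t = 0; t < nThreads; ++t)
        for (int p = 0; p < nStride; ++p)
        {
            int value = t + 1;                       /* 1,1,1 / 2,2,2 / 3,3,3 */
            blocked[t * nStride + p]      = value;   /* -> 1 1 1 2 2 2 3 3 3  */
            interleaved[p * nThreads + t] = value;   /* -> 1 2 3 1 2 3 1 2 3  */
        }

    for (int i = 0; i < 9; ++i) printf("%d", blocked[i]);
    printf("\n");
    for (int i = 0; i < 9; ++i) printf("%d", interleaved[i]);
    printf("\n");
    return 0;
}
```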

Thanks in advance.

The best access pattern is to have good spatial locality within the accesses made simultaneously by a warp. So your 2nd method will be faster.
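Roughly speaking, with the interleaved layout your fetch loop would look something like this (only a sketch, reusing the illustrative names from the earlier snippet; nNumThreads is assumed to be the total number of threads):

```
texture<int, 1, cudaReadModeElementType> g_texInput;

__global__ void processInterleaved(int *pOutput, int nStride, int nNumThreads)
{
    int nThreadIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int nResult = 0;

    // At each nPosition, adjacent threads of a warp fetch adjacent elements,
    // so the accesses made at the same time are tightly packed.
    for (int nPosition = 0; nPosition < nStride; ++nPosition)
    {
        int nOffset = nPosition * nNumThreads + nThreadIndex;
        nResult += tex1Dfetch(g_texInput, nOffset);
    }

    pOutput[nThreadIndex] = nResult;
}
```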

Is there a reason you can't read this regularly spaced data with coalesced global loads, without textures? Or is this an oversimplification of your real problem?

Very significant oversimplification.

Each portion of data in this array describes an expression in reverse Polish notation. The portions may have different lengths, and each expression may be very complex, so the kernel goes through it many, many times. As each expression must be evaluated on many data sets, usually all the threads in a warp evaluate the same expression on different data.
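To give a rough idea of what the kernel does (very much a sketch; pnExprStart, pnExprLength, g_texExpr and the placeholder "evaluation" line are made up for illustration): since all threads of a warp walk the same expression, every tex1Dfetch in the token loop hits the same address, and only the per-thread data sets differ.

```
texture<int, 1, cudaReadModeElementType> g_texExpr;

__global__ void evalExpressions(const int   *pnExprStart,   // first token of each expression
                                const int   *pnExprLength,  // token count of each expression
                                const float *pfDataSets,    // one data set per thread (layout assumed)
                                float       *pfResults,
                                int          nExprId)       // expression evaluated by this launch
{
    int nThreadIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int nStart  = pnExprStart[nExprId];
    int nLength = pnExprLength[nExprId];
    float fAccum = 0.0f;

    for (int i = 0; i < nLength; ++i)
    {
        // Same address for every thread in the warp.
        int nToken = tex1Dfetch(g_texExpr, nStart + i);
        // The real kernel interprets nToken against this thread's data set;
        // the line below is only a placeholder.
        fAccum += (float)nToken * pfDataSets[nThreadIndex];
    }

    pfResults[nThreadIndex] = fAccum;
}
```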

My question is more theoretical than practical; due to the nature of the data portions, it is not easy to rearrange them in the 123123 way.

Well, you will get some benefit from the texture cache with the first layout too, as long as the stride between the elements read by neighbouring threads is not too large.

I've never actually gotten around to benchmarking texture read performance vs. the stride of the read, so I can't tell you how large is too large… I'd guess a stride of 4 would be OK, but 12 might be getting too large; that is only a guess, though.
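If you want to measure it yourself, something along these lines should do (a rough sketch only, using the legacy texture-reference API and CUDA events; all names are mine and error checking is omitted):

```
#include <cstdio>

texture<int, 1, cudaReadModeElementType> g_texBench;

// Each thread reads nElemsPerThread values; at any instant, neighbouring
// threads are nStride elements apart, which is the quantity in question.
__global__ void strideRead(int *pOut, int nStride, int nElemsPerThread)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int nSum = 0;
    for (int p = 0; p < nElemsPerThread; ++p)
        nSum += tex1Dfetch(g_texBench, t * nStride + p);
    pOut[t] = nSum;
}

int main()
{
    const int nBlocks = 128, nThreadsPerBlock = 256;
    const int nThreads = nBlocks * nThreadsPerBlock;
    const int nElemsPerThread = 16, nMaxStride = 16;
    const size_t nElems = (size_t)nThreads * nMaxStride + nElemsPerThread;

    int *d_in, *d_out;
    cudaMalloc(&d_in,  nElems   * sizeof(int));
    cudaMalloc(&d_out, nThreads * sizeof(int));
    cudaMemset(d_in, 0, nElems * sizeof(int));
    cudaBindTexture(0, g_texBench, d_in, nElems * sizeof(int));

    cudaEvent_t evStart, evStop;
    cudaEventCreate(&evStart);
    cudaEventCreate(&evStop);

    for (int nStride = 1; nStride <= nMaxStride; nStride *= 2)
    {
        cudaEventRecord(evStart);
        strideRead<<<nBlocks, nThreadsPerBlock>>>(d_out, nStride, nElemsPerThread);
        cudaEventRecord(evStop);
        cudaEventSynchronize(evStop);
        float fMs = 0.0f;
        cudaEventElapsedTime(&fMs, evStart, evStop);
        printf("stride %2d : %.3f ms\n", nStride, fMs);
    }

    cudaUnbindTexture(g_texBench);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```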