An algorithm for fast texture fetching

Dear all!
I have a question about texture fetching. As the programming guide notes, and as has been discussed often on this forum, the speed of tex1Dfetch inside a kernel depends strongly on the spatial locality of the fetches. In other words, to get maximal speed, all of a kernel's fetches must come from nearby locations in global memory.
In my case each kernel performs several fetches from a texture bound to a 1D global array that holds 3D geometrical points, and all of a kernel's fetches are for points that are geometrical neighbours. So my question is: is there an existing CPU- or GPU-implemented algorithm that renumbers 3D points optimally, in the sense that geometrical neighbours become neighbours in the corresponding 1D global array?
Thank you in advance.

If you want to optimally fetch elements that have a 3D geometrical arrangement, you should use 3D textures, not 1D textures. To use 3D textures you first copy your 1D linear global array to a 3D CUDA array, then bind a 3D texture to that CUDA array and use tex3Dfetch inside kernels. It’s all described in more detail in the programming guide.

Copying 1D memory to a 3D CUDA array renumbers the indices.

Also, 3D textures were introduced in CUDA 2.0.

Of course. But only if the data is perfectly arranged on a 3D grid. If the points in the data set are somewhat randomly distributed, 3D textures are out of the question.

Yes. You can reorder the points based on a space filling curve. The Hilbert curve works best in my testing, but the much simpler Morton-order curve is a close second.

Our paper discusses the algorithm we use and contains several references to other methods. You can find the reference or download the preprint from

From an implementation standpoint, this is definitely correct.

To answer the question of what the algorithm is, i.e. what the 3D texture is doing for you behind your back: it uses a space filling curve, probably similar to this one:

Thank you for the answers.
Actually MisterAnderson42 is completely right. :thumbup: 3D textures definitely aren’t my case, though I had considered them. The reason is that my 3D points are just the output of a 3D mesh generator, which produces a 1D array of arbitrarily distributed mesh nodes.
I read about the space filling Z-curve on Wikipedia as one possible approach, but googling turned up several links to more sophisticated methods. For example, it is easy to show that my question is equivalent to the well-known problem of sparse matrix bandwidth reduction, and the best of the suggested approaches is based on the reverse Cuthill-McKee algorithm. I also came across an article in which the authors solve this problem by simply searching over different numberings on a Cray cluster - such brutal preprocessing B)
Anyway, my question was about an already implemented CPU or GPU renumbering algorithm (available code would be the best option). As far as I understand, I will have to reinvent the wheel myself. :(

It’s actually pretty easy, assuming your points are guaranteed to be bounded within a reasonably sized box. I just assign the points to a grid and then walk through the grid along the space filling curve, renumbering particles as I go. This is done on the CPU, and the time it takes is absolutely tiny, given that I precompute the space filling curve and reuse it many times for the same grid.

If your points aren’t bounded by a nice box, check the references from our paper. One of them has a nice algorithm for the Hilbert curve to recursively identify where each point belongs without ever having to generate the complete curve in memory.

Thanks, I’ll look through them.
An approach based on a space filling curve seems realistic.
Last question. MisterAnderson42, I read in one of the forum’s posts that your renumbering gave you a resulting speedup of almost 5 times! :w00t: Can mere renumbering of the nodes really provide such a drastic speedup?
P.S. In case someone is interested in bandwidth reduction algorithms that can be applied for optimal renumbering, I can send the articles I found via email.

Yep! Although the speedup depends on the size of the system. Here is a quick benchmark run I just did with 64,000 particles; the only difference between the two runs is whether the data sort is applied.

joaander@pion ~/hoomd/bin/benchmarks $ ./force_compute_bmark --half_nlist=0 -N 64000 -q --sort=1

0.003538443 s/step

joaander@pion ~/hoomd/bin/benchmarks $ ./force_compute_bmark --half_nlist=0 -N 64000 -q --sort=0

0.01123575 s/step

That is 3.2x faster with the sort.

With 300,000 particles, the speedup is greater.

joaander@pion ~/hoomd/bin/benchmarks $ ./force_compute_bmark --half_nlist=0 -N 300000 -q --sort=1

0.01508135 s/step

joaander@pion ~/hoomd/bin/benchmarks $ ./force_compute_bmark --half_nlist=0 -N 300000 -q --sort=0

0.06657493 s/step

That is 4.41x faster (hmmm… my 5x number must have come from an even larger system size).

This benchmark runs a kernel where each thread accesses ~90 nearby particles to sum a force. In the “unsorted” benchmarks the particles are placed by a random number generator, so the reads for a single thread are likely to be distributed evenly across the entire array. The benchmarks with sort=1 have the Hilbert curve sort applied to the same randomly placed particles.

YMMV of course, depending on how many neighbors you access in each thread and how “random” the data was to begin with. In my application the random data is completely representative of a real simulation, since the particles diffuse over time (which is why I need to reapply the sort periodically to keep them in check).

MisterAnderson42, I know this is an old post, but I have just started working on a problem that is applicable to this (old) thread. When you say that if the data is randomly distributed then 3D textures are out of the question, is that because there will be a lot of cache thrashing, since the space filling curve will not be able to map all the data points within the given cache size? I have a problem with good 2D (x, y) spatial locality but poor locality in the third dimension (z). I have been trying to use 3D textures and surfaces to access the data, but I have seen horrible timing results, which I assume is because of the poor locality in the z dimension. Does that make sense?


If you have poor locality in the third dimension, you should take a look at layered textures in the CUDA Programming Guide. It isn’t totally clear about the storage layout, but it sounds like they do a 2D Morton ordering in x and y, but not in z. Some quick experiments should show whether this helps.

However, if there is poor locality in any one of the coordinates (e.g. z), there is poor locality overall. No kind of cache will help there unless the task is to return data at an arbitrary z coordinate, or locality in z can be established by some other means (reordering, sorting, …).

To clarify the apparent confusion in this dinosaur post: 3D textures can only be used to access data on a lattice, v[i,j,k]. The kind of application I was referring to above, where 3D textures don’t work, stores data as float3 p[i], where p[i] is a position vector to a point lying inside a box. 3D textures simply cannot be used to access p by x, y, z coordinates; you need to use tex1Dfetch and order the data along a space filling curve.

I could not find tex3Dfetch.

We are currently forcing a three dimensional volume of data into one dimension and using tex1Dfetch. On compute capability 3.5 GPUs the nvvp profiler says we are only getting 90% cache hits. I worry that with tex1Dfetch, adjacent threads access data with similar z and y at very different positions in the 1D array. The data are float4; the indexes are integers.
What are the downsides of using tex1D vs. tex1Dfetch? And would these carry over to tex3D()?

Any help or advice would be most welcome. Thank you.