Global versus texture memory - no speedup, I'm not getting any benefits :(


I have a question about global and texture memory. But first, let me explain the architecture of my application. I have a (binary) tree which I traverse on the GPU using CUDA. Threads traverse the tree independently, but there is a strong assumption that the common traversal subsequence will be long: most of the time, the threads will traverse the same nodes while computing slightly different things.

The data are referenced from the leaves but stored in a separate chunk of memory. The tree structure mixes floats and ints, so I use two texture references to access it. The data referenced from the leaves are floats.
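
For reference, the setup looks roughly like this (the names and the exact per-node layout are made up for illustration; my real structure differs, but the idea of binding two texture references to the same buffer is the same):

```cuda
// Two 1D texture references bound to the same node buffer, one viewing
// it as ints (child indices) and one as floats (split values).
// The 3-words-per-node layout here is just an illustration.
texture<int,   1, cudaReadModeElementType> tex_nodes_i;
texture<float, 1, cudaReadModeElementType> tex_nodes_f;

__device__ int step(int node, float key)
{
    float split = tex1Dfetch(tex_nodes_f, 3 * node + 2);
    return (key < split) ? tex1Dfetch(tex_nodes_i, 3 * node + 0)   // left child
                         : tex1Dfetch(tex_nodes_i, 3 * node + 1);  // right child
}

// Host side: both references are bound to the same device buffer.
// cudaBindTexture(0, tex_nodes_i, d_nodes, bytes);
// cudaBindTexture(0, tex_nodes_f, d_nodes, bytes);
```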

I don’t change the data, so I thought that using texture memory would speed up the application. However, I’m getting better results when the data are in global memory.

The data referenced from the leaves always come in multiples of 9 floats (3 coordinates for each of 3 vertices), but I’m accessing them one float at a time. Maybe I’m wrong, but I suppose float3 can’t be used with textures, and since I get the data from the CPU as triples of floats, I don’t have any other choice. From what I understand of the documentation, float4 would be used implicitly instead of float3. Please correct me if I’m wrong.
I have also thought of accessing the data through multiple references of type float1, float2 and float4, but that is the next step.

Well, the main question is: what can cause a slowdown when using texture memory instead of global memory (tex1Dfetch(ref_to_g_mem, pos) instead of g_mem[pos])? I’m somehow not getting the benefits mentioned in the documentation :(. The pattern of memory accesses is always the same.

Thanks for any hint

If your global reads are coalesced, then you won’t find great benefit from texture memory – this is what people say.

That said,
I had an application in which only one thread was accessing global memory in a loop; the other threads just waited. I replaced the global fetches with texture fetches and did NOT find a great improvement. It saved only 1 ms or 2 ms when processing 1000 options.

However, one should also know that if there are other latencies (register hazard latencies, low CUDA occupancy, etc.) dominating your global memory latency, then you won’t SEE the performance improvement from texture fetches. I think my application suffers from such a problem (I can’t change it; see my post on dynamic block scheduling).

Check your application.

Thanks for the info. I probably need all the threads to access the memory, since I don’t know whether all threads will use the same data (they can traverse different parts of the tree near the bottom). So I guess all the threads must read the data for themselves. However, in the upper part of the tree, the threads should always access the same int/float item (at a random address) in global/texture memory, so I think the reads are not (and can’t be) coalesced. I know that in the case of shared memory, the data can be fetched in a single memory access if all the threads read from the same position. This probably does not apply to global/texture memory (or the texture cache). But I still don’t see why the texture cache does not deliver any benefit, even though it should. I think I’m still missing something :)

I will probably try to use shared memory and implement my own cache.
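
As a first attempt, I imagine something like preloading the top of the tree into shared memory before the traversal diverges (a rough sketch with made-up names and a made-up flat node layout; TOP_WORDS would have to fit the shared memory budget):

```cuda
#define TOP_WORDS 1024  // words of the upper tree to cache (assumption)

__global__ void traverse(const float *g_nodes, int n_words)
{
    __shared__ float s_top[TOP_WORDS];

    // Cooperative, coalesced preload of the upper tree levels.
    // Synchronizing is safe here because no thread has finished yet.
    for (int i = threadIdx.x; i < TOP_WORDS && i < n_words; i += blockDim.x)
        s_top[i] = g_nodes[i];
    __syncthreads();

    // During the traversal itself, small offsets hit the shared copy
    // and only the lower levels go to global/texture memory, e.g.:
    // float w = (idx < TOP_WORDS) ? s_top[idx] : g_nodes[idx];
}
```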

I’m still a little confused. If you could give me any advice, how to manage it, I would really appreciate it.

Thanks a lot

So you are making 9 calls to tex1Dfetch to read all the data? If I had to guess, that would be the cause of your slowdown. Textures are cached, but each read carries a little overhead to set up the texture fetch, not to mention that 9 independent fetches have to put a bit of a burden on the tiny cache.

I would organize your reads into two float4 texture reads and one float read to get the 9 values with minimal overhead.
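
Something along these lines (a sketch only; it assumes the triangle’s first float sits at a 16-byte boundary, i.e. `first` is a multiple of 4, which your layout may not guarantee):

```cuda
// Two float4 fetches plus one float fetch per triangle. The same
// buffer is bound twice, once as float4 and once as float.
texture<float4, 1, cudaReadModeElementType> tex_tri4;
texture<float,  1, cudaReadModeElementType> tex_tri1;

__device__ void load_triangle(int first, float v[9])
{
    float4 a = tex1Dfetch(tex_tri4, first / 4);      // floats 0..3
    float4 b = tex1Dfetch(tex_tri4, first / 4 + 1);  // floats 4..7
    v[0] = a.x; v[1] = a.y; v[2] = a.z; v[3] = a.w;
    v[4] = b.x; v[5] = b.y; v[6] = b.z; v[7] = b.w;
    v[8] = tex1Dfetch(tex_tri1, first + 8);          // float 8
}
```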

Thanks, I was thinking about that, but there are some other problems. I’m storing triangles, so 9 floats per triangle. They sit in one huge chunk of memory, organized into sets. The size of a set can vary, and each set is aligned to start at an address that is a multiple of 8 bytes (address mod 8 = 0). This restriction comes from the CPU side and is unavoidable, so there can be 4 B (1 float) gaps between the sets. The triangles are therefore not aligned for easy float4 reads, but it should still be manageable, and I will try it to see whether I get better results. I always know the offset of the first triangle in memory and the count. The advantage is that I always read approximately 16 triangles, so I can read float4s and assign the correct values to the correct variables. A little pain to debug, but it should work.
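
Concretely, I imagine rounding the start of a set down to a float4 boundary and discarding the leading floats (a sketch with made-up names; `out` must hold `count` floats):

```cuda
texture<float4, 1, cudaReadModeElementType> tex_tri4;

// Fetch `count` floats starting at float index `first` (which is only
// guaranteed 8-byte aligned) via float4 reads from an aligned-down base.
__device__ void load_floats(int first, int count, float *out)
{
    int base4 = first >> 2;            // round down to a float4 index
    int skip  = first & 3;             // leading floats to throw away
    int n4    = (skip + count + 3) >> 2;
    for (int i = 0; i < n4; ++i) {
        float4 q = tex1Dfetch(tex_tri4, base4 + i);
        float  f[4] = { q.x, q.y, q.z, q.w };
        for (int k = 0; k < 4; ++k) {
            int j = 4 * i + k - skip;  // index relative to the set start
            if (j >= 0 && j < count)
                out[j] = f[k];
        }
    }
}
```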

Another question: each thread in a block can potentially read a different set of triangles, but most of the time they read the same set. So it would be great if I could somehow share the triangles, but only in some cases. Also, I can’t use __syncthreads(), because the threads are traversing a tree and some of them can be finished while the others are still traversing and fetching triangles. Do you think there is a way to manage this? It would be great if the threads could cooperate, because right now they all access global/texture memory and most of the time they fetch the same values. However, not all the time… and that is the problem. I think I’m just dreaming :)
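
One direction I have been wondering about (unmeasured, heavily hedged): within a single warp the 32 threads execute in lock-step on current hardware, so a warp can share data through shared memory without __syncthreads(). One lane could publish the position it wants, the warp could check with __all() (compute capability 1.2+) whether everyone wants the same one, and then share a single fetch, falling back to per-thread fetches otherwise. All names here are illustrative:

```cuda
// Warp-level sharing without __syncthreads(): a sketch, not measured.
// Relies on warp-synchronous execution; shared arrays must be volatile.
#define WARPS_PER_BLOCK 8

texture<float, 1, cudaReadModeElementType> tex_data;

__device__ float fetch_shared(int myPos)
{
    __shared__ volatile int   s_pos[WARPS_PER_BLOCK];
    __shared__ volatile float s_val[WARPS_PER_BLOCK];

    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;

    if (lane == 0)                      // one lane publishes its position
        s_pos[warp] = myPos;

    if (__all(myPos == s_pos[warp])) {  // whole warp wants the same value?
        if (lane == 0)
            s_val[warp] = tex1Dfetch(tex_data, myPos);
        return s_val[warp];             // one fetch, shared by the warp
    }
    return tex1Dfetch(tex_data, myPos); // diverged: everyone fetches
}
```

Whether this beats letting the texture cache absorb the duplicate reads is exactly the kind of thing you would have to measure.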

Anyway, thanks for the reply. If you have an idea, let me know.