Global versus texture memory - no speedup, I'm not getting any benefits :(


I have a question about global and texture memory. But first, let me explain the architecture of my application. I have a (binary) tree which I traverse on the GPU using CUDA. Threads traverse the tree independently, but there is a strong assumption that the common traversal subsequence will be long: most of the time, the threads will traverse the same nodes while computing slightly different things.

The data are referenced from the leaves but stored in a separate chunk of memory. The tree structure mixes floats and ints, so I use two texture references to access it. The data referenced from the leaves are floats.
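
For reference, the setup looks roughly like this (the names and the exact per-node layout are made up for illustration; my real structure differs, but the idea of binding two texture references to the same buffer is the same):

```cuda
// Two 1D texture references bound to the same node buffer, one viewing
// it as ints (child indices) and one as floats (split values).
// The 3-words-per-node layout here is just an illustration.
texture<int,   1, cudaReadModeElementType> tex_nodes_i;
texture<float, 1, cudaReadModeElementType> tex_nodes_f;

__device__ int step(int node, float key)
{
    float split = tex1Dfetch(tex_nodes_f, 3 * node + 2);
    return (key < split) ? tex1Dfetch(tex_nodes_i, 3 * node + 0)   // left child
                         : tex1Dfetch(tex_nodes_i, 3 * node + 1);  // right child
}

// Host side: both references are bound to the same device buffer.
// cudaBindTexture(0, tex_nodes_i, d_nodes, bytes);
// cudaBindTexture(0, tex_nodes_f, d_nodes, bytes);
```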

I don’t change the data, so I thought that using texture memory would speed up the application. However, I’m getting better results when the data are in global memory.

The data referenced from the leaves always come in multiples of 9 floats (3 coordinates for each of 3 vertices), but I’m accessing them one float at a time. Maybe I’m wrong, but I suppose float3 can’t be used with textures, and since I get the data from the CPU as triples of floats, I don’t have any other choice. From what I understand of the documentation, float4 would be used implicitly instead of float3. Please correct me if I’m wrong.
I have also thought of accessing the data through multiple references of type float1, float2 and float4, but that is the next step.

Well, the main question is: what can cause a slowdown when using texture memory instead of global memory (tex1Dfetch(ref_to_g_mem, pos) instead of g_mem[pos])? I’m somehow not getting the benefits mentioned in the documentation :(. The pattern of memory accesses is always the same.

Thanks for any hint

If your global reads are coalesced, then you won’t find great benefit from texture memory – this is what people say.

That said,
I had an application in which only one thread was accessing global memory in a loop; the other threads just waited. I replaced the global fetches with texture fetches and did NOT find a great improvement. It saved only 1 ms or 2 ms when processing 1000 options.

However, one should also know that if there are other latencies (register hazard latencies, low CUDA occupancy, etc.) dominating your global memory latency, then you won’t SEE the performance improvement from texture fetches. I think my application suffers from such a problem (I can’t change it; see my post on dynamic block scheduling).

Check your application.

Thanks for the info. I probably need all the threads to access the memory, since I don’t know whether all threads will use the same data (they can traverse different parts of the tree near the bottom). So I guess all the threads must read the data for themselves. However, in the upper part of the tree, the threads should always access the same int/float item (at a random address) in global/texture memory, so I think the reads are not (and can’t be) coalesced. I know that in the case of shared memory, the data can be fetched in a single memory access if all the threads read from the same position. This probably does not apply to global/texture memory (or the texture cache). But I still don’t see why the texture cache does not deliver any benefit, even though it should. I think I’m still missing something :)

I will probably try to use shared memory and implement my own cache.
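
As a first attempt, I imagine something like preloading the top of the tree into shared memory before the traversal diverges (a rough sketch with made-up names and a made-up flat node layout; TOP_WORDS would have to fit the shared memory budget):

```cuda
#define TOP_WORDS 1024  // words of the upper tree to cache (assumption)

__global__ void traverse(const float *g_nodes, int n_words)
{
    __shared__ float s_top[TOP_WORDS];

    // Cooperative, coalesced preload of the upper tree levels.
    // Synchronizing is safe here because no thread has finished yet.
    for (int i = threadIdx.x; i < TOP_WORDS && i < n_words; i += blockDim.x)
        s_top[i] = g_nodes[i];
    __syncthreads();

    // During the traversal itself, small offsets hit the shared copy
    // and only the lower levels go to global/texture memory, e.g.:
    // float w = (idx < TOP_WORDS) ? s_top[idx] : g_nodes[idx];
}
```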

I’m still a little confused. If you could give me any advice, how to manage it, I would really appreciate it.

Thanks a lot

So you are making 9 calls to tex1Dfetch to read all the data? If I had to guess, that would be the cause of your slowdown. Textures are cached, but each read carries a little overhead to set up the texture fetch, not to mention that 9 independent fetches have to put a bit of a burden on the tiny cache.

I would organize your reads into two float4 texture reads and one float read to get the 9 values with minimal overhead.
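
Something along these lines (a sketch only; it assumes the triangle’s first float sits at a 16-byte boundary, i.e. `first` is a multiple of 4, which your layout may not guarantee):

```cuda
// Two float4 fetches plus one float fetch per triangle. The same
// buffer is bound twice, once as float4 and once as float.
texture<float4, 1, cudaReadModeElementType> tex_tri4;
texture<float,  1, cudaReadModeElementType> tex_tri1;

__device__ void load_triangle(int first, float v[9])
{
    float4 a = tex1Dfetch(tex_tri4, first / 4);      // floats 0..3
    float4 b = tex1Dfetch(tex_tri4, first / 4 + 1);  // floats 4..7
    v[0] = a.x; v[1] = a.y; v[2] = a.z; v[3] = a.w;
    v[4] = b.x; v[5] = b.y; v[6] = b.z; v[7] = b.w;
    v[8] = tex1Dfetch(tex_tri1, first + 8);          // float 8
}
```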

Thanks, I was thinking about that, but there are some other problems. I’m storing triangles, so 9 floats per triangle. They sit in one huge chunk of memory, organized into sets. The size of a set can vary, and each set is aligned to start at an address that is a multiple of 8 bytes (address mod 8 = 0). This restriction comes from the CPU side and is unavoidable, so there can be 4 B (1 float) gaps between the sets. The triangles are therefore not aligned for easy float4 reads, but it should still be manageable, and I will try it to see whether I get better results. I always know the offset of the first triangle in memory and the count. The advantage is that I always read approximately 16 triangles, so I can read float4s and assign the correct values to the correct variables. A little pain to debug, but it should work.
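
Concretely, I imagine rounding the start of a set down to a float4 boundary and discarding the leading floats (a sketch with made-up names; `out` must hold `count` floats):

```cuda
texture<float4, 1, cudaReadModeElementType> tex_tri4;

// Fetch `count` floats starting at float index `first` (which is only
// guaranteed 8-byte aligned) via float4 reads from an aligned-down base.
__device__ void load_floats(int first, int count, float *out)
{
    int base4 = first >> 2;            // round down to a float4 index
    int skip  = first & 3;             // leading floats to throw away
    int n4    = (skip + count + 3) >> 2;
    for (int i = 0; i < n4; ++i) {
        float4 q = tex1Dfetch(tex_tri4, base4 + i);
        float  f[4] = { q.x, q.y, q.z, q.w };
        for (int k = 0; k < 4; ++k) {
            int j = 4 * i + k - skip;  // index relative to the set start
            if (j >= 0 && j < count)
                out[j] = f[k];
        }
    }
}
```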

Another question: each thread in a block can potentially read a different set of triangles, but most of the time they read the same set. So it would be great if I could somehow share the triangles, but only in some cases. Also, I can’t use __syncthreads(), because the threads are traversing a tree and some of them can be finished while the others are still traversing and fetching triangles. Do you think there is a way to manage this? It would be great if the threads could cooperate, because right now they all access global/texture memory and most of the time they fetch the same values. However, not all the time… and that is the problem. I think I’m just dreaming :)
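
One direction I have been wondering about (unmeasured, heavily hedged): within a single warp the 32 threads execute in lock-step on current hardware, so a warp can share data through shared memory without __syncthreads(). One lane could publish the position it wants, the warp could check with __all() (compute capability 1.2+) whether everyone wants the same one, and then share a single fetch, falling back to per-thread fetches otherwise. All names here are illustrative:

```cuda
// Warp-level sharing without __syncthreads(): a sketch, not measured.
// Relies on warp-synchronous execution; shared arrays must be volatile.
#define WARPS_PER_BLOCK 8

texture<float, 1, cudaReadModeElementType> tex_data;

__device__ float fetch_shared(int myPos)
{
    __shared__ volatile int   s_pos[WARPS_PER_BLOCK];
    __shared__ volatile float s_val[WARPS_PER_BLOCK];

    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;

    if (lane == 0)                      // one lane publishes its position
        s_pos[warp] = myPos;

    if (__all(myPos == s_pos[warp])) {  // whole warp wants the same value?
        if (lane == 0)
            s_val[warp] = tex1Dfetch(tex_data, myPos);
        return s_val[warp];             // one fetch, shared by the warp
    }
    return tex1Dfetch(tex_data, myPos); // diverged: everyone fetches
}
```

Whether this beats letting the texture cache absorb the duplicate reads is exactly the kind of thing you would have to measure.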

Anyway, thanks for the reply. If you have an idea, let me know.