I have a question about global and texture memory, but first let me explain the architecture of my application. I have a binary tree which I traverse on the GPU using CUDA. Threads traverse the tree independently, but I strongly expect the common traversal subsequence to be long: most of the time, threads will visit the same nodes while computing slightly different things.
The data are referenced from the leaves but stored in a separate chunk of memory. The tree structure is a mix of floats and ints, so I use two texture references to access it. The data referenced from the leaves are floats.
I don’t modify the data, so I thought that using texture memory would speed up the application. However, I’m getting better results when the data are in global memory.
The data referenced from the leaves always come in multiples of 9 floats (3 coordinates for each of 3 vertices), but I’m accessing them one float at a time. Maybe I’m wrong, but I suppose float3 can’t be used with textures, and since I get the data from the CPU as triples of floats, I don’t have any other choice. From what I understand of the documentation, float4 would be used implicitly instead of float3. Please correct me if I’m wrong.
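To make the float3 versus float4 point concrete: as far as the CUDA documentation describes it, texture element types are limited to 1-, 2- and 4-component vectors, so the padding to float4 would have to be done explicitly on the host rather than happening implicitly. Below is a minimal sketch of such a repacking step. The helper name padVerticesToFloat4 is hypothetical and not part of CUDA; only float4 and make_float4 come from the CUDA headers.

```cuda
#include <cuda_runtime.h>  // float4, make_float4
#include <cstddef>
#include <vector>

// Hypothetical helper (not a CUDA API): repack tightly packed xyz vertices
// (3 floats per vertex, so 9 floats per triangle) into float4 elements.
// The w lane is unused padding and is set to 0.
std::vector<float4> padVerticesToFloat4(const std::vector<float>& xyz)
{
    std::vector<float4> padded;
    padded.reserve(xyz.size() / 3);
    // Any incomplete trailing vertex (fewer than 3 floats left) is ignored.
    for (std::size_t i = 0; i + 2 < xyz.size(); i += 3) {
        padded.push_back(make_float4(xyz[i], xyz[i + 1], xyz[i + 2], 0.0f));
    }
    return padded;
}
```

With this layout a triangle takes 3 float4 fetches instead of 9 single-float fetches, at the cost of storing 12 floats per triangle instead of 9. Whether that trade is a win for this tree traversal would have to be measured.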
I also thought of accessing them through multiple references of type float1, float2 and float4, but that is the next step.
So the main question is: what can cause the slowdown when using texture memory instead of global memory (tex1Dfetch(ref_to_g_mem, pos) instead of g_mem[pos])? I’m somehow not getting the benefits mentioned in the documentation :(. The memory access pattern is always the same.
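For reference, here is a minimal, self-contained sketch of the two access paths being compared, with event-based timing. It rests on several assumptions. It uses the texture object API (cudaTextureObject_t) rather than the texture references from the question, since texture references have been removed from recent CUDA toolkits. The kernel names, runComparison, and the buffer size are made up for illustration. The access pattern is a simple coalesced one-float-per-thread read, not the tree traversal described above, so the timings it prints say nothing definitive about the real application.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>
#include <vector>

// Plain global-memory read: out[pos] = g_mem[pos].
__global__ void readGlobal(const float* g_mem, float* out, int n)
{
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    if (pos < n) out[pos] = g_mem[pos];
}

// Same read, routed through the texture path over the same linear buffer.
__global__ void readTexture(cudaTextureObject_t tex, float* out, int n)
{
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    if (pos < n) out[pos] = tex1Dfetch<float>(tex, pos);
}

// Illustrative driver: returns 0 if both paths return the input data.
int runComparison()
{
    const int n = 1 << 20;  // arbitrary size chosen for this sketch
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n), viaGlobal(n), viaTexture(n);
    for (int i = 0; i < n; ++i) host[i] = 0.5f * i;

    float *d_data = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_data); return 1; }
    cudaMemcpy(d_data, host.data(), bytes, cudaMemcpyHostToDevice);

    // Describe the linear buffer as a 1D texture of single floats.
    cudaResourceDesc resDesc;
    std::memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_data;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = bytes;

    cudaTextureDesc texDesc;
    std::memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    if (cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr) != cudaSuccess) {
        cudaFree(d_data); cudaFree(d_out);
        return 1;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msGlobal = 0.0f, msTexture = 0.0f;

    // Untimed warm-up launch so the first timed kernel does not absorb startup cost.
    readGlobal<<<blocks, threads>>>(d_data, d_out, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    readGlobal<<<blocks, threads>>>(d_data, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msGlobal, start, stop);
    cudaMemcpy(viaGlobal.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaEventRecord(start);
    readTexture<<<blocks, threads>>>(tex, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msTexture, start, stop);
    cudaMemcpy(viaTexture.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    std::printf("global: %.3f ms, texture: %.3f ms\n", msGlobal, msTexture);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaDestroyTextureObject(tex);
    cudaFree(d_data);
    cudaFree(d_out);

    return (viaGlobal == host && viaTexture == host) ? 0 : 2;
}
```

Calling runComparison() from main and building with nvcc should print one timing per path. A single timed launch is noisy, so averaging over many launches, and swapping in the actual traversal access pattern, would give a more meaningful comparison.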
Thanks for any hints.