Kernel optimization problem: texture memory

Hello! I am in need of a little insight.

I am trying to implement my own version of MUMmerGPU. Everything works so far, and I have already optimized some memory accesses to be coalesced and reduced warp divergence to a minimum (I believe).

The last step was to optimize the accesses to the suffix tree and the reference sequence (a string), which are random by nature and impossible to coalesce, but do have some locality. From various sources, including the MUMmerGPU paper, I understand that texture memory can be a great performance booster. However, when I try to use textures in my kernel, the execution time increases quite a bit.

After digging through the PTX files and the profiler logs, I noticed that whenever I use textures my kernel starts to load data from local memory. Is this behavior normal, and would it explain the performance hit? By the way, I am binding the texture to linear memory, not using CUDA arrays yet, although I might experiment with them or with 2D textures to see whether the problem remains.

Main loop of the kernel, reading from a char array (text) in global memory:

while (text[edgeStart + edgeOffset] == testChar) {



Same main loop using a texture:

while (tex1Dfetch(refTex, edgeStart + edgeOffset) == testChar) {
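
To make the comparison easy to reproduce, here is a minimal, self-contained sketch of the two variants side by side. It is not the actual kernel from this post. It assumes the legacy texture reference API that `tex1Dfetch(refTex, ...)` implies, which is only available in toolkits before CUDA 12. It also assumes the texture element type is `char`. The kernel name `matchEdge`, the `useTexture` flag, the bounds check, and the test string are all hypothetical and only there for illustration.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Legacy file-scope texture reference (assumption: char elements, raw reads).
texture<char, 1, cudaReadModeElementType> refTex;

// Hypothetical kernel: counts how many consecutive characters, starting at
// edgeStart, equal testChar. useTexture selects the read path being compared.
__global__ void matchEdge(const char* text, int edgeStart, int len,
                          char testChar, int useTexture, int* matched)
{
    int edgeOffset = 0;
    if (useTexture) {
        // Texture path: same loop condition as in the post, plus a bounds check.
        while (edgeStart + edgeOffset < len &&
               tex1Dfetch(refTex, edgeStart + edgeOffset) == testChar) {
            ++edgeOffset;
        }
    } else {
        // Global memory path.
        while (edgeStart + edgeOffset < len &&
               text[edgeStart + edgeOffset] == testChar) {
            ++edgeOffset;
        }
    }
    *matched = edgeOffset;
}

int main()
{
    const char ref[] = "AAAACGT";            // made-up reference sequence
    const int len = (int)strlen(ref);

    char* d_text = NULL;
    int* d_matched = NULL;
    cudaMalloc((void**)&d_text, len);
    cudaMalloc((void**)&d_matched, sizeof(int));
    cudaMemcpy(d_text, ref, len, cudaMemcpyHostToDevice);

    // Bind the texture reference to plain linear memory (no CUDA array).
    cudaBindTexture(NULL, refTex, d_text, len);

    for (int useTexture = 0; useTexture <= 1; ++useTexture) {
        matchEdge<<<1, 1>>>(d_text, 0, len, 'A', useTexture, d_matched);
        int matched = 0;
        cudaMemcpy(&matched, d_matched, sizeof(int), cudaMemcpyDeviceToHost);
        printf("%s path matched %d characters\n",
               useTexture ? "texture" : "global", matched);
    }

    cudaUnbindTexture(refTex);
    cudaFree(d_text);
    cudaFree(d_matched);
    return 0;
}
```

Compiling each variant with `nvcc -Xptxas -v` prints the register and local memory usage per kernel, which may help confirm whether the texture version really adds local memory traffic (for example through register spilling) or whether the local loads come from somewhere else.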



The simple difference in the code shown above doubles the execution time of the kernel, and I am at a loss as to why. Any ideas are very welcome. Am I missing something about how textures should be used?