Understanding Performance of 2D Texture memory accesses

I have a fairly large read-only array of single-precision floats (400 x 45000) that I am accessing through a 2-dimensional texture binding.
The array is partitioned in a column-cyclic manner over 45000 threads, and in the innermost loop each thread accesses the elements of its assigned column.
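Roughly, the access looks like the following sketch (simplified; the kernel and texture names are just placeholders):

```cuda
// Simplified sketch of the access pattern described above (placeholder names).
// The 400 x 45000 array is bound to a 2D texture; thread t walks down column t.
texture<float, 2, cudaReadModeElementType> dataTex;

__global__ void columnWalk(float *out, int nRows, int nCols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;    // one column per thread
    if (col >= nCols) return;

    float acc = 0.0f;
    for (int row = 0; row < nRows; ++row)
        acc += tex2D(dataTex, col + 0.5f, row + 0.5f);  // x = column, y = row
    out[col] = acc;
}
```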

When I ran the texture-memory version, performance dropped by 15% compared to reading the array directly from global memory.
To understand the degradation, I would like to know how the texture cache behaves for 2D accesses. Specifically:

  1. What is the policy for fetching data from memory into the texture cache? Is it row-major, column-major, or blocked?
  2. Is there an optimal way of traversing the 2D texture? I read somewhere that a space-filling-curve traversal may give the best performance. Is this accurate?
  3. What is the cache replacement policy? What is the cost of eviction?

Also, why is tex2D() omitted from the reference manual (2.1)? I spent a whole day trying to figure out the exact order of its parameters.

Any information would be really useful.

Thanks!
Rajesh

I believe it is a Z-curve. Simon Green once posted a link to a Wikipedia page explaining how it is done; this is the page: http://en.wikipedia.org/wiki/Z-order_(curve)
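In short, a Z-order (Morton) index interleaves the bits of the x and y coordinates, so texels that are close together in 2D end up close together in memory. Something along these lines (an illustration of the idea only; the exact layout the hardware uses is not documented and may differ):

```cuda
// Illustration of a Z-order (Morton) index: the bits of x and y are interleaved,
// so 2D-nearby texels map to nearby 1D addresses. The real hardware layout is
// undocumented and may differ in detail.
__host__ __device__ unsigned int mortonIndex(unsigned int x, unsigned int y)
{
    unsigned int idx = 0;
    for (unsigned int bit = 0; bit < 16; ++bit) {
        idx |= ((x >> bit) & 1u) << (2 * bit);      // x bits -> even positions
        idx |= ((y >> bit) & 1u) << (2 * bit + 1);  // y bits -> odd positions
    }
    return idx;
}
```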

Space-filling curves are the best way to store 2D or 3D data in a 1D memory layout (this is what tex2D is doing under the hood, as has already been pointed out).

The best way to access your 2D texture is to have the threads in each warp read nearby elements along a row of the texture. The next best way (only ~1-2% slower in microbenchmarks I've done) is to have the threads in each warp read nearby values going down a column.
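Concretely, a thread mapping like the sketch below (the texture reference and kernel names are placeholders) gives each warp a run of consecutive x coordinates on the same row:

```cuda
texture<float, 2, cudaReadModeElementType> texRef;

// Sketch: consecutive threads in a warp fetch consecutive texels along a row,
// so each warp's fetches stay within a small 2D region of the texture.
__global__ void rowCoherentRead(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive within a warp
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D(texRef, x + 0.5f, y + 0.5f);
}
```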

Denis and MisterAnderson42,

Thanks for the link and the thread-mapping strategy. I will try to reorganize my code to suit these constraints.

-regards,
Rajesh