I have a fairly large read-only array (400 x 45000, single-precision float) that I am accessing through a 2-D texture binding.
The array is partitioned in a column-cyclic manner over 45000 threads. In the innermost loop, each thread accesses the elements of its assigned column.
When I ran the code with texture-based accesses, performance was 15% worse than when the same array was read directly from global memory.
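The two access patterns being compared can be sketched roughly as follows. This is an illustration under assumptions, not the actual kernel from this post: it assumes the array is stored row-major as 400 rows by 45000 columns, uses a per-column sum as a stand-in for the real inner loop, and all names (`columnSumGlobal`, `columnSumTexture`, `compareGlobalAndTexture`) are hypothetical. It also uses the texture object API, which is newer than CUDA 2.1; in the older texture reference API the call is `tex2D(texRef, x, y)`, with the same coordinate order.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Sizes taken from the post: 400 rows, 45000 columns (assumed row-major).
constexpr int kRows = 400;
constexpr int kCols = 45000;

// Global-memory variant: thread `col` walks down its own column, so
// neighbouring threads read neighbouring addresses on every iteration.
__global__ void columnSumGlobal(const float* a, float* out) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= kCols) return;
    float sum = 0.0f;
    for (int row = 0; row < kRows; ++row) sum += a[row * kCols + col];
    out[col] = sum;
}

// Texture variant: tex2D takes x (column) first, then y (row).
// The +0.5f addresses the centre of the texel.
__global__ void columnSumTexture(cudaTextureObject_t tex, float* out) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= kCols) return;
    float sum = 0.0f;
    for (int row = 0; row < kRows; ++row)
        sum += tex2D<float>(tex, col + 0.5f, row + 0.5f);
    out[col] = sum;
}

// Runs both kernels on the same data and returns the number of columns
// whose sums differ. Error checking is omitted for brevity.
int compareGlobalAndTexture() {
    const size_t count = size_t(kRows) * kCols;
    std::vector<float> host(count);
    for (int r = 0; r < kRows; ++r)
        for (int c = 0; c < kCols; ++c)
            host[size_t(r) * kCols + c] = float((r + c) % 7);

    float *dA, *dOutG, *dOutT;
    cudaMalloc(&dA, count * sizeof(float));
    cudaMalloc(&dOutG, kCols * sizeof(float));
    cudaMalloc(&dOutT, kCols * sizeof(float));
    cudaMemcpy(dA, host.data(), count * sizeof(float), cudaMemcpyHostToDevice);

    // Copy the same data into a CUDA array and wrap it in a texture object.
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &fmt, kCols, kRows);  // width = columns, height = rows
    cudaMemcpy2DToArray(arr, 0, 0, host.data(), kCols * sizeof(float),
                        kCols * sizeof(float), kRows, cudaMemcpyHostToDevice);
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;
    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;     // no interpolation
    td.readMode = cudaReadModeElementType;   // return raw floats
    td.normalizedCoords = 0;                 // address by texel index
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    const int threads = 256;
    const int blocks = (kCols + threads - 1) / threads;
    columnSumGlobal<<<blocks, threads>>>(dA, dOutG);
    columnSumTexture<<<blocks, threads>>>(tex, dOutT);

    std::vector<float> g(kCols), t(kCols);
    cudaMemcpy(g.data(), dOutG, kCols * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(t.data(), dOutT, kCols * sizeof(float), cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int c = 0; c < kCols; ++c)
        if (g[c] != t[c]) ++mismatches;

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(dA);
    cudaFree(dOutG);
    cudaFree(dOutT);
    return mismatches;
}
```

Calling `compareGlobalAndTexture()` from `main` checks that both paths read the same values; wrapping each kernel launch in `cudaEvent` timing would be one way to reproduce the comparison described above.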
To understand the performance degradation, I would like to understand how the texture cache behaves for 2-D accesses. Specifically,
- What is the policy for fetching data from memory into the texture cache? Is it row-major, column-major, or blocked?
- Is there an optimal way of traversing the 2-D texture map? I read somewhere that traversing it along a space-filling curve may give the best performance. Is this accurate?
- What is the cache replacement policy? What is the cost of eviction?
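To make the space-filling-curve question concrete, here is a minimal sketch of one such curve, the Z-order (Morton) curve, which interleaves the bits of x and y so that texels close together in 2-D get nearby 1-D indices. Whether the texture hardware uses this or any similar layout is exactly what is being asked here, so this only illustrates what such a traversal order means. The function names `spreadBits16` and `mortonIndex` are hypothetical.

```cuda
#include <cstdint>

// Lets the same snippet compile as plain C++ for host-side experiments.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Spread the low 16 bits of v so that a zero bit sits between each pair
// of original bits (bit i moves to bit 2*i).
__host__ __device__ inline uint32_t spreadBits16(uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Z-order index: bits of x land in even positions, bits of y in odd ones.
// Valid for x, y < 65536.
__host__ __device__ inline uint32_t mortonIndex(uint32_t x, uint32_t y) {
    return spreadBits16(x) | (spreadBits16(y) << 1);
}
```

With this ordering, the 2 x 2 block of texels (0,0), (1,0), (0,1), (1,1) maps to indices 0, 1, 2, 3, so a traversal in index order visits small square tiles before moving on, instead of sweeping a full row or column.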
Also, why is tex2D() omitted from the reference manual (2.1)? I spent a whole day trying to figure out the exact order of the parameters.
Any information would be really useful.
Thanks!
Rajesh