Optimal 2D Locality Test for Texture Memory

Hi, all,

I am trying to understand 2D locality for texture memory use
and wondering when exactly optimal 2D locality is realized in terms of threads.

For example, take a look at the following serial code and two equivalent CUDA versions: one simulating 2D locality and the other 1D locality.

// serial code

float A[M][N];
float B[M][N];
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        A[i][j] = B[i][j];
    }
}

// CUDA 1 (2D locality simulation)

texture<float, 2, cudaReadModeElementType> texRef;
dim3 threads(16, 16);
dim3 grid(M/16, N/16);
__global__ void kernel(…) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    A[y * N + x] = tex2D(texRef, x, y);
}

// CUDA 2 (1D locality simulation)

texture<float, 2, cudaReadModeElementType> texRef;
dim3 threads(256, 1);
dim3 grid(M/256, N);
__global__ void kernel(…) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    A[y * N + x] = tex2D(texRef, x, y);
}

I tested these two simple programs but don’t see any performance difference.
Could you explain why, and in what case it could make a difference in performance?
Thanks,

Not sure if this is correct, but my understanding is this:

Texture cache (16 KB): elements neighboring an access point in all directions can be kept in the cache for more immediate access by other threads.
2D texture: neighbors in both dimensions are covered, so more irregular access patterns will benefit from the cache.
1D texture: only the front and back neighbors along a row can be cached. Very similar to global memory access.

Seems like your code is not fetching elements randomly (each thread reads exactly the element at its own (x, y)), so the access path is as fast as a regular coalesced global load in both cases.
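
For what it’s worth, here is the kind of access pattern where the 2D cache should start to matter. This is only a sketch (not benchmarked), and the kernel name and parameters are hypothetical; texRef is assumed to be bound to a 2D CUDA array holding the source image. Each warp samples along a rotated line, so the reads no longer walk down a single row of memory. A plain global load cannot be coalesced in that case, while the texture cache still captures the small (x, y) neighborhood each warp touches.

// Hypothetical example: sample a source image along a rotated grid.
texture<float, 2, cudaReadModeElementType> texRef;

__global__ void rotateKernel(float *out, int width, int height, float angle)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float s = sinf(angle), c = cosf(angle);
    // Rotate the sampling coordinate around the image center.
    float cx = x - width  * 0.5f;
    float cy = y - height * 0.5f;
    float u = c * cx - s * cy + width  * 0.5f;
    float v = s * cx + c * cy + height * 0.5f;

    // Neighboring threads read nearby but non-contiguous (u, v)
    // locations; the 2D texture cache can still serve them.
    out[y * width + x] = tex2D(texRef, u, v);
}

With your row-aligned copy kernel, by contrast, both block shapes produce fully coalesced reads, which is why you see no difference.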