Hi, all,

I am trying to understand 2D locality when using texture memory, and I'm wondering exactly when optimal 2D locality is realized in terms of the thread layout.

For example, take a look at the following serial code and two roughly equivalent CUDA versions: one laid out for 2D locality and the other for 1D locality.

// serial code
float A[M][N];
float B[M][N];

for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        A[i][j] = B[i][j];
    }
}

// CUDA 1 (2D locality simulation)
texture<float, 2, cudaReadModeElementType> texRef;

dim3 threads(16, 16);
dim3 grid(N/16, M/16);   // x covers the N columns, y covers the M rows

__global__ void kernel(float *A, …) {
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    A[y*N + x] = tex2D(texRef, x, y);
}

// CUDA 2 (1D locality simulation)
texture<float, 2, cudaReadModeElementType> texRef;

dim3 threads(256, 1);
dim3 grid(N/256, M);   // x covers the N columns, y covers the M rows

__global__ void kernel(float *A, …) {
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    A[y*N + x] = tex2D(texRef, x, y);
}

I tested these two versions but don't see any performance difference between them.
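For reference, this is roughly how I measured the kernels: a minimal sketch using cudaEvent timing. The kernel launch arguments here are placeholders standing in for whichever of the two configurations above is being tested, not my exact test harness.

```cuda
#include <cuda_runtime.h>

// Times one launch of the kernel under test and returns elapsed milliseconds.
// `grid`/`threads` are whichever configuration (CUDA 1 or CUDA 2) is measured;
// `dA` is the device output array, bound texture setup is assumed done already.
float timeKernel(dim3 grid, dim3 threads, float *dA) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    kernel<<<grid, threads>>>(dA);   // placeholder for the kernel under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);      // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

In both cases the reported times come out essentially the same.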

Could you explain why, and in what cases it could make a difference in performance?

Thanks,