Hi, all,

I am trying to understand 2D locality when using texture memory, and I'm wondering exactly when optimal 2D locality is realized in terms of the thread layout.

For example, take a look at the following serial code and two roughly equivalent CUDA versions: one laid out for 2D locality and the other for 1D locality.

// serial code
float A[M][N];
float B[M][N];

for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        A[i][j] = B[i][j];
    }
}

// CUDA 1 (2D locality simulation)
texture<float, 2, cudaReadModeElementType> texRef;

dim3 threads(16, 16);
dim3 grid(N/16, M/16);   // x covers the N columns, y covers the M rows

__global__ void kernel(float *A, …) {
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    A[y*N + x] = tex2D(texRef, x, y);
}

// CUDA 2 (1D locality simulation)
texture<float, 2, cudaReadModeElementType> texRef;

dim3 threads(256, 1);
dim3 grid(N/256, M);   // x covers the N columns, y covers the M rows

__global__ void kernel(float *A, …) {
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    A[y*N + x] = tex2D(texRef, x, y);
}

I tested these two versions but don't see any performance difference between them.
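For reference, this is roughly how I measured the kernels: a minimal sketch using cudaEvent timing. The kernel launch arguments here are placeholders standing in for whichever of the two configurations above is being tested, not my exact test harness.

```cuda
#include <cuda_runtime.h>

// Times one launch of the kernel under test and returns elapsed milliseconds.
// `grid`/`threads` are whichever configuration (CUDA 1 or CUDA 2) is measured;
// `dA` is the device output array, bound texture setup is assumed done already.
float timeKernel(dim3 grid, dim3 threads, float *dA) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    kernel<<<grid, threads>>>(dA);   // placeholder for the kernel under test
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);      // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

In both cases the reported times come out essentially the same.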

Could you explain why, and in what cases it could make a difference in performance?

Thanks,