The matrix multiplication sample in the CUDA C++ Programming Guide stores the matrix data in 1D (linear) global memory:
    // Host code
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    // ...
However, according to the same guide, texture and surface memory is better suited to 2D data:
> The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture or surface addresses that are close together in 2D will achieve best performance. Also, it is designed for streaming fetches with a constant latency
So is it better to store 2D data such as a matrix in a 2D texture object (read with `tex2D`)? Is there an example?