Using Texture Memory for Matrix Data?

The matrix multiplication sample in the CUDA C++ Programming Guide stores the matrix data in linear (1D) global memory:

// Host code
// Load A and B to device memory
Matrix d_A;
d_A.width = A.width; d_A.height = A.height;
size_t size = A.width * A.height * sizeof(float);
cudaMalloc(&d_A.elements, size);
cudaMemcpy(d_A.elements, A.elements, size,
            cudaMemcpyHostToDevice);
// ...

However, according to the documentation, texture/surface memory is more efficient for 2D data:

The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture or surface addresses that are close together in 2D will achieve best performance. Also, it is designed for streaming fetches with a constant latency

So is it better to store 2D data such as a matrix in a 2D texture object and read it with tex2D()? Is there an example?

There are CUDA sample codes that demonstrate the use of 2D textures, such as this one.
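As a rough illustration of what reading matrix data through a 2D texture looks like, here is a minimal sketch. It is not taken from the CUDA samples. It assumes the matrix is copied into a CUDA array and bound to a texture object with point filtering and unnormalized coordinates; the names copyFromTexture and textureRoundTrip are hypothetical, and most error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Reads every element of a width x height float matrix through a 2D texture
// and writes it to linear global memory.
__global__ void copyFromTexture(cudaTextureObject_t tex, float* out,
                                int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        // With unnormalized coordinates and point filtering, the center of
        // texel (col, row) is at (col + 0.5, row + 0.5).
        out[row * width + col] = tex2D<float>(tex, col + 0.5f, row + 0.5f);
    }
}

// Hypothetical helper (an assumption for this sketch): returns true if the
// data read back through the texture matches the host matrix.
bool textureRoundTrip(int width, int height)
{
    const size_t count = static_cast<size_t>(width) * height;
    const size_t rowBytes = width * sizeof(float);
    std::vector<float> h_in(count), h_out(count, 0.0f);
    for (size_t i = 0; i < count; ++i) h_in[i] = static_cast<float>(i);

    // Back the texture with a 2D CUDA array holding the matrix.
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
    cudaArray_t array = nullptr;
    if (cudaMallocArray(&array, &channelDesc, width, height) != cudaSuccess)
        return false;
    cudaMemcpy2DToArray(array, 0, 0, h_in.data(), rowBytes, rowBytes, height,
                        cudaMemcpyHostToDevice);

    // Describe the resource (the array) and how it should be sampled.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = array;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;       // no interpolation
    texDesc.readMode = cudaReadModeElementType;     // return raw floats
    texDesc.normalizedCoords = 0;                   // index by texel position

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    float* d_out = nullptr;
    cudaMalloc(&d_out, count * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    copyFromTexture<<<grid, block>>>(tex, d_out, width, height);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, count * sizeof(float),
                                 cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(array);
    cudaFree(d_out);
    return err == cudaSuccess && h_in == h_out;
}
```

Calling textureRoundTrip(64, 48) from main should return true on a machine with a working CUDA device. The sketch only shows the mechanics of setting up and sampling a 2D texture; it says nothing about whether a texture-based matrix multiply would be faster.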

Whether texture memory provides any benefit cannot be answered with a simple yes or no. It will likely depend on the problem size, and may also depend on the GPU type.

If you want the fastest matrix-multiply performance, the usual recommendation is to use cuBLAS. Don’t write the code yourself.
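For reference, a single-precision multiply through the library might look like the sketch below. It assumes row-major matrices like the Programming Guide sample; since cuBLAS expects column-major storage, the sketch swaps the operands so that the library computes the transpose of C, which has the same memory layout as row-major C. The helper name gemmRowMajor is hypothetical, and error checking is kept to a minimum.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper (an assumption for this sketch):
// computes C (m x n) = A (m x k) * B (k x n), all stored row-major on the host.
bool gemmRowMajor(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int m, int n, int k)
{
    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return false;

    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    cudaMalloc(&d_A, sizeof(float) * m * k);
    cudaMalloc(&d_B, sizeof(float) * k * n);
    cudaMalloc(&d_C, sizeof(float) * m * n);
    cudaMemcpy(d_A, A.data(), sizeof(float) * m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B.data(), sizeof(float) * k * n, cudaMemcpyHostToDevice);

    const float alpha = 1.0f;
    const float beta = 0.0f;
    // A row-major matrix is the column-major transpose of itself, so asking
    // cuBLAS for C^T = B^T * A^T yields row-major C without any explicit
    // transposition. Leading dimensions are the row-major row lengths.
    cublasStatus_t status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                        n, m, k,
                                        &alpha,
                                        d_B, n,
                                        d_A, k,
                                        &beta,
                                        d_C, n);

    C.resize(static_cast<size_t>(m) * n);
    cudaError_t err = cudaMemcpy(C.data(), d_C, sizeof(float) * m * n,
                                 cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    cublasDestroy(handle);
    return status == CUBLAS_STATUS_SUCCESS && err == cudaSuccess;
}
```

The program has to be linked against cuBLAS (for example with nvcc ... -lcublas). For A = {1, 2, 3, 4} and B = {5, 6, 7, 8} as 2x2 row-major matrices, the result is {19, 22, 43, 50}.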

You can find other somewhat similar questions, with a bit of searching.