Matrix multiplication using texture


I’ve decided to learn how textures work with the simple matrix multiplication problem.
I’ve compared my results with the “matrixMul” example given in the SDK.
(Note: This matrixMul uses shared memory to handle non-coalesced reads and writes.)

With matrices of size 1000x5, using textures speeds up the multiplication by a factor of nearly 3:

Duration without texture : 151.78ms
Duration with texture : 54.79ms

But with matrices of size 1000x30, using textures slows down the multiplication:

Duration without texture : 252.21ms
Duration with texture : 320.64ms

I’d like to understand everything about textures. In my opinion, textures are not very well documented in the programming guide.
Can you give me some useful information about how textures work?

This is not a direct answer to your question, but I noticed in the CUBLAS source code that many of the functions implement both texture and non-texture versions. An if-statement selects the appropriate version based on the matrix/vector size. It might be educational to take a look at some of these functions to see how NVIDIA chooses between the two for maximum performance.

Actually, my question is related to your remark.

When do we have to use textures and when do we have to use global memory?

Even if the answer depends on the application, I’m sure it’s possible to give some general guidance on how to implement a given method efficiently. I have read the entire programming guide, and I still don’t know the best way to implement my method. I’m sure I’m not the only one in this situation. CUDA is a very powerful tool ONLY if it’s used well.

It all depends on the application, I am afraid. It is often necessary to implement several versions of your algorithm and benchmark them to see which implementation is the fastest (and for which input sizes).

1D Textures are useful when you can almost, but not quite coalesce your reads. 2D textures are useful if you read across rows and down columns in a single warp. They allow you to reach full memory bandwidth in these cases.

I’ve personally never seen texture reads beat the 70 GiB/s effective global memory bandwidth. The “cache” serves only to help with local reads within a warp.

Do you know the bandwidth of the texture?

Matrix multiplication makes coalesced reads on matrix A, non-coalesced reads on matrix B, and coalesced writes on matrix AB (the result).

Maybe it’s a good idea to use global memory for A and AB, and a texture for B…
Comments welcome :)

Another thing. The matrix is of course a 2D array, but I’ve read that we can use a 1D array instead. Do you have any comments on that?
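
For what it's worth, here is a minimal host-side sketch of what "using a 1D array instead" usually means: the matrix is stored row by row in one flat buffer, and element (row, col) lives at offset row * cols + col. The names flat_index and matmul_flat are made up for this illustration and do not come from the SDK; the row-major layout is an assumption, and this is plain CPU code rather than a kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (row, col) in a row-major matrix with `cols` columns.
// Hypothetical helper, named for this example only.
inline std::size_t flat_index(std::size_t row, std::size_t col, std::size_t cols) {
    assert(col < cols);
    return row * cols + col;
}

// C = A * B with A (m x k), B (k x n), C (m x n),
// all stored as 1D row-major arrays (assumed layout).
std::vector<float> matmul_flat(const std::vector<float>& a,
                               const std::vector<float>& b,
                               std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k);
    assert(b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                // A is walked across a row (step of 1 element),
                // B is walked down a column (step of n elements).
                sum += a[flat_index(i, p, k)] * b[flat_index(p, j, n)];
            }
            c[flat_index(i, j, n)] = sum;
        }
    }
    return c;
}
```

With this layout, neighbouring elements in a row are adjacent in memory, while neighbouring elements in a column are a full row apart. That difference in stride is what the coalescing discussion in this thread is about.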

On an 8800 GTX, optimal texture bandwidths are the same as global memory: 70 GiB/s.

2D textures are useful when consecutive threads read down columns instead of across rows, and when a warp reads both down columns and across rows (think image filtering).

You can always compute your effective memory rate based on the number of bytes read/written to see how close you are pushing the device limits.
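
As a rough sketch of that calculation: add up the bytes your kernel reads and writes, divide by the elapsed time, and convert to GiB/s. The function name effective_bandwidth_gib_s is hypothetical, and the byte counts and timing are whatever you measure for your own kernel; nothing here is taken from the SDK.

```cpp
#include <cassert>
#include <cstddef>

// Effective memory rate in GiB/s from the bytes a kernel moved and its run time.
// Hypothetical helper; the caller supplies its own measured values.
double effective_bandwidth_gib_s(std::size_t bytes_read,
                                 std::size_t bytes_written,
                                 double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    const double total_bytes =
        static_cast<double>(bytes_read) + static_cast<double>(bytes_written);
    const double seconds = elapsed_ms / 1000.0;
    const double bytes_per_gib = 1024.0 * 1024.0 * 1024.0;
    return total_bytes / seconds / bytes_per_gib;
}
```

For example, a kernel that reads 1 GiB and writes 1 GiB in 500 ms comes out at 4 GiB/s, which you can then compare against the 70 GiB/s figure quoted above for the 8800 GTX.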