I’ve decided to learn how textures work by applying them to a simple matrix multiplication problem.
I’ve compared my results with the “matrixMul” example given in the SDK.
(Note: this matrixMul uses shared memory to handle non-coalesced reads and writes.)
With matrices of size 1000x5, using textures speeds up the multiplication by a factor of almost 3:
Duration without texture : 151.78ms
Duration with texture : 54.79ms
But with matrices of size 1000x30, using textures slows the multiplication down:
Duration without texture : 252.21ms
Duration with texture : 320.64ms
I’d like to understand textures properly. In my opinion, they are not very well documented in the programming guide.
Can you give me some useful information about how textures work?
Thanks,
Vince
This is not a direct answer to your question, but I noticed in the CUBLAS source code that many of the functions implement both texture and non-texture versions. An if-statement selects the appropriate version based on the matrix/vector size. It might be educational to take a look at some of these functions to see how NVIDIA chooses between the two for maximum performance.
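To make the idea of a size-based switch concrete, here is a minimal host-side sketch in C++. It is not taken from the CUBLAS source: the names `Variant`, `chooseVariant` and the threshold `kTextureMaxElements` are hypothetical, and the direction of the test (texture for small inputs) only mirrors the two timings reported in the first post.

```cpp
#include <cstddef>

// Hypothetical labels for two implementations of the same operation.
// In a real program each would correspond to a different CUDA kernel.
enum class Variant { GlobalMemory, Texture };

// Hypothetical crossover point (in matrix elements). It is an assumption
// for illustration and would have to be measured per device and per kernel.
constexpr std::size_t kTextureMaxElements = 10000;

// Select the implementation for a rows x cols matrix based on its size.
Variant chooseVariant(std::size_t rows, std::size_t cols) {
    return (rows * cols <= kTextureMaxElements) ? Variant::Texture
                                                : Variant::GlobalMemory;
}
```

With this assumed threshold, `chooseVariant(1000, 5)` picks the texture path and `chooseVariant(1000, 30)` picks the global memory path. The real crossover has to come from benchmarks on the target card.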
When should we use textures, and when should we use global memory?
Even if the answer depends on the application, I’m sure it’s possible to give some general guidance on how to implement a given method efficiently. I have read the entire programming guide and I still don’t know the best way to implement my method, and I’m sure I’m not the only one in this situation. CUDA is a very powerful tool ONLY if it’s used well.
It all depends on the application, I am afraid. It is often necessary to implement several versions of your algorithm and benchmark them to see which is fastest (and for which input sizes).
1D textures are useful when you can almost, but not quite, coalesce your reads. 2D textures are useful if you read across rows and down columns in a single warp. They allow you to reach full memory bandwidth in these cases.
I’ve personally never seen texture reads beat the 70 GiB/s effective global memory bandwidth. The “cache” serves only to help with local reads within a warp.
On an 8800 GTX, optimal texture bandwidths are the same as global memory: 70 GiB/s.
Using 2D textures is useful when consecutive threads read down columns instead of across rows, or when they read both down columns and across rows (think image filtering).
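To see why row reads, slightly misaligned reads and column reads behave so differently from plain global memory, here is a rough C++ model that just counts how many aligned memory segments a group of consecutive threads touches. It is a simplified sketch, not a description of the hardware: the function name `segmentsTouched` is made up, and the group size (16 threads) and segment size (64 bytes) used in the example below are assumptions chosen to resemble a half-warp of float reads. It does not model the texture cache at all.

```cpp
#include <cstddef>
#include <set>

// Counts the distinct aligned segments touched when `threads` consecutive
// threads each read one element of `elemBytes` bytes, where thread t reads
// element index base + t * strideElems of a linear array.
std::size_t segmentsTouched(std::size_t base, std::size_t strideElems,
                            std::size_t threads, std::size_t elemBytes,
                            std::size_t segmentBytes) {
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < threads; ++t) {
        std::size_t byteAddr = (base + t * strideElems) * elemBytes;
        segments.insert(byteAddr / segmentBytes);
    }
    return segments.size();
}
```

For 16 threads reading 4-byte floats with 64-byte segments: an aligned read across a row, `segmentsTouched(0, 1, 16, 4, 64)`, touches 1 segment; the same read shifted by one element, `segmentsTouched(1, 1, 16, 4, 64)`, touches 2 (the "almost, but not quite" case); and a read down a column of a 1000-wide row-major matrix, `segmentsTouched(0, 1000, 16, 4, 64)`, touches 16. The last two patterns are the ones where textures are suggested above.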
You can always compute your effective memory bandwidth from the number of bytes read and written and the elapsed time, to see how close you are to the device limits.