Matrix multiplication using texture

garciav · April 17, 2008, 8:39am

Hi,

I’ve decided to learn how textures work by applying them to the simple matrix multiplication problem.
I’ve compared my results with the “matrixMul” example given in the SDK.
(Note: This matrixMul uses shared memory to handle non-coalesced reads and writes.)

With matrices of size 1000x5, using textures speeds up the multiplication by a factor of about 3:

Duration without texture : 151.78ms
Duration with texture : 54.79ms

But with matrices of size 1000x30, using textures slows the multiplication down:

Duration without texture : 252.21ms
Duration with texture : 320.64ms

I’d like to understand textures thoroughly. In my opinion, textures are not very well documented in the programming guide.
Can you give me some useful information about how textures work?
Thanks,
Vince

seibert · April 17, 2008, 12:22pm

This is not a direct answer to your question, but I noticed in the CUBLAS source code that many of the functions implement both texture and non-texture versions. An if-statement selects the appropriate version based on the matrix/vector size. It might be educational to take a look at some of these functions to see how NVIDIA chooses between the two for maximum performance.

garciav · April 17, 2008, 12:33pm

Actually, my question is related to your remark.

When do we have to use textures and when do we have to use global memory?

Even if the answer depends on the application, I’m sure it’s possible to give some general guidelines for implementing a given method efficiently. I have read the entire programming guide, and I still don’t know the best way to implement my method; I’m sure I’m not the only one in this situation. CUDA is a very powerful tool ONLY if it’s used well.

DenisR · April 17, 2008, 12:39pm

It all depends on the application, I am afraid. It is often necessary to implement several versions of your algorithm and benchmark them to see which implementation is the fastest (and for which input sizes).
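
A minimal sketch of such a benchmark in plain C++ host code. The helper name `best_time_ms` is hypothetical, and the sketch assumes each candidate implementation is wrapped in a callable that only returns once its work has finished. For a CUDA kernel that means synchronizing the device inside the callable, because kernel launches are asynchronous.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <limits>

// Best-of-N wall-clock time of one call to fn, in milliseconds.
// Assumption: fn blocks until its work is done. A callable that launches
// a CUDA kernel must synchronize the device before returning, otherwise
// only the launch overhead is measured.
double best_time_ms(const std::function<void()>& fn, int repeats) {
    assert(repeats > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        fn();
        const auto t1 = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms < best) {
            best = ms;  // keep the fastest run to reduce timing noise
        }
    }
    return best;
}
```

Timing each version over a range of input sizes shows where one implementation overtakes the other.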

MisterAnderson42 · April 17, 2008, 1:43pm

1D textures are useful when you can almost, but not quite, coalesce your reads. 2D textures are useful if you read across rows and down columns in a single warp. They allow you to reach full memory bandwidth in these cases.

I’ve personally never seen texture reads beat the 70 GiB/s effective global memory bandwidth. The “cache” serves only to help with local reads within a warp.

garciav · April 17, 2008, 3:08pm

Do you know the bandwidth of texture reads?

Matrix multiplication makes coalesced reads on matrix A, non-coalesced reads on matrix B, and coalesced writes on matrix AB (the result).

Maybe it’s a good idea to use global memory for A and AB, and a texture for B…
Waiting for comments :)

Another thing: the matrix is of course a 2D array, but I’ve read that we can use a 1D array instead. Do you have any comments on that?
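
A minimal sketch of the 1D layout in plain C++ host code, assuming row-major storage. The names `flat_index` and `matmul_flat` are hypothetical, and this is a CPU reference, not the SDK kernel. It shows how a 2D element (row, col) maps to one offset in a flat array, and why the reads on B are strided.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (row, col) in a row-major matrix with `width` columns.
inline std::size_t flat_index(std::size_t row, std::size_t col,
                              std::size_t width) {
    return row * width + col;
}

// Reference product C = A * B, where A is m x k, B is k x n, C is m x n,
// and all three are stored as flat row-major arrays.
std::vector<float> matmul_flat(const std::vector<float>& a,
                               const std::vector<float>& b,
                               std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                // A is walked along a row (stride 1);
                // B is walked down a column (stride n).
                sum += a[flat_index(i, p, k)] * b[flat_index(p, j, n)];
            }
            c[flat_index(i, j, n)] = sum;
        }
    }
    return c;
}
```

The same index arithmetic carries over to a kernel that receives plain pointers to linear device memory.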

MisterAnderson42 · April 17, 2008, 3:34pm

On an 8800 GTX, optimal texture bandwidths are the same as global memory: 70 GiB/s.

2D textures are useful when consecutive threads read down columns instead of across rows, and when reading both down columns and across rows (think image filtering).

You can always compute your effective memory rate from the number of bytes read and written and the kernel time, to see how close you are to the device limits.
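
A minimal sketch of that calculation in plain C++ host code. The function name is hypothetical, and the byte counts are whatever your kernel actually reads and writes, which you have to work out from the algorithm yourself.

```cpp
#include <cassert>
#include <cstdint>

// Effective memory rate in GiB/s: total bytes moved divided by elapsed time.
// bytes_read / bytes_written: bytes the kernel transfers from / to memory.
// elapsed_ms: measured kernel duration in milliseconds.
double effective_bandwidth_gib_s(std::uint64_t bytes_read,
                                 std::uint64_t bytes_written,
                                 double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    const double gib = 1024.0 * 1024.0 * 1024.0;
    const double seconds = elapsed_ms / 1000.0;
    const double total_bytes =
        static_cast<double>(bytes_read) + static_cast<double>(bytes_written);
    return (total_bytes / gib) / seconds;
}
```

For example, a kernel that reads 1 GiB and writes 1 GiB in 500 ms is moving 4 GiB/s, which can then be compared against the 70 GiB/s figure above.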
