Matrix multiplication: texture vs. global memory, ALU:TEX ratio, broadcast

If texture memory is as good as global memory, why don’t we use it for matrix multiplication? Why use shared memory instead?
Or even better, why not read the matrices from texture memory into shared memory instead of from global memory? That would solve both the problem of uncoalesced reads (since these are always present in the matrix product) and the issue of duplicate reads.
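For concreteness, this is the kind of kernel fragment I have in mind — a hypothetical sketch, assuming matrix A has already been bound to a 1D texture reference `texA` with `cudaBindTexture()` (the pre-Fermi texture reference API):

```cuda
#define TILE 16

// Hypothetical: A bound to texA on the host with cudaBindTexture().
texture<float, 1, cudaReadModeElementType> texA;

__global__ void stageTileFromTexture(float *C, int width)
{
    __shared__ float As[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    // Each thread fetches one element through the texture cache.
    // tex1Dfetch() takes a flat integer index, so the coalescing
    // rules for global loads don't apply to this access.
    As[threadIdx.y][threadIdx.x] = tex1Dfetch(texA, row * width + col);
    __syncthreads();

    // ... use As[][] for the dot-product accumulation as usual ...
    C[row * width + col] = As[threadIdx.y][threadIdx.x];
}
```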

Is the number of TEX units the bottleneck? On the other hand, isn’t the bandwidth the same for global and texture memory? Then the number of TEX units shouldn’t make a difference, right?
I searched but couldn’t find this anywhere. What is the official information on the ALU:TEX ratio and bus width?

Does global/texture memory support broadcast? What happens if N threads read from the same address?

I’m trying to code an example myself using texture and shared memory… but I’m having some problems with transposition (and bank conflicts)… so my results aren’t reliable yet. So far it’s either as fast as or slower than the code in the SDK, which doesn’t make much sense.

CUBLAS uses texture memory for some problem sizes (non-power-of-two, mostly), so yes, it makes sense in some cases.

Doing this will essentially duplicate data between the shared mem and the texture cache, and will not take advantage of the texture cache at all. Nor will it magically solve coalescing and duplicate read issues.

Cache misses in texture memory have a higher latency than global loads, because the hardware has to first look up the address in the cache tags, fail to find it, and then issue an external memory load. So if you know there is no spatial locality in your data, you’re better off just using global memory.

Also, each time you read 4 bytes from texture memory and cause a miss, a whole 256-byte cache line is fetched from DRAM, which may be much worse than using uncoalesced loads (at least on GT200-class hardware).

Not official, but this Wikipedia page is fairly complete:…88xxx.29_series

This is the area where texture caches really shine… This is mostly what they were designed for actually…

So when the matrices are powers of 2, half-warps (16 threads) always access global memory in a coalesced way. That’s why the matrixMul example uses 16x16 thread blocks: for accesses to be coalesced, each half-warp (in this case each thread row) has to access sequential values. Correct?
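In code, the access pattern I mean would look something like this (a simplified sketch of the SDK-style tile load, not the actual matrixMul source; `width` is the row-major matrix width):

```cuda
#define BLOCK 16

// Each thread of a 16x16 block loads one element into shared memory.
// Within a half-warp, threadIdx.x runs 0..15, so the 16 threads read
// 16 consecutive floats from the same row -- one coalesced 64-byte
// transaction on G80/GT200, provided the row start is aligned.
__global__ void loadTileCoalesced(const float *A, float *out, int width)
{
    __shared__ float As[BLOCK][BLOCK];

    int row = blockIdx.y * BLOCK + threadIdx.y;
    int col = blockIdx.x * BLOCK + threadIdx.x;

    As[threadIdx.y][threadIdx.x] = A[row * width + col];  // coalesced load
    __syncthreads();

    // ... accumulate the partial dot products from As[][] here ...
    out[row * width + col] = As[threadIdx.y][threadIdx.x];
}
```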

(Note that I had previously, incorrectly, stated that uncoalesced reads are always present in matrixMul; I had not known that coalescing works at the half-warp level… or am I wrong? :s)

When the matrices aren’t powers of 2, we can’t get the 16 threads of each half-warp to access sequential values, so some reads will not be coalesced, and texture memory will be faster.

But I still don’t understand why texture memory loses to global memory (if that really is the case?!) when accesses are coalesced.

And Volkov’s matrix multiplication doesn’t even mention the use of textures! So there must be a reason.

Does this have to do with shared memory bandwidth?

With the GTX 275, shared memory gives me 1.35 TB/s (240 ALUs × 4 bytes × 1404 MHz shader clock), while texture memory only gives 810 GB/s (80 TEX units × 16 bytes × 633 MHz core clock), assuming a 3:1 ALU:TEX ratio and one vec4 per clock.

Or am I just rambling here?

I feel I’m missing some pieces of the puzzle.

And another thing: whenever I read something from global memory, is the TEX unit always used for address translation? I read something about that being the case for AMD GPUs, but I may have misunderstood.

So are cache misses the only reason why texture memory can be slower than global memory?

Or does the bandwidth difference between shared memory and the texture cache (that I mentioned above) also have something to do with it?

Thanks for your feedback.

Yes, I believe this is correct.

My observations show that there is a latency increase of around 70ns when using texture memory (cache miss) over global memory.

It may or may not matter for the final performance…

Volkov’s matrix multiplication is compute-bound, so memory accesses are not the bottleneck, and how memory is accessed should have little impact on performance.

Not sure about the precise figures, but you are right that shared mem is much faster than the texture cache, both in latency and bandwidth.

So it definitely makes sense to prefer shared memory + global memory over texture memory by itself.

The texture cache cannot predict which data will be reused soon and which will not, but for a problem as simple as matrix multiplication the programmer can. In such cases a software-managed cache can be much faster than a hardware cache.

Now using both shared mem + texture memory may be faster if you’ve got a lot of uncoalesced accesses. But since the texture cache is smaller than the shared memory, there is no real advantage in cascading them.

This is correct, I guess…