Understanding GPU caches: can't get my head around it

In a Real World Technologies article it is stated:

What precisely does it mean that tex caches can’t work out of order? That they can’t be read at arbitrary addresses (“gather”)? That they have to be read en masse: a whole texture at once, or a whole tex cache partition (32KB for L2, 8KB for L1) at once, instead of a word/byte/4B? So are they mostly useless for computational purposes?

Also, why can’t they cut latency? Are they slow? If a game (not GPGPU) reads the same texture multiple times and thus uses the tex cache, doesn’t it get it faster than from main memory?

They are immensely useful for computational purposes. My project, HOOMD, is a factor of 3 to 5 faster with textures than without. Its major bottleneck is a large amount of semi-random reads using tex1Dfetch.
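For reference, that kind of semi-random gather through the texture path looks roughly like this. This is a minimal sketch of the classic texture-reference API, not HOOMD's actual code; the names posTex and idx are made up:

```cuda
#include <cuda_runtime.h>

// 1D texture reference bound to plain linear device memory
texture<float, 1, cudaReadModeElementType> posTex;

__global__ void gather(const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(posTex, idx[i]);  // cached, read-only fetch
}

// Host side, before launching:
//   cudaBindTexture(0, posTex, d_pos, n * sizeof(float));
```

The same reads done directly from global memory would be uncoalesced; routing them through tex1Dfetch lets the texture cache absorb the locality that does exist in the index pattern.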

For games, I think the texture caches are primarily designed to facilitate filtering.

I would guess that the comment about GPU caches not cutting latency the way CPU caches do refers to the fact that GPU caches are read-only. On the CPU, values can be read and written many times while staying in the low-latency cache the whole time. The other side of this is that GPU caches are so small that previously read values are flushed out so fast it is almost as if there were no temporal locality.

So you can read caches at random and it is faster than reading from memory? As long as you’re fine with read-only access (i.e. you don’t write textures), you do get a significant latency improvement over the insane 600 cycles for a main memory access?

(Of course, I realize it’s not the programmer who reads the caches, 'cuz it’s auto-managed, as opposed to user-controlled Shared Memory.)

Absolutely. Bandwidth through the cache mainly depends on how local the accesses are within the 32 threads of each warp. The more local you make those accesses, the more bandwidth you will get.

Latency is just so hard to evaluate on a massively latency-hiding GPU architecture. I can say with 100% confidence that local memory access patterns, as mentioned above, increase bandwidth by many factors vs uncached global memory reads. I’ve only ever done one benchmark regarding latency: a kernel that read just a single value out of the texture over and over again. The effective bandwidth was something like 200 GiB/s (this was on a G92), implying that there was definitely some latency reduction happening. But as to evaluating the actual latency? I wouldn’t have a guess where to begin.
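That benchmark could be sketched roughly like this (the names and the iteration count are assumptions, not the original code):

```cuda
texture<float, 1, cudaReadModeElementType> tex;

__global__ void hammerOneTexel(float *out)
{
    float acc = 0.0f;
    for (int i = 0; i < 1024; ++i)
        acc += tex1Dfetch(tex, 0);  // always address 0: guaranteed cache hit
    // write the result so the compiler can't eliminate the loop
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}
```

Effective bandwidth is then threads × iterations × 4 bytes divided by the kernel time; since every fetch hits the same texel, anything much above DRAM bandwidth has to be coming from the cache.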

I had a query about this last week. When I wrote a program to check, I found that using the texture cache gave me a 30% speed boost over fully coalesced reads on a C1060. I had multiple blocks reading the same data, and that block of data fit entirely in the GPU texture cache.

Thanks a lot for these answers! :)

You have to be more careful about the word “faster”. It can mean a lot of things.

Your panic over a 600-cycle latency is a CPU-centric view, where your CPU would sit wasted while waiting. In practice a GPU doesn’t even notice such latencies, because there are other warps that can do work in the meantime. CUDA is amazing at hiding even huge latencies, enough that you can ignore them in most cases.
Bandwidth is the most common computational bottleneck, and that’s what the caches reduce.

Nvidia probably could engineer the texture caches to reduce latency as well, but what would be the point? They already have a better way of eliminating latency issues: massive, cost-free threading. This means they don’t need a more complex cache controller that services memory requests out of order.

So caches are not “faster” in terms of shorter access times. What they do is reduce the amount of data you need to pull through the limited connection between the GPU and the device memory. That pipe is monster-huge (100+ GB/sec), but many apps will use every bit of it and ask for more. In those (common) cases, texture caches reduce the impact of this bottleneck, and therefore your app indeed runs faster.

Mr. Anderson is very correct in that CPU caches really do behave differently than GPU texture caches. But that’s a good thing!

OK, so I’m really getting confused here. I do understand (finally) that having a small on-chip memory saves the effort of reading often-used data from main memory, and so bandwidth is saved. But I thought that the tex cache could also be used for reducing latencies. Id est, that tex caches do have short access times. Don’t they?

On a side note, 140 GBps is not that huge when you realize it has to feed 240 “cores”. When you do the math, it might turn out that this per-core bandwidth is no larger than what a CPU core enjoys. But that’s of course just me playing with numbers. ;)

No. The latency of the tex cache is the same as that of texture memory. Here is what the CUDA programming guide says:

Also, it is designed for streaming fetches with a constant latency, i.e. a cache hit reduces DRAM bandwidth demand, but not fetch latency.

My guess is that the texture cache is just another DRAM that uses a different bus than that of the global/texture memory.

This is very true, and explains why lookup tables are sometimes counterproductive in GPU algorithms, especially if that lookup table is in global memory. It can be cheaper to recompute a value rather than read it from memory.
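As a hypothetical illustration of that trade-off (these kernels are invented for this post, and assume a 1024-entry table):

```cuda
// Option 1: look the value up in a precomputed table in global memory.
__global__ void viaTable(const float *sinTable, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sinTable[i % 1024];  // one global-memory read per thread
}

// Option 2: just recompute it. A few ALU cycles, zero memory traffic.
__global__ void viaRecompute(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf((i % 1024) * 0.001f);
}
```

When the app is bandwidth-bound, the second version can win even though it does “more work” per thread.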

But in a CPU calculation you could never assign just a single array element per core or run 40,000 threads concurrently, while this is the norm for GPUs.

OK, I know I’m killing you with these questions, but I’d like to have the full picture.

  1. In a GPGPU program, is all the data that is read from the main graphics memory potentially cached in the texture cache, or does the programmer need to do anything to make it cacheable? (Like explicitly defining it as a texture or something… Although that would be silly I guess, since CUDA is all about not having to deal with textures.)

  2. Are there any educated guesses on what the actual latencies of texture caches are? I guess there must be some access-time benefit to having this memory on-chip. Not to mention: what would be the point of having L1 and L2 if they worked just the same…

You have to declare the texture and bind the data to the texture. Look up cudaBindTexture in the programming guide. If your data has 2D locality and/or you want hardware filtering, you need to allocate memory in a cudaArray and bind it to a texture with cudaBindTextureToArray. So it is not automatic.
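A minimal sketch of both binding styles with the texture-reference API of that era (all names are invented, error checking omitted):

```cuda
#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> linTex;  // for linear device memory
texture<float, 2, cudaReadModeElementType> arrTex;  // for a cudaArray

void bindExamples(float *d_data, int n, int w, int h)
{
    // 1D case: bind plain device memory.
    // Kernels then read it with tex1Dfetch(linTex, i).
    cudaBindTexture(0, linTex, d_data, n * sizeof(float));

    // 2D case: a cudaArray gives you 2D spatial locality and
    // hardware filtering. Kernels read it with tex2D(arrTex, x, y).
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray *arr;
    cudaMallocArray(&arr, &desc, w, h);
    cudaBindTextureToArray(arrTex, arr, desc);
}
```

Only reads that go through these texture fetch functions hit the texture cache; ordinary pointer dereferences of global memory bypass it entirely.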