In a Real World Technologies article it is stated:
What precisely does it mean that tex caches can't work out of order? That they can't be read at arbitrary addresses ("gather")? That they have to be read en masse - a whole texture at once? A whole tex cache partition (32KB for L2, 8KB for L1) at once? Instead of a word/byte/4B? So are they mostly useless for computational purposes?
Also, why can't they cut latency? Are they slow? If a game (not GPGPU) reads the same texture multiple times and thus uses the tex cache, doesn't it get it faster than from main memory?
They are immensely useful for computational purposes. My project, HOOMD, is a factor of 3 to 5 faster with textures than without. Its major bottleneck is a large number of semi-random reads using tex1Dfetch.
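For concreteness, here is a minimal sketch of that kind of access pattern (not code from HOOMD; the kernel and array names are made up): values sitting in plain linear device memory are bound to a texture reference and then gathered at data-dependent indices with tex1Dfetch. This is the legacy texture-reference API of this CUDA generation; later toolkits deprecate it in favour of texture objects.

```
#include <cuda_runtime.h>

// Texture reference bound to plain linear device memory (legacy API).
texture<float, 1, cudaReadModeElementType> valuesTex;

// Hypothetical gather kernel: each thread reads a value at a data-dependent
// index, so the loads are only semi-coalesced at best.
__global__ void gather(const int *indices, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(valuesTex, indices[i]);  // read goes through the texture cache
}

int main()
{
    const int n = 1 << 20;
    float *d_values, *d_out;
    int *d_indices;
    cudaMalloc(&d_values, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMalloc(&d_indices, n * sizeof(int));
    cudaMemset(d_values, 0, n * sizeof(float));   // real code would upload actual data
    cudaMemset(d_indices, 0, n * sizeof(int));    // and a real index pattern here

    cudaBindTexture(0, valuesTex, d_values, n * sizeof(float));
    gather<<<(n + 255) / 256, 256>>>(d_indices, d_out, n);
    cudaDeviceSynchronize();
    cudaUnbindTexture(valuesTex);

    cudaFree(d_values);
    cudaFree(d_out);
    cudaFree(d_indices);
    return 0;
}
```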
For games, I think the texture caches are primarily designed to facilitate filtering.
I would guess that the comment about GPU caches not reducing latency like CPU caches comes down to the GPU caches being read-only. On the CPU, values can be read and written many times while staying in the low-latency cache the whole time. The other side of this is that the GPU caches are so small that previously read values are flushed out so fast it is almost as if there is no temporal locality.
So you can read caches at random and it is faster than reading from memory? As long as you're fine with read-only access (i.e. you don't write textures), you do get a significant latency improvement over the insane ~600 cycles for a main memory access?
(Of course, I realize it's not the programmer who reads the caches, since they're auto-managed, as opposed to the user-controlled shared memory.)
Absolutely. Bandwidth through the cache mainly depends on how local the accesses are within the 32 threads of each warp. The more local you make those accesses, the more bandwidth you will get.
Latency is just so hard to evaluate on a massively latency-hiding GPU architecture. I can say with 100% confidence that warp-local access patterns as mentioned above increase bandwidth by many factors vs uncached global memory reads. I've only ever done one benchmark regarding latency: a kernel that read just a single value out of the texture over and over again. The effective bandwidth was something like 200 GiB/s (this was on a G92), implying that there was definitely some latency reduction happening, but as to evaluating the actual latency? I wouldn't have a guess where to begin.
I had a query about this last week. When I wrote a program to check, I found that using the texture cache gave me a 30% speed boost over fully coalesced reads on a C1060. I had multiple blocks reading the same data, and that block of data fit entirely in the GPU texture cache.
You have to be more careful about the word “faster”. It can mean a lot of things.
Your panic over a 600-cycle latency is a CPU-centric view, where your CPU would be wasted while waiting. In practice a GPU doesn't even notice such latencies, because you have other warps which can do work while waiting. CUDA is amazing at hiding even huge latencies, enough that you can ignore them in most cases. Bandwidth is the most common computational bottleneck, and that's what the caches reduce.
Nvidia probably could engineer the texture caches to reduce latency as well, but what's the point? They already have a better way of eliminating latency issues: massive, cost-free threading. This means they don't need a more complex cache controller that services memory requests out of order.
So caches are not "faster" in terms of shorter access times. What they do is reduce the amount of data you need to pull through your limited connection between the GPU and the device memory. That pipe is monster-huge (100+ GB/s), but many apps will use every bit of it and ask for more. In those (common) cases, texture caches reduce the impact of this bottleneck, and therefore your app indeed runs faster.
Mr. Anderson is very correct in that CPU caches really do behave differently than GPU texture caches. But that’s a good thing!
OK, so I'm really getting confused here. I do understand (finally) that having a small on-chip memory saves the effort of reading often-used data from main memory, and so bandwidth is saved. But I thought the tex cache could also be used for reducing latencies. That is, that tex caches do have short access times. Don't they?
On a side note, 140 GB/s is not that huge when you realize it has to feed 240 "cores". When you do the math, it might turn out that the per-core bandwidth is not at all smaller than what a CPU core enjoys. But that's of course just me playing with numbers. ;)
This is very true, and explains why lookup tables are sometimes counterproductive in GPU algorithms, especially if the lookup table is in global memory. It can be cheaper to recompute a value than to read it from memory.
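As a toy illustration of that trade-off (hypothetical kernels, not from any real code base): the first kernel fetches precomputed sine values from a lookup table in global memory with a data-dependent index, the second simply recomputes them with the __sinf() intrinsic. On a bandwidth-bound kernel the recompute version can easily win.

```
#include <cuda_runtime.h>

// Version A: fetch a precomputed sine value from a table in global memory.
// The data-dependent index makes the read uncoalesced.
__global__ void useLut(const float *sinTable, int tableSize,
                       const float *x, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int idx = (int)(x[i] * tableSize) & (tableSize - 1);  // tableSize assumed a power of two
        out[i] = sinTable[idx];
    }
}

// Version B: just recompute the value. A few extra ALU instructions,
// but no extra memory traffic at all.
__global__ void recompute(const float *x, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(2.0f * 3.14159265f * x[i]);
}
```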
But in a CPU calculation you could never assign just a single array element per core or run 40,000 threads concurrently, while this is the norm for GPUs.
OK, I know I’m killing you with these questions, but I’d like to have the full picture.
In a GPGPU program, is all the data that is read from main graphics memory potentially cached in the texture cache, or does the programmer need to do anything to make it cacheable? (Like explicitly defining it as a texture or something… although that would be silly I guess, since CUDA is all about not having to deal with textures.)
Are there any educated guesses on what the actual latencies of texture caches are? I guess there must be some access-time benefit to having this memory on-chip. Not to mention - what would be the point of having both L1 and L2 if they worked just the same…
You have to declare the texture and bind the data to the texture. Look up cudaBindTexture in the programming guide. If your data has 2D locality and/or you want hardware filtering, you need to allocate memory in a cudaArray and bind it to a texture with cudaBindTextureToArray. So it is not automatic.
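A minimal sketch of the cudaArray path (again the legacy texture-reference API of this CUDA generation; names and sizes are illustrative only): allocate a cudaArray, bind it with cudaBindTextureToArray, and read it in a kernel with tex2D.

```
#include <cuda_runtime.h>
#include <stdlib.h>

// 2D texture reference backed by a cudaArray (legacy API).
texture<float, 2, cudaReadModeElementType> tex2DRef;

__global__ void readTexture(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = tex2D(tex2DRef, x + 0.5f, y + 0.5f);  // unnormalised coords, texel centres
}

int main()
{
    const int width = 512, height = 512;
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();

    // Allocate the cudaArray and copy data into it (zero-filled here for brevity).
    cudaArray *array;
    cudaMallocArray(&array, &desc, width, height);
    float *h_data = (float *)calloc(width * height, sizeof(float));
    cudaMemcpyToArray(array, 0, 0, h_data, width * height * sizeof(float), cudaMemcpyHostToDevice);

    // Bind the array to the texture reference; filtering/addressing modes are optional extras.
    tex2DRef.filterMode = cudaFilterModePoint;
    cudaBindTextureToArray(tex2DRef, array, desc);

    float *d_out;
    cudaMalloc(&d_out, width * height * sizeof(float));
    dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
    readTexture<<<grid, block>>>(d_out, width, height);
    cudaDeviceSynchronize();

    cudaUnbindTexture(tex2DRef);
    cudaFreeArray(array);
    cudaFree(d_out);
    free(h_data);
    return 0;
}
```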