CUDA texture cache

How big is the G80’s texture cache? 6 MB? 4 MB? 2 MB?
I ask because I need to traverse a big tree stored in a texture, and the algorithm has no coherency (the kernel is going to visit very random parts of the texture, so shared memory cannot help much).
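For concreteness, here is a minimal sketch of what such a traversal kernel might look like with the (G80-era) texture reference API. The node layout, names, and leaf-sentinel convention are all illustrative assumptions, not actual code from my project:

```cuda
// Hypothetical sketch: binary search tree stored in a 1D texture.
// Each node is an int2: .x = key, .y = index of left child
// (right child assumed at .y + 1); negative index = leaf sentinel.
texture<int2, 1, cudaReadModeElementType> treeTex;

__global__ void findLeaf(const int *queries, int *result, int numQueries)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numQueries) return;

    int key = queries[tid];
    int node = 0;                        // start at the root
    while (node >= 0) {
        int2 n = tex1Dfetch(treeTex, node);
        result[tid] = node;              // remember last node visited
        node = (key < n.x) ? n.y : n.y + 1;
    }
}
```

Every thread takes a different path down the tree, which is exactly the incoherent access pattern I’m worried about.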


The programming guide (A.1.1) says 6-8 kB per multiprocessor. I assume the exact value depends on the GPU. That’s not going to help you much if you can’t get any data locality.

Well, if a cache hit costs only 1–2 cycles then it will be OK… compared with the ~600 cycles of a global memory read…

Having no data locality at all means fetching the data from global memory.

Relatedly: Is the texture cache shared between all multiprocessors?

In other words: Do I benefit if I have fetch locality that’s not within the same thread block?


PS: I know that not all of the potentially-local thread blocks will be scheduled at the same time, but I’d expect that that would be true at least of a certain fraction.

From the CUDA 2.0 docs, section 3.1, “A Set of SIMD Multiprocessors with On-Chip Shared Memory”:

“A read-only texture cache that is shared by all the processors and speeds up reads from the texture memory space, which is implemented as a read-only region of device memory.”

And from A.1.1:

“The cache working set for texture memory varies between 6 and 8 KB per multiprocessor.”

So it’s not very clear if it’s shared or not… haha!

Thanks for the quick reply!

You’re right, even with that it’s still not clear. In 3.1, the phrase “that is shared by all processors” is used in reference to shared memory as well.


There are some other papers out there that provide more detail:

You have a Texture Processor Cluster (TPC). That is where the texture unit sits, with its 16 KB cache. Each cluster contains two multiprocessors, so each multiprocessor has 8 KB of texture cache “available”. And each multiprocessor (which contains 8 shader processing units) has 16 KB of shared memory accessible from those 8 SPUs.