Texture cache without flush

Hi,

I was wondering whether it's possible to keep the contents of the texture cache alive across consecutive kernel calls (without flushing and reloading). That way, in the many applications where a kernel is invoked repeatedly or iteratively (as a means of global synchronization), I would not need to reload the texture cache on every call.

I also understand that for some applications, where the memory region accessed through the texture cache changes between kernel calls, it's better to flush and reload the texture cache on each call.

But I am looking at other applications, where the contents of the texture region remain the same across successive kernel calls.

Any suggestions or possibilities?

The texture cache is flushed faster than you think. Assuming you are reading float4's: 24 warps/MP * 32 threads/warp * 16 bytes/thread = 12,288 bytes > 8 KB texture cache/MP. So items start expiring from the texture cache before the MP can even service a single texture read for every warp that is concurrently running.

I agree. But if my kernel takes around 2000 cycles to execute, I would rather not wait an additional 1000–2000 cycles for the texture cache to be filled from global memory on every kernel call. Assuming the kernel is called in a loop and the cache fill is repeated on each MP, the overall bandwidth saved by avoiding the flush/fill operation could be significant.

Why should there be a performance penalty for the flush? It's not as if the textures have a write cache. And you pay the fill time anyway every time a new block shows up on an MP, since which blocks execute on which MP is non-deterministic and varies from one kernel execution to the next.

Even if CUDA provided hooks to play with the cache as you suggest, I doubt the performance gain would be measurable over the noise. In kernels with good texture-read locality within the warp, you can already attain performance at peak device memory bandwidth: there isn't any more performance for the device to give. Whatever small startup penalty you pay in the latency of the first read is more than made up for by the GPU's effective latency hiding across the thousands of future reads. On top of that, the launch overhead for a kernel is already ~15 µs or more.

This is the biggest issue here. A kernel which only takes 2000 cycles does ~1.5 µs of computation. The launch overhead for the kernel overwhelms both the actual calculation and the filling of the texture cache.

I have some questions related to this discussion.

  1. Is the texture cache always automatically flushed when a new block is loaded onto an MP, or not? I have been assuming that it isn't.

  2. If it isn’t then would there be any way of influencing the loading of blocks onto MPs so as to maximize cache reuse between blocks?

  1. The documentation doesn’t say either way, but I would assume there is no flush when a new block comes onto an MP; flushing there wouldn’t make much sense.

  2. You can’t influence what blocks go onto which MP. That is all up to the GPU hardware scheduler and is completely non-deterministic.

Just to add a little more to the discussion, I find it helpful to not think of the texture “cache” as a cache, but rather as an “almost coalesced memory reader”. First, the “cache” is so tiny that by the time the MP gets back to executing a warp, the cache has likely already evicted the value that the warp previously read. Thus, temporal locality among reads doesn’t increase performance at all. Data locality of the reads within the warp is all that matters, hence the term “almost coalesced memory reader”.
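To illustrate the "almost coalesced memory reader" point: what pays off is neighboring threads in a warp fetching neighboring texels, not re-reading the same texel later. A minimal sketch using the old-style texture reference API (the names `texData` and `copyThroughTex` are mine, not from any real codebase):

```cuda
// Hypothetical example: each thread fetches the texel matching its global
// index, so one warp touches 32 consecutive float4's -- good within-warp
// locality, which is what the texture path rewards.
texture<float4, 1, cudaReadModeElementType> texData;

__global__ void copyThroughTex(float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texData, i);
}
```

Re-reading `texData` at the same index later in the same kernel would buy nothing: by the time the warp is scheduled again, that line has almost certainly been evicted.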

Second, the “cache” doesn’t seem able to deliver more than the device memory bandwidth. I know the programming guide says that the “cache” reads values at the speed of shared memory (or something like that) on a hit, but my experimentation shows otherwise. A kernel where every single thread reads the same element from a texture still only attains 70 GiB/s of memory bandwidth (counting the total bytes read as num_threads * sizeof(texture element)).
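The experiment described above can be reproduced with something like this (a sketch of the idea, not the poster's actual benchmark; the timing is done host-side with CUDA events, and all names are my own):

```cuda
// Hypothetical benchmark: every thread reads texel 0, a 100% "hit" pattern.
// If the texture cache behaved like shared memory on a hit, this should run
// far above device memory bandwidth -- in practice it does not.
texture<float4, 1, cudaReadModeElementType> texData;

__global__ void readSameTexel(float4 *out)
{
    float4 v = tex1Dfetch(texData, 0);   // every thread: same element
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *out = v;                        // keep the read from being optimized away
}

// Host side (sketch): bracket the launch with cudaEventRecord, then compute
//   bandwidth = num_threads * sizeof(float4) / elapsed_seconds
// Counting bytes this way, the observed rate stays near device memory
// bandwidth rather than anywhere near shared-memory speed.
```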