How do TMUs hide cache latency? How does SIMD efficiency affect TMUs' performance?

My program uses many texture sampling instructions, so in order to improve its performance I need to know more about the TMUs, which are hardly documented by Nvidia. Two things in particular are currently bothering me:
How do TMUs hide cache/RAM latency?
How does SIMD efficiency affect the TMUs' performance?

I’d appreciate your help.

Regards Fiepchen

The easiest way to think about texture units when writing code is to assume that they are
identical to caches. They hide latency in the same way that a cache would; the exact mechanism
depends on the implementation, which may change from one architecture to the next.

SIMD efficiency affects performance in the same way that it affects a cache. If some threads are
inactive, then they won’t send a request to the texture unit, and the rate at which the SM can
issue requests is reduced. There are many more details that determine texture performance and it
is probably more productive to use microbenchmarks to explore the limits of performance under different
scenarios that are meaningful to your application.
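To make the microbenchmark suggestion concrete, here is a minimal sketch of the kind of experiment meant above: a kernel that issues texture fetches in a loop, launched with an increasing number of blocks so that more and more requests are in flight. All names (`texBench`, the sizes, the iteration count) are mine, not from the original post, and the numbers you get will depend heavily on the architecture.

```cuda
// Minimal texture-throughput microbenchmark sketch (hypothetical names/sizes).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void texBench(cudaTextureObject_t tex, float *out, int n, int iters)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i)
        acc += tex1Dfetch<float>(tex, (idx + i) % n);  // texture fetch under test
    out[idx] = acc;                                    // keep the compiler from optimizing the loop away
}

int main()
{
    const int n = 1 << 20, iters = 1024, block = 256;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // Bind the linear buffer as a texture object.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = d_in;
    res.res.linear.desc = cudaCreateChannelDesc<float>();
    res.res.linear.sizeInBytes = n * sizeof(float);
    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;
    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    // Sweep the number of resident warps: more warps -> more requests in flight,
    // which is how the latency gets covered.
    for (int blocks = 1; blocks <= n / block; blocks *= 4) {
        cudaEventRecord(t0);
        texBench<<<blocks, block>>>(tex, d_out, n, iters);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        double fetches = (double)blocks * block * iters;
        printf("%6d blocks: %8.3f ms, %10.2f Mfetch/s\n",
               blocks, ms, fetches / ms / 1e3);
    }

    cudaDestroyTextureObject(tex);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

Varying `iters`, the stride of the fetch index, or the footprint `n` lets you probe throughput, divergence tolerance, and cache size respectively.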

Here are some general rules of thumb that also apply to caches:

  1. The most efficient way to transfer data from the texture unit to the SM occurs when all threads in a warp access the same address. Expect performance to degrade as addresses become more divergent, since accesses cannot be aggregated into a single transaction. However, a good implementation of a texture unit will tolerate a variety of access patterns with performance that degrades gracefully with increased divergence.
  2. Getting full bandwidth out of the texture unit involves having enough requests in-flight to cover the access latency. The access latency is lower if all of the accesses hit in the cache.
  3. The cache covers multiple warps running on the same SM, and it works best if the collective working set of all of those threads is smaller than the cache size. It usually helps me to think of the entire set of threads as a single unit.
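Rule 1 in particular can be sketched as two kernels, one with a warp-uniform address and one with a per-lane stride. These kernels are illustrative only (the names and the assumption that the texture is large enough for `tid * stride` are mine):

```cuda
// Sketch of rule 1: uniform vs. divergent texture addresses (hypothetical kernels).

__global__ void uniformFetch(cudaTextureObject_t tex, float *out)
{
    // Every lane in the warp reads the SAME address: the access can be served
    // as a single transaction and broadcast -- the best case for the TMU.
    int i = blockIdx.x;  // uniform across all lanes of the warp
    out[blockIdx.x * blockDim.x + threadIdx.x] = tex1Dfetch<float>(tex, i);
}

__global__ void divergentFetch(cudaTextureObject_t tex, float *out, int stride)
{
    // Each lane reads an address `stride` elements apart. As stride grows,
    // the warp's accesses span more cache lines and can no longer be
    // aggregated into one transaction, so expect throughput to fall off.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = tex1Dfetch<float>(tex, tid * stride);
}
```

Timing these two against each other (with the event-based setup from a microbenchmark harness) shows how gracefully a given architecture's texture unit degrades with divergence.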

Thank you very much for your answer!

So the TMUs aren't exclusively occupied by a single warp, like a CUDA core lane or the SFUs, but behave more like the LSUs and serve a pool of warps at the same time in order to hide latency and increase their sampling performance?

That’s correct.

Thanks! You've helped me very much :)