Is texture fetch cached?

I’m trying to understand how the texture cache helps performance. I wrote a kernel like this:

texfetch(tex, x);  // warm up the cache

start = clock();
value = __int_as_float(start) - texfetch(tex, x);  // make the fetch depend on start
ivalue = __float_as_int(value);
end = ivalue - clock();  // make the second clock() depend on value
latency[tid] = (ivalue - end) - start;  // (ivalue - end) recovers the second clock() reading

When I print the latency on the host side, it reports 200-300 cycles. Assuming the second texfetch hits the cache, and the register RAW latency plus the conversion latency should be under 100 cycles, why does the second texfetch still take so long?

Thanks in advance!

Two things to check:

  1. Check the .ptx to see whether the compiler actually emitted two texfetches and did not optimize one away because the texture coordinate is the same.

  2. Check how many cycles are spent between cache warming and the first clock(). You might run into the clock() before the cache has even been filled from device memory.
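The first check can be scripted, assuming the CUDA toolkit's nvcc is on your path; the file name `latency.cu` is just a placeholder for your source:

```shell
nvcc -ptx latency.cu -o latency.ptx   # emit PTX instead of a binary
grep -c "tex\." latency.ptx           # count the tex instructions that survived
```

If the count is 1, the compiler merged the two fetches and the timing measures nothing.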


Thanks! Peter.

For 1. I did check the .ptx, and both fetches are there (I found that the CUDA compiler does not optimize away memory accesses, including texture fetches, the way a normal compiler does).

For 2. I deliberately inserted some code that uses the result of the first fetch, which guarantees the cache is filled before reaching the clock(). The first fetch takes more than 1700 cycles.
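For reference, a minimal sketch of that idiom; the `-1.0f` comparison is an artificial consumer (assumed never true for the data) that forces the warm-up fetch to complete before the first clock():

```cuda
float warm = texfetch(tex, x);        // first fetch: misses, fills the cache line
if (warm == -1.0f) latency[tid] = 0;  // artificial use so the fetch can't be skipped
start = clock();
value = __int_as_float(start) - texfetch(tex, x);  // this fetch should now hit
```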

The CUDA documentation says the constant cache is as fast as registers. How fast is the texture cache? Although the texture cache is said to be the same size as the constant cache, its structure should be very different (e.g. optimized for 2D spatial locality).

I have also experienced strange timings with clock(). I think the main reason is that the assembler that produces the .cubin moves instructions around. In some situations I got more stable timings by creating a dependency like you did and forcing a sync just before taking the first clock(). I don’t know the assembler internals, but I suspect it moves the start = clock() statement up in the code because it doesn’t depend on the texfetch (only “value” depends on it), giving wrong results. The sync seems to inhibit this.
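A sketch of what I mean, assuming a single-block launch so the barrier is safe:

```cuda
texfetch(tex, x);   // warm up the cache
__syncthreads();    // barrier: keeps the assembler from hoisting clock() above the fetch
start = clock();
value = __int_as_float(start) - texfetch(tex, x);
```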

Comments from NVIDIA ?


The compiler is careful not to move the clock instruction around. If you suspect the strange timings are a bug, please file a bug report with the code and we’ll take a look.

Remember that clock() measures wall-clock time, so if you want to measure texture latency, for example, it is best to run only one warp (one block of 32 threads) and do the texfetch in thread 0 only.
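Putting the whole measurement together, a hypothetical kernel launched as `<<<1, 32>>>` might look like this (the kernel name and `latency` output pointer are placeholders):

```cuda
__global__ void texLatency(int *latency, int x)
{
    if (threadIdx.x == 0) {             // time in thread 0 only, single warp
        texfetch(tex, x);               // warm up the cache
        int start = clock();
        float v = __int_as_float(start) - texfetch(tex, x);  // depends on start
        int iv = __float_as_int(v);
        int end = iv - clock();         // depends on v
        latency[0] = (iv - end) - start;
    }
}
```

With only one warp resident, no other instructions share the pipeline, so the wall-clock difference is close to the true fetch latency.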


Also, to get real cache benefit, you need to use a 2D texture allocated with cudaMallocArray.
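A host-side sketch of that setup, using the texture reference API of early CUDA; `tex2dRef`, `hostData`, `width`, and `height` are placeholder names:

```cuda
texture<float, 2, cudaReadModeElementType> tex2dRef;  // file-scope texture reference

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaArray *arr;
cudaMallocArray(&arr, &desc, width, height);          // space-filling-curve layout
cudaMemcpyToArray(arr, 0, 0, hostData,
                  width * height * sizeof(float), cudaMemcpyHostToDevice);
cudaBindTextureToArray(tex2dRef, arr);
// In the kernel: float v = tex2D(tex2dRef, xf, yf);
```

Linear memory bound to a 1D texture still goes through the texture path, but a cudaArray is laid out for 2D locality, which is where the cache pays off.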