I’m trying to understand how the texture cache helps performance. To measure the latency of a cached fetch, I wrote a kernel like this:
texfetch(tex, x); // warm up the cache
start = clock();
value = __int_as_float(start) - texfetch(tex, x); // create dependency on start
ivalue = __float_as_int(value);
end = ivalue - clock(); // create dependency on value
latency[tid] = (ivalue - end) - start; // (ivalue - end) is the second clock() reading, so this is elapsed cycles
When I print the latency on the host side, it reports 200-300 cycles. Assuming the second texfetch hits in the cache, and that the register RAW latency plus the conversion latency should come to <100 cycles, why does the second texfetch still take so long?
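In case anyone wants to reproduce this, here is a minimal self-contained sketch of the same measurement. It is not my exact code: it assumes the texture object API with tex1Dfetch in place of texfetch, and the kernel name texLatency, the sink buffer, the array size, and the single-thread launch are all made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: times one texture fetch after a warm-up fetch to the same address.
__global__ void texLatency(cudaTextureObject_t tex, int x,
                           unsigned int *latency, float *sink)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float warm = tex1Dfetch<float>(tex, x);            // warm up the cache

    unsigned int start = (unsigned int)clock();
    float value = __uint_as_float(start) - tex1Dfetch<float>(tex, x); // create dependency on start
    unsigned int ivalue = __float_as_uint(value);
    unsigned int end = ivalue - (unsigned int)clock(); // create dependency on value

    // (ivalue - end) equals the second clock() reading (unsigned wraparound is well defined).
    latency[tid] = (ivalue - end) - start;

    // Store the fetched values so the compiler cannot discard the fetches.
    sink[tid] = warm + value;
}

int main()
{
    const int n = 256;                                  // assumed array size
    float h_data[n];
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data = NULL, *d_sink = NULL;
    unsigned int *d_latency = NULL;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_sink, sizeof(float));
    cudaMalloc(&d_latency, sizeof(unsigned int));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // Describe the linear device buffer as a 1D float texture.
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_data;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

    texLatency<<<1, 1>>>(tex, 0, d_latency, d_sink);    // one thread, fetch index 0
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    unsigned int h_latency = 0;
    cudaMemcpy(&h_latency, d_latency, sizeof(h_latency), cudaMemcpyDeviceToHost);
    printf("measured latency: %u cycles\n", h_latency);

    cudaDestroyTextureObject(tex);
    cudaFree(d_data);
    cudaFree(d_sink);
    cudaFree(d_latency);
    return 0;
}
```

I launch a single thread here so that other warps don't disturb the measurement. The number printed will depend on the GPU and compiler version, so I wouldn't expect it to match my 200-300 cycles exactly.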
Thanks in advance!