Texture cache throughput in Visual Profiler

I’m using CUDA 4.1 RC2 and I have a question about the texture cache throughput reported by the Visual Profiler. I’m sampling from a 3D texture containing 32-bit floats. Based on my timings, I’m sampling at a rate of about 17 GT/s on a GTX 570, but I can’t make any sense of the counters in the Visual Profiler. It reports a texture cache throughput of 1260 GB/s, a texture cache hit rate of 88.5% and a DRAM read throughput of 69 GB/s. I’m using tri-linear filtering, so each sampling operation should require eight 32-bit floats from the texture cache, but that only accounts for a throughput of 544 GB/s (i.e. roughly half of the reported figure). Secondly, if 11.5% of the accesses are misses, then I should presumably be seeing a DRAM read throughput of about 145 GB/s (i.e. roughly double the reported figure). The specified memory bandwidth of this card is about 150 GB/s. Does anyone have any ideas or suggestions?
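
For anyone wanting to check the numbers, here is the arithmetic behind the two estimates above as a small Python sketch. The constant and function names are made up for illustration; the only inputs are the figures quoted in the post, and the miss-traffic estimate assumes every texture cache miss turns into a DRAM read of the same size.

```python
# Figures quoted in the post (GTX 570, Visual Profiler from CUDA 4.1 RC2).
SAMPLE_RATE_GT_S = 17.0   # measured sampling rate, giga-samples per second
TEX_CACHE_GB_S = 1260.0   # reported texture cache throughput
TEX_HIT_RATE = 0.885      # reported texture cache hit rate
DRAM_READ_GB_S = 69.0     # reported DRAM read throughput

TEXELS_PER_SAMPLE = 8     # tri-linear filtering blends a 2x2x2 neighbourhood
BYTES_PER_TEXEL = 4       # one 32-bit float per texel


def expected_cache_throughput(sample_rate, texels, bytes_per_texel):
    """Bytes per second needed if each sample fetched exactly its texels."""
    return sample_rate * texels * bytes_per_texel


def expected_miss_traffic(cache_throughput, hit_rate):
    """Traffic leaving the texture cache, assuming misses are same-sized reads."""
    return cache_throughput * (1.0 - hit_rate)


expected_tex = expected_cache_throughput(
    SAMPLE_RATE_GT_S, TEXELS_PER_SAMPLE, BYTES_PER_TEXEL
)
expected_miss = expected_miss_traffic(TEX_CACHE_GB_S, TEX_HIT_RATE)

print(f"expected texture cache throughput: {expected_tex:.1f} GB/s "
      f"(reported {TEX_CACHE_GB_S:.1f} GB/s)")
print(f"expected miss traffic: {expected_miss:.1f} GB/s "
      f"(reported DRAM reads {DRAM_READ_GB_S:.1f} GB/s)")
```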

Having enabled all of the other metrics and events, I think I now mostly understand what’s going on. Each sampling operation results in (possibly an average of) 2 reads from the texture cache of 32 bytes each. This is not all that surprising, since the eight 32-bit floats needed for the interpolation already add up to 32 bytes, and the chances of being able to read exactly the right 32 bytes in a single operation are pretty slim. The texture cache hit rate only counts hits in the texture cache itself and not in the L2 cache. So I’m reading from the L2 cache at a rate of about 145 GB/s, where I have a hit rate of about 50%, and the L2 misses are what show up as the 69 GB/s DRAM read throughput. If the apparent 1260 GB/s throughput limit on the texture cache is real, then I can’t really do any better without giving up on the texture unit altogether. Presumably, if I were using float4, I’d hit the same throughput limit with bi-linear (as opposed to tri-linear) filtering.
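
The same reasoning as a rough Python sketch, again using only the figures quoted in this thread. The variable names are illustrative, and the sketch assumes that all texture cache misses go to L2 and that all L2 misses appear as DRAM reads of the same size, which the profiler output suggests but does not prove.

```python
# Figures quoted in the thread.
SAMPLE_RATE_GT_S = 17.0   # measured sampling rate, giga-samples per second
TEX_CACHE_GB_S = 1260.0   # reported texture cache throughput
TEX_HIT_RATE = 0.885      # reported texture cache hit rate
DRAM_READ_GB_S = 69.0     # reported DRAM read throughput
READ_SIZE_BYTES = 32      # size of one texture cache read

# Bytes pulled through the texture cache per sampling operation.
bytes_per_sample = TEX_CACHE_GB_S / SAMPLE_RATE_GT_S

# Average number of 32-byte reads per sampling operation.
reads_per_sample = bytes_per_sample / READ_SIZE_BYTES

# Assumption: every texture cache miss becomes an L2 read of the same size.
l2_read_gb_s = TEX_CACHE_GB_S * (1.0 - TEX_HIT_RATE)

# Assumption: every L2 miss becomes a DRAM read of the same size.
l2_hit_rate = 1.0 - DRAM_READ_GB_S / l2_read_gb_s

print(f"bytes per sample:  {bytes_per_sample:.1f}")
print(f"reads per sample:  {reads_per_sample:.2f}")
print(f"L2 read traffic:   {l2_read_gb_s:.1f} GB/s")
print(f"implied L2 hit rate: {l2_hit_rate:.1%}")
```

With these inputs the average comes out slightly above 2 reads per sample, and the implied L2 hit rate lands close to the 50% mentioned above.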