Variable read-only cache hit rate

I noticed that nvprof reported different read-only cache hit rates across runs of one of my applications. To investigate further, I wrote a small test kernel where I can still reproduce this behaviour:

__global__ void dc(const float * __restrict__ A, float *C, int numElements)
{
    // a single warp strides through the array: i = threadIdx.x, threadIdx.x + 32, ...
    int i = threadIdx.x;

    while (i < numElements)
    {
        C[i] = A[i];
        i += blockDim.x;
    }
}
... allocation, data input copy ...

dc<<<1,32>>>(A, C, 64*1024);

... data output copy, verification and deallocation ...
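
For reference, the elided host code is just the usual setup; it is roughly along the lines below (the includes, variable names, and initialization pattern are illustrative rather than my exact code):

#include <cstdio>
#include <cstdlib>

int main(void)
{
    int numElements = 64 * 1024;
    size_t size = numElements * sizeof(float);

    // illustrative host buffers and init pattern
    float *h_A = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    for (int i = 0; i < numElements; ++i) h_A[i] = (float)i;

    float *d_A = NULL, *d_C = NULL;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

    // single thread block of 32 threads, i.e. one warp
    dc<<<1, 32>>>(d_A, d_C, numElements);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    for (int i = 0; i < numElements; ++i)
        if (h_C[i] != h_A[i]) { printf("mismatch at %d\n", i); break; }

    cudaFree(d_A); cudaFree(d_C);
    free(h_A); free(h_C);
    return 0;
}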

Note that I am launching just one warp (a single thread block of 32 threads). Even for such a small kernel, when I profile with nvprof the read-only cache hit rate varies widely: on repeated runs I have seen 0%, 50%, 75%, 87.5%, and 100%.

Secondly, the texture cache transactions metric (tex_cache_transactions) also varies across runs. Can someone please provide insight into this observation?

Thanks!

On CC 3.5 to CC 5.x GPUs the profiler cannot collect all of the raw counters required for the metrics you listed in a single pass. Moreover, it cannot observe the raw counters from all texture units. Launching a single warp for a test is likely to produce the worst-case results, because on some passes the profiler only observes a texture unit on which no work was allocated. If you perform an optimal launch (1 warp per warp scheduler) and the work is deterministic, you will see more consistent results. A better option would be to saturate the GPU with many waves of consistent work.
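
A sketch of what I mean by many waves of consistent work: a grid-stride variant of your copy kernel, launched with enough blocks to cover every SM (the launch dimensions and element count below are illustrative, not prescriptive):

// Grid-stride version of the same copy, so every SM (and hence every
// texture unit) receives the same kind of work; sizes are illustrative.
__global__ void dc_saturate(const float * __restrict__ A, float *C, int numElements)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < numElements;
         i += gridDim.x * blockDim.x)
    {
        C[i] = A[i];
    }
}

// Many blocks of many warps -> several waves of work across the whole device
// (A and C would need to be allocated at this larger size).
dc_saturate<<<1024, 256>>>(A, C, 64 * 1024 * 1024);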

@Greg : My objective here is to understand the texture cache behaviour. If I use multiple waves, given the way accesses work in that scenario, I cannot really reason about why I see a particular texture cache hit rate. For example, if I increase the number of warps in the above simple application to 512, the reported hit rate is 50% (across all runs). This is highly unintuitive: the application is making coalesced memory accesses, and the texture cache block size is 32 bytes, so each block holds 8 floats and only 1 in 8 accesses should miss, giving an expected hit rate of about 87.5%, not 50%.

Moreover, while I accept that not having enough work leads to less reliable profiler results, what intrigues me is the 100% cache hit rate seen occasionally. IMHO, however little work there is, the profiler should end up under-estimating the hit rate, not over-estimating it.

Having defined the PM events, written the verification tests, and authored the libraries to collect the PMs, I can only say good luck. The texture cache is not intuitive, and it was not designed for linear accesses. The variation you are seeing is due to launching a single warp.