Variable read-only cache hit rate

I noticed that nvprof reported different read-only cache hit rates across runs of one of my applications. To investigate further, I wrote a small test kernel where I can still reproduce this behaviour:

__global__ void dc(const float * __restrict__ A, float *C, int numElements)
{
    // a single warp strides through the array: i = threadIdx.x, threadIdx.x + 32, ...
    int i = threadIdx.x;

    while (i < numElements)
    {
        C[i] = A[i];
        i += blockDim.x;
    }
}
... allocation, data input copy ...

dc<<<1,32>>>(A, C, 64*1024);

... data output copy, verification and deallocation ...
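
For reference, the elided host code is just the usual setup; it is roughly along the lines below (the includes, variable names, and initialization pattern are illustrative rather than my exact code):

#include <cstdio>
#include <cstdlib>

int main(void)
{
    int numElements = 64 * 1024;
    size_t size = numElements * sizeof(float);

    // illustrative host buffers and init pattern
    float *h_A = (float *)malloc(size);
    float *h_C = (float *)malloc(size);
    for (int i = 0; i < numElements; ++i) h_A[i] = (float)i;

    float *d_A = NULL, *d_C = NULL;
    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

    // single thread block of 32 threads, i.e. one warp
    dc<<<1, 32>>>(d_A, d_C, numElements);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    for (int i = 0; i < numElements; ++i)
        if (h_C[i] != h_A[i]) { printf("mismatch at %d\n", i); break; }

    cudaFree(d_A); cudaFree(d_C);
    free(h_A); free(h_C);
    return 0;
}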

Note that I am launching just one warp (a single thread block of 32 threads). Even for such a small kernel, when I profile with nvprof the read-only cache hit rate varies widely: on repeated runs I have seen 0%, 50%, 75%, 87.5%, and 100%.

Secondly, the texture cache transactions metric (tex_cache_transactions) also varies across runs. Can someone please provide insight into this observation?

Thanks!

On CC 3.5 to CC 5.x GPUs the profiler cannot collect all of the raw counters required for the metrics you listed in a single pass. Moreover, it cannot observe the raw counters from all texture units. Launching a single warp for a test is likely to produce the worst-case results, because on some passes the profiler only observes a texture unit on which no work was allocated. If you perform an optimal launch (1 warp per warp scheduler) and the work is deterministic, you will see more consistent results. A better option would be to saturate the GPU with many waves of consistent work.
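
A sketch of what I mean by many waves of consistent work: a grid-stride variant of your copy kernel, launched with enough blocks to cover every SM (the launch dimensions and element count below are illustrative, not prescriptive):

// Grid-stride version of the same copy, so every SM (and hence every
// texture unit) receives the same kind of work; sizes are illustrative.
__global__ void dc_saturate(const float * __restrict__ A, float *C, int numElements)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < numElements;
         i += gridDim.x * blockDim.x)
    {
        C[i] = A[i];
    }
}

// Many blocks of many warps -> several waves of work across the whole device
// (A and C would need to be allocated at this larger size).
dc_saturate<<<1024, 256>>>(A, C, 64 * 1024 * 1024);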

@Greg : My objective here is to understand the texture cache behaviour. If I use multiple waves, given the way accesses work in that scenario, I cannot really reason about why I see a particular texture cache hit rate. For example, if I increase the number of warps in the above simple application to 512, the reported hit rate is 50% (across all runs). This is highly unintuitive: the application is making coalesced memory accesses, and the texture cache block size is 32 bytes, so each block holds 8 floats and only 1 in 8 accesses should miss, giving an expected hit rate of about 87.5%, not 50%.

Moreover, while I accept that not having enough work leads to less reliable profiler results, what intrigues me is the 100% cache hit rate seen occasionally. IMHO, however little work there is, the profiler should end up under-estimating the hit rate, not over-estimating it.

Having defined the PM events, written the verification tests, and authored the libraries to collect the PMs, I can only say good luck. The texture cache is not intuitive, and it was not designed for linear accesses. The variation you are seeing is due to launching a single warp.