L2 cache read misses vs L2 cache write misses

Hi all,

When I profile my application using nvprof 5.0, I see a 90% L2 cache read hit rate but only a 3% L2 cache write hit rate.

I basically read from a global array and do some computations and write back to the same array.

Can anyone help me figure out what causes the 97% L2 cache write misses? I expected the L2 read and write hit rates to be almost the same.

Let me know if you need more information.

I appreciate your guidance.

GPU: GTX 480
I have bypassed the L1 cache by using the compiler option "-Xptxas -dlcm=cg".

Thanks,
Waruna

P.S. Sorry, I cross-posted the same question in the Visual Profiler forum before I found that this forum suits my query better.

How big is the array?

Hi seibert,

Sorry for the late reply.

I have put together a simple sample. In this case the results are not the same as the ones I mentioned in my original post, but the theoretical number of cache misses still does not tally with the output from the profiler (nvprof).

The sample kernel is for vector addition.

Kernel code:

__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
    // Each block owns a contiguous chunk of N * blockDim.x elements.
    int blockStartIndex = blockIdx.x * blockDim.x * N;
    int threadStartIndex = blockStartIndex + threadIdx.x;
    int threadEndIndex = threadStartIndex + (N * blockDim.x);

    // Stride by blockDim.x so consecutive threads touch consecutive elements.
    for (int i = threadStartIndex; i < threadEndIndex; i += blockDim.x) {
        C[i] = A[i] + B[i];
    }
}

N is the number of elements processed by a single thread.
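
To make the indexing concrete, here is a rough host-side emulation of the kernel's loop bounds in plain C++. It only mimics the index arithmetic; the function names `touchCounts` and `coversEachElementOnce` are made up for this illustration and are not part of the original program.

```cpp
#include <cstddef>
#include <vector>

// Host-side emulation of the AddVectors index arithmetic (hypothetical helper).
// gridDim, blockDim and n stand in for gridDim.x, blockDim.x and N.
// Returns how many times each element index is touched across the whole grid.
std::vector<int> touchCounts(int gridDim, int blockDim, int n) {
    std::vector<int> counts(static_cast<std::size_t>(gridDim) * blockDim * n, 0);
    for (int block = 0; block < gridDim; ++block) {
        for (int thread = 0; thread < blockDim; ++thread) {
            // Same three expressions as in the kernel.
            int blockStartIndex = block * blockDim * n;
            int threadStartIndex = blockStartIndex + thread;
            int threadEndIndex = threadStartIndex + n * blockDim;
            for (int i = threadStartIndex; i < threadEndIndex; i += blockDim) {
                ++counts[i];
            }
        }
    }
    return counts;
}

// True when every element is processed by exactly one loop iteration.
bool coversEachElementOnce(int gridDim, int blockDim, int n) {
    for (int c : touchCounts(gridDim, blockDim, n)) {
        if (c != 1) return false;
    }
    return true;
}
```

Calling `coversEachElementOnce(60, 128, 1)` with the launch configuration used here should confirm that each of the 7680 elements is read and written exactly once, so no extra traffic is expected from overlapping accesses.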

In my experiments, N = 1.

Size of each array = 60 * 128 * N = 60 * 128 * 1 = 7680 floats = 30,720 bytes (30 KB)
grid size = 60
threads per block = 128
GPU: GTX 480
CUDA 5.0
L1 cache is disabled using the "--ptxas-options -dlcm=cg" compiler option

Theoretical results:
L2 read requests = 960 + 960 = 1920 (32 byte accesses)
L2 write requests = 960 (32 byte accesses)

All of the above should be L2 cache misses.
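
For reference, the arithmetic behind these numbers can be written out as a small C++ sketch. It assumes every element is fetched exactly once in 32-byte sectors and nothing is already cached; the names `sectorsPerArray`, `SectorEstimate` and `estimate` are hypothetical helpers for this illustration.

```cpp
#include <cstddef>

// 32-byte access granularity, as used in the counts above.
constexpr std::size_t kSectorBytes = 32;

// Number of 32-byte sectors needed to cover one float array (rounded up).
constexpr std::size_t sectorsPerArray(std::size_t elements) {
    return (elements * sizeof(float) + kSectorBytes - 1) / kSectorBytes;
}

struct SectorEstimate {
    std::size_t reads;   // sectors read from A and B
    std::size_t writes;  // sectors written to C
};

// Expected sector requests for grid * block * n elements per array.
constexpr SectorEstimate estimate(std::size_t grid, std::size_t block, std::size_t n) {
    return SectorEstimate{2 * sectorsPerArray(grid * block * n),
                          sectorsPerArray(grid * block * n)};
}
```

With grid = 60, block = 128 and N = 1, each array is 30,720 bytes, which is 960 sectors, giving 1920 expected read requests and 960 expected write requests.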

Experimental results using nvprof:

L2 write cache misses = l2_subp0_write_sector_misses + l2_subp1_write_sector_misses = 480 + 480 = 960 (32 byte accesses)
L2 read cache misses = l2_subp0_read_sector_misses + l2_subp1_read_sector_misses = 16 + 14 = 30 (32 byte accesses)

Now this is the problem: theoretically there should be 1920 read misses, but nvprof reports only 30. To me this looks impossible.

It would be great if someone could shed some light on this. I can also post the complete program if someone is interested. Please let me know if you need more details.

Thanks in advance.
Waruna

Furthermore, I tried N = 2, which doubles the size of the arrays.

As expected, the number of L2 write cache misses doubles, but the number of L2 read cache misses stays at 30.

It seems that when I use cudaMemcpy to copy the arrays from host to device, all the arrays end up in the L2 cache. Therefore, all the reads become cache hits.
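
A quick capacity check shows this is at least plausible. The sketch below assumes the 768 KB L2 size from the published GTX 480 (GF100) specifications, which is not a number taken from the profiler output; `footprintBytes` and `fitsInL2` are hypothetical helper names.

```cpp
#include <cstddef>

// Assumption: GTX 480 (GF100) L2 capacity from the published specs.
constexpr std::size_t kL2Bytes = 768 * 1024;

// Combined size in bytes of the three float arrays A, B and C.
constexpr std::size_t footprintBytes(std::size_t grid, std::size_t block, std::size_t n) {
    return 3 * grid * block * n * sizeof(float);
}

// True when all three arrays could be resident in L2 at the same time.
constexpr bool fitsInL2(std::size_t grid, std::size_t block, std::size_t n) {
    return footprintBytes(grid, block, n) <= kL2Bytes;
}
```

For N = 1 the three arrays total 92,160 bytes and for N = 2 they total 184,320 bytes, both well under 768 KB. That is consistent with the read misses staying flat when N doubles, although it does not by itself prove that cudaMemcpy populates the cache.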

My next question: I also copy array C from host to device, yet all the writes to C become cache misses. Is this the expected behavior?

In which situations can I expect write cache hits?

Thanks,
Waruna

It seems that the L2 cache is a write-through cache, so all write requests to L2 are counted as misses.

Thanks
Waruna