I have formulated a simple example. In this case the results are not the same as those I mentioned in my original post, but the theoretical number of cache misses still does not tally with the output from the profiler (nvprof).
The sample kernel performs vector addition.
Kernel code:

__global__ void AddVectors(const float* A, const float* B, float* C, int N)
{
    int blockStartIndex  = blockIdx.x * blockDim.x * N;
    int threadStartIndex = blockStartIndex + threadIdx.x;
    int threadEndIndex   = threadStartIndex + (N * blockDim.x);
    int i;

    for (i = threadStartIndex; i < threadEndIndex; i += blockDim.x)
    {
        C[i] = A[i] + B[i];
    }
}
It seems that when I use cudaMemcpy to copy the arrays from host to device, all of the arrays end up resident in the L2 cache, so all of the reads become cache hits.
My next question: I also copy array C from host to device, yet all of the writes to C are counted as cache misses. Is this the expected behavior?
In which situations can I expect write cache hits?