L2 read/write misses greater than requests

Following is a snapshot of the L2 read/write requests, L2 read/write misses, and DRAM reads/writes reported by the CUDA Visual Profiler.

L2 read   L2 write   L2 read   L2 write   dram    dram
req       req        misses    misses     reads   writes
========================================================
242       32         7592      130        8842    130
796       2092       2490      2916       2646    2916
204       0          2058      482        2254    482
800       2057       2460      2836       1764    2836
220       0          2066      467        2182    467
792       2089       2420      2875       2604    2875
220       0          2022      474        2170    474

l2 read requests: number of read requests from L1 to the L2 cache. Increments by 1 for each 32-byte access.

l2 write requests: number of write requests from L1 to the L2 cache. Increments by 1 for each 32-byte access.

l2 read misses: number of read misses in the L2 cache. Increments by 1 for each 32-byte access.

l2 write misses: number of write misses in the L2 cache. Increments by 1 for each 32-byte access.

dram reads: number of read requests to DRAM. Increments by 1 for each 32-byte access.

dram writes: number of write requests to DRAM. Increments by 1 for each 32-byte access.
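To make the units concrete: since each counter ticks once per 32-byte transaction, a fully coalesced pass over N floats should produce roughly 4*N/32 L2 read requests. A minimal sketch (the kernel name and sizes here are mine, just for illustration):

```cpp
// Hypothetical sanity-check kernel: each thread reads one float and
// writes one float, so N threads touch 4*N bytes in each direction.
// With fully coalesced accesses that should show up as roughly
// 4*N/32 L2 read requests and 4*N/32 L2 write requests.
__global__ void copy_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Example launch: n = 1 << 20 floats = 4 MB per direction,
// i.e. about 131072 32-byte L2 read requests expected:
// copy_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```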

I consistently see that the L2 read/write misses are much higher than the L2 read/write requests. How can this happen? Any suggestions would be appreciated.

A texture access (in case of a texture cache miss) is not counted as an L2 read request, but it is counted as an L2 read miss (if it misses in L2 too).

Weird… Why is a texture cache read miss counted as an L2 read miss? The L2 cache and the texture cache are not physically the same thing!

Think of the texture cache as an alternate L1 cache. It reads its data from L2 just like the normal L1 cache does.
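In code, the two read paths look like this (a rough sketch using the legacy texture reference API of that era; the kernel and variable names are made up). A miss in the texture cache falls through to L2 just like an L1 miss does, which would explain an L2 read miss with no matching L1-to-L2 read request:

```cpp
// Hypothetical illustration of the two read paths on Fermi.
texture<float, 1, cudaReadModeElementType> tex_in;  // assumed name

__global__ void read_via_texture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex_in, i);  // texture cache -> L2 -> DRAM
}

__global__ void read_via_l1(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                  // L1 cache -> L2 -> DRAM
}

// Host side (before launching read_via_texture):
// cudaBindTexture(0, tex_in, d_in, n * sizeof(float));
```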

Essentially, there were no texture cache requests for this application.

Session1 - Device_0 - Context_0 [CUDA] : Profiler table column ‘tex cache requests’ having all zero values is hidden.

Session1 - Device_0 - Context_0 [CUDA] : Profiler table column ‘tex cache misses’ having all zero values is hidden.

I am attaching the CSV file for confirmation.
global_queue.txt (23.1 KB)

Do you have any source to back up this claim? As far as I know, a texture fetch does not consume the normal global memory bandwidth. The programming guide says that it takes a different path so that pressure on global memory bandwidth is reduced. Also, four MPs share a single texture cache of 24 KB on CC 2.0.

Can you upload the full source code? Or you could try scaling down your kernel launch and accurately calculating the expected number of L2 reads/writes first.
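For example (a sketch, with sizes and names of my own choosing), a single-warp launch makes the expected counts trivial to compute by hand:

```cpp
// Hypothetical minimal test case for calibrating the L2 counters.
// One block of 32 threads reads 32 floats (128 bytes) and writes
// 32 floats, so we expect 128/32 = 4 L2 read requests and
// 4 L2 write requests if nothing else touches memory.
__global__ void tiny_probe(const float *in, float *out)
{
    out[threadIdx.x] = in[threadIdx.x];
}

// tiny_probe<<<1, 32>>>(d_in, d_out);
```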

The framebuffer data for your display will be counted in the L2 misses. So if you want to eliminate this error, you had better install two GPU cards: one for the display, the other for kernel computation.

Seems like a good suggestion. I will check the profiler output after putting two cards in my box. Hopefully, with two GPU cards, the L2/DRAM statistics will be more reasonable.
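If it helps, here is a rough sketch for picking the compute-only card programmatically once two are installed. This is a heuristic only: a device with the kernel execution watchdog enabled is usually the one driving a display.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Heuristic: prefer a GPU without the kernel execution timeout
// (watchdog), since the watchdog is normally enabled only on the
// device that drives a display.
int pick_compute_device(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        if (!prop.kernelExecTimeoutEnabled) {
            printf("using device %d: %s\n", d, prop.name);
            cudaSetDevice(d);
            return d;
        }
    }
    return 0;  // fall back to the default device
}
```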

What if the GPU is not attached to a monitor and the machine is accessed remotely?

Did any of you ever get an answer on this?
Does the second card help?
How about performance? Is there a significant difference?

The L1 cache resides in the SM, while the L2 cache resides in the memory controllers. AFAIK the profiler collects data from only a subset of the SMs and memory controllers. So depending on how the memory accesses from the SMs map to the memory controllers, any ratio of requests to misses seems to be possible.