L2 read/write misses greater than requests

Following is a snapshot of the L2 read/write requests, L2 read/write misses, and DRAM reads/writes reported by the CUDA Visual Profiler.

L2 read   L2 write   L2 read   L2 write   dram    dram
req       req        misses    misses     reads   writes
========================================================
242       32         7592      130        8842    130
796       2092       2490      2916       2646    2916
204       0          2058      482        2254    482
800       2057       2460      2836       1764    2836
220       0          2066      467        2182    467
792       2089       2420      2875       2604    2875
220       0          2022      474        2170    474

l2 read requests: number of read requests from L1 to the L2 cache. Increments by 1 for each 32-byte access.

l2 write requests: number of write requests from L1 to the L2 cache. Increments by 1 for each 32-byte access.

l2 read misses: number of read misses in the L2 cache. Increments by 1 for each 32-byte access.

l2 write misses: number of write misses in the L2 cache. Increments by 1 for each 32-byte access.

dram reads: number of read requests to DRAM. Increments by 1 for each 32-byte access.

dram writes: number of write requests to DRAM. Increments by 1 for each 32-byte access.
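To make the units concrete: since each counter ticks once per 32-byte transaction, a fully coalesced pass over N floats should produce roughly 4*N/32 L2 read requests. A minimal sketch (the kernel name and sizes here are mine, just for illustration):

```cpp
// Hypothetical sanity-check kernel: each thread reads one float and
// writes one float, so N threads touch 4*N bytes in each direction.
// With fully coalesced accesses that should show up as roughly
// 4*N/32 L2 read requests and 4*N/32 L2 write requests.
__global__ void copy_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Example launch: n = 1 << 20 floats = 4 MB per direction,
// i.e. about 131072 32-byte L2 read requests expected:
// copy_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```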

I consistently see that the L2 read/write misses are much higher than the L2 read/write requests. How can this happen? Any suggestions would be appreciated.

A texture access (in case of a texture cache miss) is not counted as an L2 read request, but it is counted as an L2 read miss (if it misses in L2 too).

Weird… Why is a texture cache read miss counted as an L2 read miss? The L2 cache and the texture cache are not physically the same thing!

Think of the texture cache as an alternate L1 cache. It reads its data from L2 just like the normal L1 cache does.
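In code, the two read paths look like this (a rough sketch using the legacy texture reference API of that era; the kernel and variable names are made up). A miss in the texture cache falls through to L2 just like an L1 miss does, which would explain an L2 read miss with no matching L1-to-L2 read request:

```cpp
// Hypothetical illustration of the two read paths on Fermi.
texture<float, 1, cudaReadModeElementType> tex_in;  // assumed name

__global__ void read_via_texture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(tex_in, i);  // texture cache -> L2 -> DRAM
}

__global__ void read_via_l1(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];                  // L1 cache -> L2 -> DRAM
}

// Host side (before launching read_via_texture):
// cudaBindTexture(0, tex_in, d_in, n * sizeof(float));
```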

Essentially, there were no texture cache requests for this application.

Session1 - Device_0 - Context_0 [CUDA] : Profiler table column ‘tex cache requests’ having all zero values is hidden.

Session1 - Device_0 - Context_0 [CUDA] : Profiler table column ‘tex cache misses’ having all zero values is hidden.

I am attaching the CSV file for confirmation.
global_queue.txt (23.1 KB)

Do you have any source to back up this claim? As far as I know, a texture fetch does not consume the normal global memory bandwidth. The programming guide says that it takes a different path so that pressure on global memory bandwidth is reduced. Also, four MPs share a single texture cache of 24 KB on CC 2.0.

Can you upload the full source code? Or you could try scaling down your kernel launch and accurately calculating the expected number of L2 reads/writes first.
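For example (a sketch, with sizes and names of my own choosing), a single-warp launch makes the expected counts trivial to compute by hand:

```cpp
// Hypothetical minimal test case for calibrating the L2 counters.
// One block of 32 threads reads 32 floats (128 bytes) and writes
// 32 floats, so we expect 128/32 = 4 L2 read requests and
// 4 L2 write requests if nothing else touches memory.
__global__ void tiny_probe(const float *in, float *out)
{
    out[threadIdx.x] = in[threadIdx.x];
}

// tiny_probe<<<1, 32>>>(d_in, d_out);
```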

The framebuffer data for your display will be counted in the L2 misses. So if you want to eliminate this error, you had better install two GPU cards: one for the display, the other for kernel computation.

Seems like a good suggestion. I will check the profiler output after putting two cards in my box. Hopefully, with two GPU cards, the L2/DRAM statistics will be more reasonable.
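If it helps, here is a rough sketch for picking the compute-only card programmatically once two are installed. This is a heuristic only: a device with the kernel execution watchdog enabled is usually the one driving a display.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Heuristic: prefer a GPU without the kernel execution timeout
// (watchdog), since the watchdog is normally enabled only on the
// device that drives a display.
int pick_compute_device(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        if (!prop.kernelExecTimeoutEnabled) {
            printf("using device %d: %s\n", d, prop.name);
            cudaSetDevice(d);
            return d;
        }
    }
    return 0;  // fall back to the default device
}
```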

What if the GPU is not attached to a monitor and the machine is accessed remotely?

Did any of you ever get an answer on this?
Does the second card help?
How about performance? Is there a significant difference?

The L1 cache resides in the SM, while the L2 cache resides in the memory controllers. AFAIK the profiler collects data from only a subset of the SMs and memory controllers. So depending on how the memory accesses from the SMs map to the memory controllers, any ratio of requests to misses seems to be possible.