l2 read requests: Number of read requests from L1 to the L2 cache. This increments by 1 for each 32-byte access.
l2 write requests: Number of write requests from L1 to the L2 cache. This increments by 1 for each 32-byte access.
l2 read misses: Number of read misses in the L2 cache. This increments by 1 for each 32-byte access.
l2 write misses: Number of write misses in the L2 cache. This increments by 1 for each 32-byte access.
dram reads: Number of read requests to DRAM. This increments by 1 for each 32-byte access.
dram writes: Number of write requests to DRAM. This increments by 1 for each 32-byte access.
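To make the 32-byte granularity concrete, here is a minimal sketch (the kernel name and sizes are hypothetical, and the arithmetic assumes fully coalesced accesses that miss in L1): one warp loading 32 consecutive floats touches 128 bytes, i.e. four 32-byte sectors, so it should add 4 to l2 read requests, and the matching store should add 4 to l2 write requests.

// Hypothetical sketch: fully coalesced copy kernel.
// Per warp: 32 consecutive floats = 128 bytes = four 32-byte sectors,
// so each warp is expected to add 4 to "l2 read requests" (if the
// loads miss in L1) and 4 to "l2 write requests".
__global__ void copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
// For n = 1 << 20 floats, 4 MB is read and 4 MB is written, so each
// counter should be on the order of 4 MB / 32 B = 131072, scaled by
// whatever fraction of the hardware units the profiler actually samples.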
I consistently see that the L2 read/write misses are much higher than the L2 read/write requests. How can this happen? Any suggestions would be appreciated.
Do you have any source to back up this claim? As far as I know, texture fetches do not consume the normal global memory bandwidth; the programming guide says they take a different path, which reduces the pressure on gmem bandwidth. Also, on CC 2.0 four MPs share a single 24 KB texture cache.
The framebuffer data for your display will be counted in the L2 misses. So if you want to eliminate that error, you had better install two GPU cards: one for the display and the other for kernel computation.
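If you go that route, device selection can be automated. The sketch below (my own, not from any official docs) uses the kernelExecTimeoutEnabled property as a rough proxy for "this GPU drives a display", since the run-time watchdog is normally enabled only on display-attached cards.

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: pick a GPU that is not running the display watchdog, on the
// assumption that kernelExecTimeoutEnabled is set on the display card.
int pickComputeDevice(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        if (!prop.kernelExecTimeoutEnabled) {
            cudaSetDevice(d);          // run kernels on the non-display card
            printf("Using device %d: %s\n", d, prop.name);
            return d;
        }
    }
    return 0; // every device has a watchdog; fall back to the default
}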
Seems like a good suggestion. I will check the profiler output after putting a second card in my box. Hopefully with two GPU cards the L2/DRAM statistics will look more reasonable.
The L1 caches reside in the SMs, the L2 cache in the memory controllers. AFAIK the profiler collects data only on a subset of the SMs and memory controllers, so depending on how the memory accesses from the SMs map onto the memory controllers, any ratio of misses to requests seems possible.
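To put invented numbers on that: suppose the request counters are read from one SM out of 16 and the miss counters from one memory controller out of 6. If the sampled SM issues 1,000 of the chip's 16,000 L2 requests while the sampled controller happens to absorb 3,000 of the 9,000 total misses, the profiler reports 3,000 misses against 1,000 requests, i.e. an apparent miss rate above 100%, even though globally 9,000 misses out of 16,000 requests is perfectly plausible.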