For several programs (not just one), I see that cache utilization (L2 and unified) is low for most of the kernels. These are not toy programs. Is that normal? The device is an M2000.
I would also like to know how cache utilization is measured. I couldn't find any explanation of that in the documentation.
@mahmood.nt Can you please provide additional information on which metrics you are collecting?
This response applies to the Maxwell and Pascal architectures. The M2000 is a Maxwell GPU.
On the M2000 the L1 and TEX caches are unified. Each SM has two separate unified caches. The CUPTI metrics are not very good throughput/utilization metrics. The unified cache has multiple throughputs that should be considered together to determine utilization. These include:
1. sm to tex interface (request and data)
2. tex to sm interface (return data)
3. tex to l2 interface (request and data)
4. l2 to tex interface (return data)
5. tex unit sampling, tag lookup, filtering, and data stages
The CUPTI metric tex_utilization measures only (2), the tex to sm interface.
The L2 cache has multiple throughput metrics that should be used to determine utilization. These include:
- client to L2 slice interface (request and data)
- L2 slice to client interface (return data)
- L2 slice to memory controller interface (request and store data)
- memory controller to L2 slice interface (return data)
- tag lookup
- L2 slice to atomic unit interface
- L2 slice to CROP/ZROP interface
The CUPTI metric l2_utilization is the total read/write bytes from/to L2 divided by the theoretical maximum read/writes normalized by the ratio of SMs to L2 slices.
The Perfworks library used by Nsight Compute and the Nsight VSE CUDA Profiler covers all of the above metrics, whereas CUPTI has a simplified version.
tex_utilization will be low in the following cases:
1. kernel is compute limited
2. kernel is latency limited (waiting on data returns from L2)
3. kernel has divergent memory accesses, reducing read bandwidth
4. kernel is store heavy, not load heavy
5. kernel is using texture fetches with lower-throughput filtering (trilinear, anisotropic)
In cases 3-5 the metric should ideally report high utilization. Perfworks-based tools would report high utilization for those cases.
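To make the difference between a single-interface metric and a multi-interface view concrete, here is a minimal Python sketch. It assumes that a unit's utilization can be summarized as the busiest of its interfaces; that rule is only an illustration and is not a statement of how Perfworks or CUPTI actually compute their figures. The function name `busiest_interface` and all percentages are hypothetical, made up to resemble a store-heavy kernel.

```python
from typing import Dict, Tuple


def busiest_interface(throughput_pct: Dict[str, float]) -> Tuple[str, float]:
    """Return the (name, percent) of the most heavily used interface.

    Assumption for illustration: the unit is as busy as its busiest interface.
    """
    if not throughput_pct:
        raise ValueError("need at least one interface throughput")
    for name, pct in throughput_pct.items():
        if not 0.0 <= pct <= 100.0:
            raise ValueError(f"{name}: {pct} is not a percentage")
    name = max(throughput_pct, key=throughput_pct.get)
    return name, throughput_pct[name]


# Hypothetical numbers for a store-heavy kernel: lots of data flows from the
# SM towards the cache and L2, very little comes back.
store_heavy = {
    "sm_to_tex": 85.0,
    "tex_to_sm": 10.0,
    "tex_to_l2": 70.0,
    "l2_to_tex": 12.0,
}

# A metric that only looks at the return path sees a nearly idle unit.
return_path_only = store_heavy["tex_to_sm"]

# Looking at every interface shows that the unit is actually busy.
limiter, combined = busiest_interface(store_heavy)

print(f"return path only: {return_path_only:.0f}%")
print(f"all interfaces:   {combined:.0f}% (limited by {limiter})")
```

With these made-up inputs the return-path view reads as low utilization while the combined view does not, which is the kind of mismatch described above for store-heavy kernels.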
Thanks for the explanation. I will provide some real data.
Can you explain this part: "theoretical maximum read/writes normalized by the ratio of SMs to L2 slices"?
I use the nvprof command, which AFAIK is the best tool. Does it use CUPTI?
Is there a metric that shows how much a kernel is waiting on data returns from L2?
I have uploaded a picture showing the profiling results for a TensorFlow job: https://pasteboard.co/IeQS9Vy.png
As you can see L2 utilization is always low. For high issue slot utilization (the last two rows), it may be reasonable that cache utilization is low. However, the (last-3) row shows that issue slot utilization is low while device memory utilization is high.
So why does that happen?
I should add that the TensorFlow job uses all 4 GB of the M2000's memory.
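For anyone who wants to reproduce these numbers, the following Python sketch builds an nvprof command line that collects the utilization metrics discussed in this thread. It relies on nvprof's `--metrics` and `--log-file` options. The helper name `build_nvprof_command` and the script name `train.py` are hypothetical, and the metric names are assumed to be available on the device; `nvprof --query-metrics` lists what a given GPU supports.

```python
import shlex
from typing import List, Optional, Sequence

# Metrics mentioned in this thread, using nvprof's metric names.
THREAD_METRICS = [
    "l2_utilization",
    "tex_utilization",
    "issue_slot_utilization",
    "dram_utilization",
]


def build_nvprof_command(
    app_argv: Sequence[str],
    metrics: Sequence[str] = THREAD_METRICS,
    log_file: Optional[str] = None,
) -> List[str]:
    """Build an argv list that profiles `app_argv` under nvprof."""
    if not app_argv:
        raise ValueError("app_argv must name the program to profile")
    cmd = ["nvprof", "--metrics", ",".join(metrics)]
    if log_file is not None:
        # Write the profiler output to a file instead of the console.
        cmd += ["--log-file", log_file]
    cmd += list(app_argv)
    return cmd


if __name__ == "__main__":
    # `train.py` is a placeholder for the real TensorFlow script.
    cmd = build_nvprof_command(["python", "train.py"], log_file="metrics.log")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Collecting metrics makes nvprof replay kernels, so a profiled run can take much longer than a normal one.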