Hi, I’m trying to measure the aggregate throughput between SMs and L1 cache/SMEM when running my code. Initially, I thought
gld_throughput was the metric I was looking for, but
gld_throughput doesn’t seem to cover local, texture and shared memory loads.
So I’m now using the sum of
shared_load_throughput. Is this method sound?
You should really be using Nsight Compute to profile Volta and newer architectures. I’ve provided links below to help you get up and running.
What exactly are you trying to measure?
Thanks, I will use Nsight Compute instead.
I am trying to make a hierarchical roofline plot to analyze my kernel and need to measure L1, L2 and HBM throughput. The paper below does the same analysis I’d like to achieve. Hope this is clear enough.
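To make the goal concrete, here is a minimal sketch of the arithmetic behind a hierarchical roofline. It assumes the kernel's FLOP count and the bytes moved at each memory level (L1, L2, HBM) have already been collected from a profiler. The names `MemoryLevel`, `hierarchical_roofline`, and every number in the usage example are hypothetical placeholders, not measured values or profiler API.

```python
from dataclasses import dataclass


@dataclass
class MemoryLevel:
    """One level of the hierarchy (hypothetical container for illustration)."""
    name: str
    peak_bandwidth_gbs: float  # peak bandwidth of this level in GB/s
    bytes_moved: float         # bytes transferred at this level by the kernel


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved at a given memory level."""
    return flops / bytes_moved


def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Classic roofline bound: min(compute peak, bandwidth * intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def hierarchical_roofline(flops: float, peak_gflops: float, levels: list) -> dict:
    """Return {level name: (arithmetic intensity, attainable GFLOP/s)}."""
    result = {}
    for level in levels:
        ai = arithmetic_intensity(flops, level.bytes_moved)
        bound = attainable_gflops(peak_gflops, level.peak_bandwidth_gbs, ai)
        result[level.name] = (ai, bound)
    return result


if __name__ == "__main__":
    # Placeholder numbers only; substitute measured counters and device peaks.
    levels = [
        MemoryLevel("L1", peak_bandwidth_gbs=1000.0, bytes_moved=4e9),
        MemoryLevel("L2", peak_bandwidth_gbs=500.0, bytes_moved=2e9),
        MemoryLevel("HBM", peak_bandwidth_gbs=100.0, bytes_moved=1e9),
    ]
    for name, (ai, bound) in hierarchical_roofline(1e9, 200.0, levels).items():
        print(f"{name}: AI = {ai:.2f} FLOP/byte, bound = {bound:.1f} GFLOP/s")
```

Each level gets its own arithmetic intensity because the same FLOP count is divided by a different byte count, which is why the plot needs separate L1, L2 and HBM measurements rather than a single throughput figure.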
Nsight Compute has a roofline capability built in.
I tried but it’s only showing the HBM roofline. Is there a way to plot L1 and L2 rooflines?