Measuring L1/SMEM throughput on V100 using nvprof

Hi, I’m trying to measure the aggregate throughput between the SMs and the L1 cache/shared memory (SMEM) when running my code. Initially, I thought gld_throughput was the metric I was looking for, but it doesn’t seem to cover local, texture, and shared memory loads.

So I’m now using the sum of gld_throughput, local_load_throughput, tex_cache_throughput and shared_load_throughput. Is this method sound?

Thank you.
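For concreteness, the sum described in the question could be computed with a small script like the one below. This is only a sketch: it assumes each metric value is available as a string with a unit suffix (for example "12.5GB/s"), and the helper names to_gbps and aggregate_load_throughput are hypothetical. Note that all four metrics are load throughputs, so the result does not include stores.

```python
import re

# Metric names taken from the question above.
LOAD_METRICS = (
    "gld_throughput",
    "local_load_throughput",
    "tex_cache_throughput",
    "shared_load_throughput",
)

# Assumption: values carry one of these unit suffixes.
_UNIT_TO_GBPS = {"B/s": 1e-9, "KB/s": 1e-6, "MB/s": 1e-3, "GB/s": 1.0, "TB/s": 1e3}


def to_gbps(text):
    """Convert a string such as '12.5GB/s' to a float in GB/s."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([KMGT]?B/s)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized throughput value: {text!r}")
    return float(match.group(1)) * _UNIT_TO_GBPS[match.group(2)]


def aggregate_load_throughput(values):
    """Sum the four load-throughput metrics, in GB/s.

    `values` maps metric name -> throughput string for one kernel.
    Raises KeyError if any of the four metrics is missing.
    """
    return sum(to_gbps(values[name]) for name in LOAD_METRICS)


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not measured data.
    sample = {
        "gld_throughput": "100.0GB/s",
        "local_load_throughput": "500MB/s",
        "tex_cache_throughput": "20.5GB/s",
        "shared_load_throughput": "1.0TB/s",
    }
    print(f"aggregate load throughput: {aggregate_load_throughput(sample):.1f} GB/s")
```

Whether these four metrics overlap or miss traffic on a given architecture is exactly the open question here, so the script only automates the arithmetic and does not validate the method.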

You should really be using Nsight Compute to profile Volta and newer architectures. I’ve provided links below to help you get up and running.

What exactly are you trying to measure?

https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels/

https://docs.nvidia.com/cupti/Cupti/r_main.html#r_host_derived_metrics_api

Thanks, I will use Nsight Compute instead.

I am trying to make a hierarchical roofline plot to analyze my kernel, so I need to measure L1, L2, and HBM throughput. The paper below does the same analysis I’d like to do. I hope this is clear enough.

Nsight Compute has a roofline capability built in.

I tried it, but it only shows the HBM roofline. Is there a way to plot the L1 and L2 rooflines as well?
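Independently of which tool draws the chart, the numbers behind a hierarchical roofline can be computed by hand once the FLOP count and the bytes moved at each memory level are known. The sketch below uses the standard roofline bound, attainable = min(peak, bandwidth x arithmetic intensity), applied once per level. The function names are hypothetical and every number in the example is a placeholder, not a V100 specification or a measurement.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs per byte for one memory level."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return flops / bytes_moved


def attainable_gflops(peak_gflops, bandwidth_gbps, intensity):
    """Roofline bound: limited by compute peak or by bandwidth * intensity."""
    return min(peak_gflops, bandwidth_gbps * intensity)


def hierarchical_roofline(flops, bytes_per_level, bandwidth_per_level, peak_gflops):
    """Return {level: (arithmetic intensity, attainable GFLOP/s)}.

    bytes_per_level:     bytes moved at each level, e.g. {"L1": ..., "L2": ..., "HBM": ...}
    bandwidth_per_level: peak bandwidth in GB/s for the same levels
    """
    points = {}
    for level, nbytes in bytes_per_level.items():
        ai = arithmetic_intensity(flops, nbytes)
        bound = attainable_gflops(peak_gflops, bandwidth_per_level[level], ai)
        points[level] = (ai, bound)
    return points


if __name__ == "__main__":
    # Placeholder values for illustration only.
    points = hierarchical_roofline(
        flops=1.0e9,
        bytes_per_level={"L1": 4.0e9, "L2": 2.0e9, "HBM": 1.0e9},
        bandwidth_per_level={"L1": 400.0, "L2": 300.0, "HBM": 100.0},
        peak_gflops=120.0,
    )
    for level, (ai, bound) in points.items():
        print(f"{level}: AI = {ai:.2f} FLOP/byte, bound = {bound:.1f} GFLOP/s")
```

The same kernel gets one point per level because the FLOP count is fixed while the byte count differs at L1, L2, and HBM, which is what separates the hierarchical plot from the single HBM roofline.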