How to get DRAM throughput in Nsight Systems?

Hello, I have a question about DRAM throughput in Nsight Systems.
I found that Nsight Compute only has one value for each kernel: DRAM Throughput [%] (metric name: gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed).
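For reference, that single value can also be collected with the Nsight Compute CLI, roughly like this (./my_app is just a placeholder for the profiled application):

ncu --metrics gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed ./my_app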

But in Nsight Systems there are DRAM Read Bandwidth and DRAM Write Bandwidth rows, and for a single kernel (for example, one whose duration is 10 ms) there are many DRAM Read Bandwidth and DRAM Write Bandwidth sample values.

So I want to know how to get each kernel’s DRAM bandwidth in Nsight Systems.

@pkovalenko

Thank you for your patience. This can be accessed through a custom recipe that we’ll prepare and share in a few days.

gpu_metric_util_sum.tar.gz (2.7 KB)

This custom recipe calculates the average DRAM Read Throughput and DRAM Write Throughput values for each kernel. To use it, extract the archive and move the resulting directory to target-linux-x64/python/packages/nsys_recipe/recipes. The output will be a directory of files, and the calculated values can be found in stats.csv.
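For example, assuming Nsight Systems is installed under /opt/nvidia/nsight-systems/2024.5.1 and the archive extracts to a gpu_metric_util_sum directory, the install steps would look something like:

tar -xzf gpu_metric_util_sum.tar.gz
cp -r gpu_metric_util_sum /opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/recipes/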

Sorry, I can’t find stats.csv.
How can I get this file?
After I profile the code, I only get tmp.nsys-rep and tmp.sqlite
(CUDA_VISIBLE_DEVICES=0,1 nsys profile --gpu-metrics-devices=cuda-visible -o ${profile_file} --stats=true …)

[xx.xx@xxxx tests]$ nsys stats --report gpu_metric_util_sum ./tmp.sqlite
Processing [./tmp.sqlite] with [gpu_metric_util_sum]…
ERROR: Report 'gpu_metric_util_sum' could not be found.

To use a recipe, you should run nsys recipe:

nsys recipe gpu_metric_util_sum --input tmp.nsys-rep --output gpu_metric_util_sum

This will create a gpu_metric_util_sum directory. Inside it you’ll find stats.csv.
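If you just want a quick look at the per-kernel values in the terminal, something like this should work (assuming stats.csv is comma-separated):

column -s, -t < gpu_metric_util_sum/stats.csv | less -S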

You have already done your profiling; now use the recipe that Pavel supplied with the .nsys-rep file as your input. That will generate the .csv file.

I got the error below.

[xx.xx@xxxx tests]$ /opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/nsys recipe gpu_metric_util_sum --input tmp.nsys-rep --output gpu_metric_util_sum
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.9/concurrent/futures/process.py", line 243, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/lib64/python3.9/concurrent/futures/process.py", line 202, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/usr/lib64/python3.9/concurrent/futures/process.py", line 202, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/recipes/gpu_metric_util_sum/gpu_metric_util_sum.py", line 169, in _mapper_func
    stats_df = GpuMetricUtilSum.calculate_stats(range_df, gpu_metrics_df)
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/recipes/gpu_metric_util_sum/gpu_metric_util_sum.py", line 107, in calculate_stats
    means = overlap_df[
  File "/data/node07/xx.xx/envs/alpaca/lib64/python3.9/site-packages/pandas/core/frame.py", line 4108, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "/data/node07/xx.xx/envs/alpaca/lib64/python3.9/site-packages/pandas/core/indexes/base.py", line 6200, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/data/node07/xx.xx/envs/alpaca/lib64/python3.9/site-packages/pandas/core/indexes/base.py", line 6249, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['DRAM Read Throughput', 'DRAM Write Throughput'], dtype='object')] are in the [columns]"
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/main.py", line 133, in <module>
    main()
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/main.py", line 121, in main
    recipe, exit_code = run_recipe(remaining_args[0], remaining_args[1:])
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/log.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/main.py", line 57, in run_recipe
    recipe.run(context)
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/recipes/gpu_metric_util_sum/gpu_metric_util_sum.py", line 217, in run
    mapper_res = self.mapper_func(context)
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/log.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/recipes/gpu_metric_util_sum/gpu_metric_util_sum.py", line 182, in mapper_func
    context.map(
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/lib/recipe.py", line 237, in map
    return [*self._executor.map(partial_func, *iterables)]
  File "/usr/lib64/python3.9/concurrent/futures/process.py", line 559, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 600, in result_iterator
    yield fs.pop().result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 440, in result
    return self.__get_result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
KeyError: "None of [Index(['DRAM Read Throughput', 'DRAM Write Throughput'], dtype='object')] are in the [columns]"

Oh, I see. We have two name variations of similar metrics, DRAM Bandwidth and DRAM Throughput, and this recipe was made for the Throughput one. You can just search & replace DRAM Read Throughput with DRAM Read Bandwidth (and likewise for Write) and it should work.

INFO: memcpy-only_commsize64Mto13G_2GPU_outplace_inplace.nsys-rep: Exporting ['GPU_METRICS', 'TARGET_INFO_GPU_METRICS', 'CUPTI_ACTIVITY_KIND_KERNEL', 'StringIds'] to memcpy-only_commsize64Mto13G_2GPU_outplace_inplace_pqtdir…
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib64/python3.9/concurrent/futures/process.py", line 243, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/lib64/python3.9/concurrent/futures/process.py", line 202, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/usr/lib64/python3.9/concurrent/futures/process.py", line 202, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/recipes/gpu_metric_util_sum/gpu_metric_util_sum.py", line 170, in _mapper_func
    stats_df = GpuMetricUtilSum.calculate_stats(range_df, gpu_metrics_df)
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/recipes/gpu_metric_util_sum/gpu_metric_util_sum.py", line 107, in calculate_stats
    means = overlap_df[
  File "/data/node07/zao.yao/envs/alpaca/lib64/python3.9/site-packages/pandas/core/frame.py", line 4108, in __getitem__
    indexer = self.columns._get_indexer_strict(key, "columns")[1]
  File "/data/node07/zao.yao/envs/alpaca/lib64/python3.9/site-packages/pandas/core/indexes/base.py", line 6200, in _get_indexer_strict
    self._raise_if_missing(keyarr, indexer, axis_name)
  File "/data/node07/zao.yao/envs/alpaca/lib64/python3.9/site-packages/pandas/core/indexes/base.py", line 6249, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['DRAM Read Bandwidth', 'DRAM Write Bandwidth'], dtype='object')] are in the [columns]"
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/main.py", line 133, in <module>
    main()
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/main.py", line 121, in main
    recipe, exit_code = run_recipe(remaining_args[0], remaining_args[1:])
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/log.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/main.py", line 57, in run_recipe
    recipe.run(context)
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/recipes/gpu_metric_util_sum/gpu_metric_util_sum.py", line 218, in run
    mapper_res = self.mapper_func(context)
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/log.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/recipes/gpu_metric_util_sum/gpu_metric_util_sum.py", line 183, in mapper_func
    context.map(
  File "/opt/nvidia/nsight-systems/2024.5.1/target-linux-x64/python/packages/nsys_recipe/lib/recipe.py", line 237, in map
    return [*self._executor.map(partial_func, *iterables)]
  File "/usr/lib64/python3.9/concurrent/futures/process.py", line 559, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 600, in result_iterator
    yield fs.pop().result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 440, in result
    return self.__get_result()
  File "/usr/lib64/python3.9/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
KeyError: "None of [Index(['DRAM Read Bandwidth', 'DRAM Write Bandwidth'], dtype='object')] are in the [columns]"

My bad. Here are the correct names:

["DRAM Read Bandwidth [Throughput %]", "DRAM Write Bandwidth [Throughput %]"]

Keep in mind that the GPU metrics samples are device-wide, so 100% throughput for a given kernel in the output CSV means that this particular kernel alone accounted for 100% of the throughput only if no other kernels were running simultaneously in concurrent CUDA streams.
Also, the calculations are based on binary inclusion: any kernel that does not include at least one sampling point is excluded from the output.
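If many short kernels get dropped for that reason, one option is to raise the GPU metrics sampling rate at profile time (the default is on the order of 10 kHz, and higher rates add overhead); for example, with ./my_app again a placeholder for the profiled application:

nsys profile --gpu-metrics-devices=cuda-visible --gpu-metrics-frequency=100000 -o tmp ./my_app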