Calculating memory stalls due to zero-copy access (UVA)

utkrishtp · October 29, 2023, 8:01am

Hello All,

I have been using DGL to profile some GNN training experiments. In our training pipeline, kernels access data stored in CPU DRAM (UVA-mapped) via zero-copy over PCIe. For comparison, we also run with the same data resident in GPU memory.
In both cases, the kernels perform some operations on the data, basically creating a graph out of it.

We want to quantify the stalls and extra time caused by UVA access.

Are there any specific metrics in Nsight Compute or Nsight Systems (nsys) that can compare both cases and show that the UVA solution takes more time due to memory accesses over PCIe?

jmarusarz · November 2, 2023, 8:26pm

Nsight Systems is the tool that can give visibility into UVA memory. I’ll move this query into that forum.

hwilper · November 3, 2023, 3:45pm

@rknight, can you help with this one?

rknight · November 8, 2023, 5:42pm

Hi utkrishtp,

What GPU is being used in your profile?

I assume DGL stands for ‘device graph launch’. If not, can you define this acronym?

utkrishtp · November 14, 2023, 9:40am

Hello @rknight

DGL is the Deep Graph Library, a framework for GNN workloads.
GPUs being used: NVIDIA TITAN RTX and RTX A6000

rknight · November 15, 2023, 9:44pm

I’m not sure if this question makes sense, but are you accessing the data in both cases via a kernel running on the GPU? If so, is it the same kernel, and are you evaluating the efficiency of the two different methods?

utkrishtp · November 17, 2023, 3:09pm

Yes. We have two data structures: the graph, stored in CSC format, and a 2D feature matrix.
When both fit in GPU memory, the kernels read them from device memory, and when I profile with NCU I see that the kernels are memory-bound on device memory accesses.

For certain large-scale datasets (> 100 GB), we map both data structures as UVA and access them over PCIe.

Yes, it is the same kernel, and I want to evaluate the efficiency of both of the above methods.

Is there any specific metric that can show that the low compute utilization is due to the PCIe accesses incurred in one of the methods?

rknight · November 17, 2023, 9:49pm

I believe your GPUs support the Nsight Systems GPU Metrics feature. With GPU Metrics, you can measure the utilization of the PCIe bus and should be able to tell whether your workload is accessing CPU memory over the PCIe bus.
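
Once kernel durations for both cases and an estimate of the bytes read are available, a rough sanity check is to compare the achieved read rate of the zero-copy run against the peak of the PCIe link. The Python sketch below is only an illustration of that arithmetic, not something taken from Nsight Systems: the function names are hypothetical, the bandwidth constants are nominal theoretical peaks for PCIe 3.0 x16 and PCIe 4.0 x16 (the link actually negotiated depends on the host and should be checked), and the byte count and timings must come from your own measurements.

```python
# Back-of-envelope model for zero-copy (UVA) reads over PCIe.
# All names here are hypothetical helpers for illustration.

GB = 1e9  # decimal gigabyte, as used in bandwidth specifications

# Assumed nominal peak bandwidths in bytes/second, per direction.
# These are theoretical figures, not measured values.
PCIE_GEN3_X16 = 15.75 * GB
PCIE_GEN4_X16 = 31.5 * GB


def min_transfer_time(bytes_read: float, bandwidth: float) -> float:
    """Lower bound (seconds) on moving bytes_read at the given bandwidth."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    return bytes_read / bandwidth


def pcie_utilization(bytes_read: float, kernel_seconds: float,
                     pcie_bandwidth: float) -> float:
    """Achieved read rate of the kernel as a fraction of the assumed PCIe peak."""
    if kernel_seconds <= 0:
        raise ValueError("kernel_seconds must be positive")
    return (bytes_read / kernel_seconds) / pcie_bandwidth


def extra_time(uva_seconds: float, device_seconds: float) -> float:
    """Additional kernel time of the UVA run relative to the device-memory run."""
    return uva_seconds - device_seconds


if __name__ == "__main__":
    # Placeholder inputs: replace with measured values from your profiles.
    bytes_read = 8 * GB        # bytes the kernel reads from the mapped data
    uva_seconds = 0.6          # kernel duration with UVA-mapped data
    device_seconds = 0.02      # kernel duration with data in GPU memory

    print("PCIe lower bound (s):", min_transfer_time(bytes_read, PCIE_GEN3_X16))
    print("PCIe utilization:", pcie_utilization(bytes_read, uva_seconds, PCIE_GEN3_X16))
    print("Extra time (s):", extra_time(uva_seconds, device_seconds))
```

If the utilization fraction for the UVA run is close to 1, the kernel is most likely limited by the PCIe link rather than by compute. A much lower value suggests that access pattern or latency, not raw link bandwidth, is the limiting factor, which is worth cross-checking against the PCIe throughput shown by GPU Metrics.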
