Calculating memory stalls due to zero-copy access (UVA)

Hello All,

I have been using DGL to profile some GNN training experiments. In our training pipeline, we access data stored in CPU DRAM (UVA-mapped) via zero-copy over PCIe. For comparison, we also access the same data from GPU memory.
In both cases, the kernels perform some operations on the data, basically constructing a graph from it.

We want to quantify the stalls/extra time caused by the UVA accesses.

Are there any specific metrics in Nsight Compute or Nsight Systems (nsys) that can compare both cases and show that the UVA solution takes more time because of memory accesses over PCIe?

Nsight Systems is the tool that can give visibility into UVA memory. I’ll move this query into that forum.

@rknight, can you help with this one?

Hi utkrishtp,

What GPU is being used in your profile?

I assume DGL stands for ‘device graph launch’. If not, can you define this acronym?

Hello @rknight

DGL is the Deep Graph Library, a framework for GNN workloads.
GPUs being used: NVIDIA TITAN RTX and RTX A6000

I’m not sure if this question makes sense, but are you accessing the data in both cases via a kernel running on the GPU? If so, is it the same kernel, and are you evaluating the efficiency of the two different methods?

Yes, we have two data structures: a graph stored in CSC format and a 2D matrix of features.
When both of them fit in GPU memory, we access the data from HBM, and when I profile with NCU I see that the kernel is memory bound due to the HBM accesses.

For certain large-scale datasets (> 100G), we map both data structures as UVA and access them over PCIe.

Yes, it is the same kernel, and I want to evaluate the efficiency of both of the above methods.

Is there any specific metric that can show that the low compute utilization is due to the PCIe accesses incurred by one of the methods?

I believe your GPUs support the Nsight Systems GPU Metrics feature. With GPU Metrics, you can measure the utilization of the PCIe bus and should be able to tell whether your workload is accessing CPU memory over the PCIe bus.
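
Once you have the two kernel times and a PCIe throughput reading from GPU Metrics, a rough sanity check is to compare the UVA kernel time against the minimum time the PCIe link would need to deliver the same bytes. Below is a minimal Python sketch of that arithmetic. Everything in it is an assumption for illustration: the function names are hypothetical, the byte count and kernel times are made-up placeholders rather than measurements, and 16 GB/s is only an approximate one-direction peak for a PCIe 3.0 x16 link (a PCIe 4.0 x16 link is roughly twice that). Substitute the values from your own profiles.

```python
# Back-of-the-envelope check: is the UVA slowdown consistent with a PCIe limit?
# All numbers below are illustrative placeholders, not measurements.

def transfer_floor_s(bytes_moved: float, bandwidth_gb_s: float) -> float:
    """Minimum time (s) to move `bytes_moved` at `bandwidth_gb_s` (1 GB = 1e9 bytes)."""
    return bytes_moved / (bandwidth_gb_s * 1e9)


def link_utilization(bytes_moved: float, kernel_time_s: float, peak_gb_s: float) -> float:
    """Fraction of the assumed peak link bandwidth achieved during the kernel."""
    achieved_gb_s = bytes_moved / kernel_time_s / 1e9
    return achieved_gb_s / peak_gb_s


# Hypothetical inputs: bytes the kernel reads, and its duration in each setup.
bytes_moved = 8e9      # bytes read by the kernel (assumed)
t_gpu_s = 0.020        # kernel time with data in GPU memory (assumed)
t_uva_s = 0.640        # kernel time with data in UVA-mapped host memory (assumed)
pcie_peak_gb_s = 16.0  # approximate PCIe 3.0 x16 peak, one direction (assumed)

extra_s = t_uva_s - t_gpu_s                                  # extra time in the UVA case
slowdown = t_uva_s / t_gpu_s                                 # UVA time relative to GPU memory
floor_s = transfer_floor_s(bytes_moved, pcie_peak_gb_s)      # PCIe lower bound
util = link_utilization(bytes_moved, t_uva_s, pcie_peak_gb_s)

print(f"extra time in UVA case : {extra_s:.3f} s ({slowdown:.1f}x slower)")
print(f"PCIe transfer floor    : {floor_s:.3f} s")
print(f"implied PCIe utilization: {util:.1%}")
```

If the UVA kernel time is close to the transfer floor and the PCIe utilization reported by GPU Metrics is high for the duration of the kernel, that supports the conclusion that the extra time comes from PCIe accesses. If the utilization is low, the kernel is more likely limited by access latency or small transfer sizes than by raw link bandwidth. GPU Metrics sampling has to be enabled when collecting the profile; check `nsys profile --help` for the exact option name in your version.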