GB10 Hardware Baseline — First Direct Measurements
nvidia-uma-fault-probe v1.2.0 — built from community data
Three low-level probes that measure what profilers cannot on GB10:
uma_probe — cycle-accurate memory access latency
uma_atomic — NVLink-C2C coherence cost at atomic scope
uma_bw — raw LPDDR5X bandwidth under real workload
Data is essential. The community relies on it to build and
optimize model workflows, so visibility into what is happening
inside the hardware is just as important.
Having data to analyze from the community, along with feedback
on the tools, was essential — from reading and analyzing
sosreports, kernel logs, bug reports, and real field data.
Without that, it would all be speculation.
To my main contributor, @azampatti, who stayed with me through
it all — patient and committed — he had the hardware, I had
the data, and together we transformed it into actionable
insights.
Motivation was the Nsight Systems UVM profiling gap:
So on GB10:
Nsight UVM trace — unsupported
CUPTI UVM events — limited, scope issues
NVML memory clock — not exposed by driver (that’s why peak BW shows 0)
The tools measure from inside the kernel using inline PTX
instructions compiled natively by nvcc. No API callbacks,
no driver hooks, no profiler overhead.
%clock64 — hardware cycle counter read directly from inside
the executing kernel. Not a timer API call — the actual GPU
clock register.
ld.global.cv — cache-volatile load that bypasses L1 and L2,
forcing the access to go to the memory controller. The load
that triggers the measurement IS the measurement.
atom.global.gpu / atom.global.sys — atomic operations at GPU
scope and system scope. The difference in latency between the
two scopes is the coherence cost, measured cycle-accurate from
inside the kernel.
No CUPTI. No NVML. No profiler running alongside. The kernel
measures itself. This means no observer effect — the
measurement does not change what is being measured. The load
latency you see is the actual load latency the GPU experiences
during real workloads.
In the uma_bw output on GB10:
Peak : 0 GB/s theoretical
Note : HW_COHERENT_UMA: One LPDDR5X pool. NVLink-C2C.
Peak BW not reported (memory clock N/A on this platform).
The tool tries to derive peak bandwidth from the memory clock
via NVML. On GB10, the driver does not expose the memory clock
— cudaDeviceGetAttribute returns 0 for memory clock on GB10.
So rather than fabricate a number, the tool reports 0 and
explains why.
uma_probe — UMA Fault Latency Probe
GB10 SM 12.1 | CUDA 13.0 | Driver 580.142
=== UMA Fault Latency Probe v1.2.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Clock : 2418 MHz
COLD p50: 16.5 ns (40 cycles)
WARM p50: 16.5 ns (40 cycles)
COLD/WARM ratio: 1.00x
Platform : HARDWARE_COHERENT_UMA
Done.
Hardware coherence resolves page state before the load
instruction executes. No measurable first-touch penalty.
LPDDR5X baseline access latency: 16.5ns / 40 cycles.
uma_atomic — NVLink-C2C Coherence Probe
GB10 SM 12.1 | CUDA 13.0 | Driver 580.142
-– Idle (VLLM loaded, model not inferencing) —
=== UMA Atomic Coherence Probe v1.1.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Clock : 2418 MHz
Elements : 65536
Warmup : 3 runs Measure: 5 runs
Kernel : inline PTX atomics, nvcc native
PTX gpu : atom.global.gpu.add.u32
PTX sys : atom.global.sys.add.u32
GPU-scope pass (atom.global.gpu):
p50: 9.9 ns p90: 16.5 ns p99: 36.8 ns
SYS-scope pass (atom.global.sys):
p50: 9.9 ns p90: 17.8 ns p99: 35.2 ns
CONTENTION pass (sys-scope + CPU concurrent):
p50: 9.9 ns p90: 17.8 ns p99: 34.3 ns
=== Summary ===
GPU-scope p50 : 9.9 ns (24 cycles) [atom.global.gpu]
SYS-scope p50 : 9.9 ns (24 cycles) [atom.global.sys]
CONTENTION p50: 9.9 ns (24 cycles) [sys + CPU concurrent]
SYS/GPU ratio : 1.00x
Coherence cost: 0.0 ns overhead
Platform : HARDWARE_COHERENT_UMA
Done.
-– Under inference load —
=== UMA Atomic Coherence Probe v1.1.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Clock : 2418 MHz
Elements : 65536
Warmup : 3 runs Measure: 5 runs
Kernel : inline PTX atomics, nvcc native
PTX gpu : atom.global.gpu.add.u32
PTX sys : atom.global.sys.add.u32
GPU-scope pass (atom.global.gpu):
p50: 10.3 ns p90: 26.5 ns p99: 37.2 ns
SYS-scope pass (atom.global.sys):
p50: 10.3 ns p90: 26.9 ns p99: 37.2 ns
CONTENTION pass (sys-scope + CPU concurrent):
p50: 10.3 ns p90: 26.9 ns p99: 37.2 ns
=== Summary ===
GPU-scope p50 : 10.3 ns (25 cycles) [atom.global.gpu]
SYS-scope p50 : 10.3 ns (25 cycles) [atom.global.sys]
CONTENTION p50: 10.3 ns (25 cycles) [sys + CPU concurrent]
SYS/GPU ratio : 1.00x
Coherence cost: 0.0 ns overhead
Platform : HARDWARE_COHERENT_UMA
Done.
Median atomic latency stable under load (+0.4ns).
Tail latency increases (p90: 16.5 → 26.5 ns) but coherence
overhead remains zero. NVLink-C2C hardware coherence is
transparent at atomic level under both idle and active
inference conditions.
uma_bw — LPDDR5X Bandwidth Test
GB10 SM 12.1 | CUDA 13.0 | Driver 580.142
-– Idle (VLLM loaded, model not inferencing) —
=== UMA Bandwidth Test v2.0.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Peak : 0 GB/s theoretical
Buffer : 4 GB
Runs : 2 warmup + 5 measured
PTX read : ld.global.cg (L1 bypass)
PTX write: st.global.cs (L2 bypass, true DRAM)
Note : HW_COHERENT_UMA: One LPDDR5X pool. NVLink-C2C.
Peak BW not reported (memory clock N/A on this platform).
-– GPU (prefetched to GPU) —
GPU read : 161.31 GB/s stddev 2.82
GPU write : 116.15 GB/s stddev 0.48 [PTX .cs]
GPU copy : 164.45 GB/s [read+write]
-– CPU (prefetched to CPU) —
CPU read : 7.62 GB/s stddev 0.01
CPU write : 57.95 GB/s
-– Concurrent CPU + GPU —
GPU concurrent: 155.25 GB/s
CPU concurrent: 7.64 GB/s
Total : 162.89 GB/s
=== Summary ===
GPU read : 161.31 GB/s
GPU write : 116.15 GB/s [PTX .cs — true DRAM]
GPU copy : 164.45 GB/s
CPU read : 7.62 GB/s
CPU write : 57.95 GB/s
Conc total: 162.89 GB/s
Platform : HARDWARE_COHERENT_UMA
Done.
-– Under inference load —
=== UMA Bandwidth Test v2.0.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Peak : 0 GB/s theoretical
Buffer : 4 GB
Runs : 2 warmup + 5 measured
PTX read : ld.global.cg (L1 bypass)
PTX write: st.global.cs (L2 bypass, true DRAM)
Note : HW_COHERENT_UMA: One LPDDR5X pool. NVLink-C2C.
Peak BW not reported (memory clock N/A on this platform).
-– GPU (prefetched to GPU) —
GPU read : 90.49 GB/s stddev 13.58
GPU write : 62.82 GB/s stddev 8.43 [PTX .cs]
GPU copy : 83.47 GB/s [read+write]
-– CPU (prefetched to CPU) —
CPU read : 5.79 GB/s stddev 0.12
CPU write : 43.19 GB/s
-– Concurrent CPU + GPU —
GPU concurrent: 96.10 GB/s
CPU concurrent: 5.83 GB/s
Total : 101.93 GB/s
=== Summary ===
GPU read : 90.49 GB/s (-44% vs idle)
GPU write : 62.82 GB/s [PTX .cs — true DRAM]
GPU copy : 83.47 GB/s
CPU read : 5.79 GB/s
CPU write : 43.19 GB/s
Conc total: 101.93 GB/s
Platform : HARDWARE_COHERENT_UMA
Done.
The 44% GPU read bandwidth drop under inference load directly
explains tok/s variability. Memory bandwidth is the bottleneck —
not coherence, not latency.
Companion tools — see full README for more information:
sparkview (live telemetry, monitoring, and event logging)
nvidia-uma-fault-probe (PTX-based latency, bandwidth, and coherence measurement)
