GB10 Hardware Baseline — First Direct Measurements and Findings

parallelArchitect · April 25, 2026, 1:13pm

GB10 Hardware Baseline — First Direct Measurements
nvidia-uma-fault-probe v1.2.0 — built from community data

Three low-level probes that measure what profilers cannot on GB10:

uma_probe — cycle-accurate memory access latency
uma_atomic — NVLink-C2C coherence cost at atomic scope
uma_bw — raw LPDDR5X bandwidth under real workload

Data is essential. The community relies on it to build and
optimize model workflows, so visibility into what is happening
inside the hardware matters just as much.

Community data and feedback on the tools made this possible:
sosreports, kernel logs, bug reports, and real field data.
Without that, it would all be speculation.

Thanks to my main contributor, @azampatti, who stayed with me
through it all, patient and committed. He had the hardware, I
had the data, and together we turned it into actionable
insights.

The motivation was the Nsight Systems UVM profiling gap.

On GB10:
Nsight UVM trace — unsupported
CUPTI UVM events — limited, scope issues
NVML memory clock — not exposed by driver (that’s why peak BW shows 0)

The tools measure from inside the kernel using inline PTX
instructions compiled natively by nvcc. No API callbacks,
no driver hooks, no profiler overhead.

%clock64 — hardware cycle counter read directly from inside
the executing kernel. Not a timer API call — the actual GPU
clock register.

ld.global.cv — cache-volatile load that bypasses L1 and L2,
forcing the access to go to the memory controller. The load
that triggers the measurement IS the measurement.

atom.global.gpu / atom.global.sys — atomic operations at GPU
scope and system scope. The difference in latency between the
two scopes is the coherence cost, measured cycle-accurate from
inside the kernel.

No CUPTI. No NVML. No profiler running alongside. The kernel
measures itself, so no profiler or driver hook perturbs the
measurement. The only added work is the %clock64 reads around
the load, and the latency reported is the latency the load
sees inside the running kernel.

In the uma_bw output on GB10:

Peak : 0 GB/s theoretical
Note : HW_COHERENT_UMA: One LPDDR5X pool. NVLink-C2C.
Peak BW not reported (memory clock N/A on this platform).

The tool tries to derive peak bandwidth from the memory clock.
On GB10 the driver does not expose it: NVML reports no memory
clock, and cudaDeviceGetAttribute returns 0 for the memory clock
rate. Rather than fabricate a number, the tool reports 0 and
explains why.
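
For readers who want to see what "derive peak bandwidth from the memory clock" usually means, here is a minimal sketch. It is not the tool's source. The function name, the double-data-rate factor of 2, and the example inputs (a 256-bit bus and a 4266 MHz clock) are assumptions for illustration, not values read from a GB10.

```python
from typing import Optional


def peak_bandwidth_gbs(mem_clock_khz: int, bus_width_bits: int,
                       transfers_per_clock: int = 2) -> Optional[float]:
    """Theoretical peak in GB/s, or None when the inputs are not exposed.

    transfers_per_clock=2 assumes double-data-rate signalling (assumption).
    """
    if mem_clock_khz <= 0 or bus_width_bits <= 0:
        # The driver reported nothing usable: refuse to invent a number.
        return None
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = mem_clock_khz * 1_000 * transfers_per_clock
    return transfers_per_second * bytes_per_transfer / 1e9


# Memory clock reported as 0, as described above for GB10.
print(peak_bandwidth_gbs(0, 256))

# Hypothetical inputs, only to show the arithmetic.
print(peak_bandwidth_gbs(4_266_000, 256))
```

Returning None rather than 0.0 keeps "unknown" distinct from "measured zero", which is the same intent as the note the tool prints.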

uma_probe — UMA Fault Latency Probe
GB10 SM 12.1 | CUDA 13.0 | Driver 580.142

=== UMA Fault Latency Probe v1.2.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Clock : 2418 MHz

COLD p50: 16.5 ns (40 cycles)
WARM p50: 16.5 ns (40 cycles)
COLD/WARM ratio: 1.00x
Platform : HARDWARE_COHERENT_UMA
Done.

Hardware coherence resolves page state before the load
instruction executes. No measurable first-touch penalty.
LPDDR5X baseline access latency: 16.5ns / 40 cycles.
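
The nanosecond figures in these outputs follow from the cycle counts and the reported 2418 MHz clock. A minimal sketch of that conversion (the function name is illustrative, not taken from the tool source):

```python
def cycles_to_ns(cycles: int, clock_mhz: float) -> float:
    """Convert a GPU cycle count to nanoseconds at a given clock."""
    # 1 MHz = 1e6 cycles/s, so one cycle lasts 1000 / clock_mhz nanoseconds.
    return cycles * 1_000.0 / clock_mhz


# 40 cycles is the uma_probe p50; 24 and 25 are the uma_atomic p50 values.
for cycles in (40, 24, 25):
    print(f"{cycles} cycles -> {cycles_to_ns(cycles, 2418):.1f} ns")
# 40 cycles -> 16.5 ns
# 24 cycles -> 9.9 ns
# 25 cycles -> 10.3 ns
```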

uma_atomic — NVLink-C2C Coherence Probe
GB10 SM 12.1 | CUDA 13.0 | Driver 580.142

--- Idle (vLLM loaded, model not inferencing) ---

=== UMA Atomic Coherence Probe v1.1.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Clock : 2418 MHz
Elements : 65536
Warmup : 3 runs Measure: 5 runs
Kernel : inline PTX atomics, nvcc native
PTX gpu : atom.global.gpu.add.u32
PTX sys : atom.global.sys.add.u32

GPU-scope pass (atom.global.gpu):
p50: 9.9 ns p90: 16.5 ns p99: 36.8 ns
SYS-scope pass (atom.global.sys):
p50: 9.9 ns p90: 17.8 ns p99: 35.2 ns
CONTENTION pass (sys-scope + CPU concurrent):
p50: 9.9 ns p90: 17.8 ns p99: 34.3 ns

=== Summary ===
GPU-scope p50 : 9.9 ns (24 cycles) [atom.global.gpu]
SYS-scope p50 : 9.9 ns (24 cycles) [atom.global.sys]
CONTENTION p50: 9.9 ns (24 cycles) [sys + CPU concurrent]
SYS/GPU ratio : 1.00x
Coherence cost: 0.0 ns overhead
Platform : HARDWARE_COHERENT_UMA
Done.

--- Under inference load ---

=== UMA Atomic Coherence Probe v1.1.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Clock : 2418 MHz
Elements : 65536
Warmup : 3 runs Measure: 5 runs
Kernel : inline PTX atomics, nvcc native
PTX gpu : atom.global.gpu.add.u32
PTX sys : atom.global.sys.add.u32

GPU-scope pass (atom.global.gpu):
p50: 10.3 ns p90: 26.5 ns p99: 37.2 ns
SYS-scope pass (atom.global.sys):
p50: 10.3 ns p90: 26.9 ns p99: 37.2 ns
CONTENTION pass (sys-scope + CPU concurrent):
p50: 10.3 ns p90: 26.9 ns p99: 37.2 ns

=== Summary ===
GPU-scope p50 : 10.3 ns (25 cycles) [atom.global.gpu]
SYS-scope p50 : 10.3 ns (25 cycles) [atom.global.sys]
CONTENTION p50: 10.3 ns (25 cycles) [sys + CPU concurrent]
SYS/GPU ratio : 1.00x
Coherence cost: 0.0 ns overhead
Platform : HARDWARE_COHERENT_UMA
Done.

Median atomic latency stable under load (+0.4ns).
Tail latency increases (p90: 16.5 → 26.5 ns) but coherence
overhead remains zero. NVLink-C2C hardware coherence is
transparent at atomic level under both idle and active
inference conditions.
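
As a rough illustration of how the summary lines can be derived from raw samples, the sketch below computes percentiles, the SYS/GPU ratio, and the coherence cost. The sample lists are made up, and the nearest-rank percentile method is an assumption; the thread does not say which method uma_atomic uses.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def coherence_summary(gpu_ns, sys_ns):
    """Compare GPU-scope and SYS-scope medians."""
    gpu_p50 = percentile(gpu_ns, 50)
    sys_p50 = percentile(sys_ns, 50)
    return {
        "gpu_p50": gpu_p50,
        "sys_p50": sys_p50,
        "ratio": sys_p50 / gpu_p50,      # 1.00x means no extra cost
        "cost_ns": sys_p50 - gpu_p50,    # coherence overhead in ns
    }


# Synthetic samples, shaped loosely like the idle run above.
gpu_samples = [9.9, 9.9, 9.9, 10.3, 16.5, 36.8]
sys_samples = [9.9, 9.9, 9.9, 10.3, 17.8, 35.2]

summary = coherence_summary(gpu_samples, sys_samples)
print(f"SYS/GPU ratio : {summary['ratio']:.2f}x")
print(f"Coherence cost: {summary['cost_ns']:.1f} ns overhead")
```

Comparing medians rather than means keeps the tail growth under load (p90, p99) from being mistaken for a coherence cost.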

uma_bw — LPDDR5X Bandwidth Test
GB10 SM 12.1 | CUDA 13.0 | Driver 580.142

--- Idle (vLLM loaded, model not inferencing) ---

=== UMA Bandwidth Test v2.0.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Peak : 0 GB/s theoretical
Buffer : 4 GB
Runs : 2 warmup + 5 measured
PTX read : ld.global.cg (L1 bypass)
PTX write: st.global.cs (L2 bypass, true DRAM)
Note : HW_COHERENT_UMA: One LPDDR5X pool. NVLink-C2C.
Peak BW not reported (memory clock N/A on this platform).

--- GPU (prefetched to GPU) ---
GPU read : 161.31 GB/s stddev 2.82
GPU write : 116.15 GB/s stddev 0.48 [PTX .cs]
GPU copy : 164.45 GB/s [read+write]

--- CPU (prefetched to CPU) ---
CPU read : 7.62 GB/s stddev 0.01
CPU write : 57.95 GB/s

--- Concurrent CPU + GPU ---
GPU concurrent: 155.25 GB/s
CPU concurrent: 7.64 GB/s
Total : 162.89 GB/s

=== Summary ===
GPU read : 161.31 GB/s
GPU write : 116.15 GB/s [PTX .cs — true DRAM]
GPU copy : 164.45 GB/s
CPU read : 7.62 GB/s
CPU write : 57.95 GB/s
Conc total: 162.89 GB/s
Platform : HARDWARE_COHERENT_UMA
Done.

--- Under inference load ---

=== UMA Bandwidth Test v2.0.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Peak : 0 GB/s theoretical
Buffer : 4 GB
Runs : 2 warmup + 5 measured
PTX read : ld.global.cg (L1 bypass)
PTX write: st.global.cs (L2 bypass, true DRAM)
Note : HW_COHERENT_UMA: One LPDDR5X pool. NVLink-C2C.
Peak BW not reported (memory clock N/A on this platform).

--- GPU (prefetched to GPU) ---
GPU read : 90.49 GB/s stddev 13.58
GPU write : 62.82 GB/s stddev 8.43 [PTX .cs]
GPU copy : 83.47 GB/s [read+write]

--- CPU (prefetched to CPU) ---
CPU read : 5.79 GB/s stddev 0.12
CPU write : 43.19 GB/s

--- Concurrent CPU + GPU ---
GPU concurrent: 96.10 GB/s
CPU concurrent: 5.83 GB/s
Total : 101.93 GB/s

=== Summary ===
GPU read : 90.49 GB/s (-44% vs idle)
GPU write : 62.82 GB/s [PTX .cs — true DRAM]
GPU copy : 83.47 GB/s
CPU read : 5.79 GB/s
CPU write : 43.19 GB/s
Conc total: 101.93 GB/s
Platform : HARDWARE_COHERENT_UMA
Done.

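GPU read bandwidth available to the probe drops 44% under
inference load (161.31 to 90.49 GB/s), while atomic latency and
coherence cost barely move. Memory bandwidth is the bottleneck,
not coherence or latency, which is consistent with the tok/s
variability seen in the field.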
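
The idle versus loaded comparison can be reproduced from the two summaries above. This is a small sketch using the posted numbers; the helper name is illustrative.

```python
def pct_change(idle: float, loaded: float) -> float:
    """Relative change from the idle value to the loaded value, in percent."""
    return (loaded - idle) / idle * 100.0


# GB/s values copied from the two uma_bw summaries above.
idle = {"GPU read": 161.31, "GPU write": 116.15, "GPU copy": 164.45,
        "CPU read": 7.62, "CPU write": 57.95}
load = {"GPU read": 90.49, "GPU write": 62.82, "GPU copy": 83.47,
        "CPU read": 5.79, "CPU write": 43.19}

for name in idle:
    print(f"{name:<10}: {pct_change(idle[name], load[name]):+.1f}% vs idle")
```

Every row drops, not only GPU read, which fits a shared memory pool in which the probe and the inference workload compete for the same bandwidth.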

Companion tools — see full README for more information:

sparkview (live telemetry, monitoring, and event logging)

nvidia-uma-fault-probe (PTX-based latency, bandwidth, and coherence measurement)

azampatti · April 25, 2026, 2:36pm

It was very interesting to read this and explore it with you.

It was eye-opening that, right after this, I could calculate the approximate tok/sec I would get for each model before even loading it, based on real numbers!

Thanks!

-Aldo

pontostroy · April 25, 2026, 4:06pm

I got strange results:

./uma_bw
=== UMA Bandwidth Test v2.0.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Peak : 0 GB/s theoretical
Buffer : 4 GB
Runs : 2 warmup + 5 measured
PTX read : ld.global.cg (L1 bypass)
PTX write: st.global.cs (L2 bypass, true DRAM)
Note : HW_COHERENT_UMA: One LPDDR5X pool. NVLink-C2C. Peak BW not reported (memory clock N/A on this platform).

Initializing... done

--- GPU (prefetched to GPU) ---
GPU read : 4587571.38 GB/s stddev 870421.22
GPU write : 5248257.32 GB/s stddev 172073.91 [PTX .cs]
GPU copy... 51622203.66 GB/s [read+write]

--- CPU (prefetched to CPU) ---
CPU read : 7.62 GB/s stddev 0.00
CPU write : 62.81 GB/s

--- Concurrent CPU + GPU ---
measuring...
GPU concurrent: 2281368.36 GB/s
CPU concurrent: 7.72 GB/s
Total : 2281376.08 GB/s

=== Summary ===
GPU read : 4587571.38 GB/s ( 0.0% of 0 GB/s peak)
GPU write : 5248257.32 GB/s [PTX .cs — true DRAM]
GPU copy : 51622203.66 GB/s
CPU read : 7.62 GB/s
CPU write : 62.81 GB/s
Conc total: 2281376.08 GB/s

Platform : HARDWARE_COHERENT_UMA
JSON : uma_bw_results.json
Done.

parallelArchitect · April 25, 2026, 7:29pm

@pontostroy Can you share:

nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
nvcc --version

Also run:

./uma_probe
./uma_atomic

And share those outputs.

azampatti got clean results on GB10 with driver 580.142, CUDA 13.0 (161 GB/s idle, 90 GB/s under load).

Also, did you run sparkview alongside the tools? It monitors clock state and thermals during the run and helps identify whether the system was throttled or in an unusual state. GitHub: parallelArchitect/sparkview, an operator-grade GPU monitor for NVIDIA GPUs with native GB10 / DGX Spark coherent UMA support (PSI pressure, clock detection, ConnectX-7 network layer).

pontostroy · April 26, 2026, 6:47am

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_Dec_16_07:27:17_PM_PST_2025
Cuda compilation tools, release 13.1, V13.1.115
Build cuda_13.1.r13.1/compiler.37061995_0

With the CUDA 13.0 nvcc:
/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 probe_launcher.cu -o uma_probe -lcudart -lcuda -lpthread
/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 -arch=sm_90 uma_atomic_test.cu -o uma_atomic -lcudart -lcuda -lpthread
/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 uma_bandwidth_test.cu -o uma_bw -lcudart -lcuda -lpthread
everything works now:

=== UMA Bandwidth Test v2.0.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Peak : 0 GB/s theoretical
Buffer : 4 GB
Runs : 2 warmup + 5 measured
PTX read : ld.global.cg (L1 bypass)
PTX write: st.global.cs (L2 bypass, true DRAM)
Note : HW_COHERENT_UMA: One LPDDR5X pool. NVLink-C2C. Peak BW not reported (memory clock N/A on this platform).

Initializing... done

--- GPU (prefetched to GPU) ---
GPU read : 165.68 GB/s stddev 2.41
GPU write : 115.89 GB/s stddev 1.30 [PTX .cs]
GPU copy... 167.27 GB/s [read+write]

--- CPU (prefetched to CPU) ---
CPU read : 7.63 GB/s stddev 0.00
CPU write : 64.97 GB/s

--- Concurrent CPU + GPU ---
measuring...
GPU concurrent: 160.56 GB/s
CPU concurrent: 7.73 GB/s
Total : 168.29 GB/s

=== Summary ===
GPU read : 165.68 GB/s ( 0.0% of 0 GB/s peak)
GPU write : 115.89 GB/s [PTX .cs — true DRAM]
GPU copy : 167.27 GB/s
CPU read : 7.63 GB/s
CPU write : 64.97 GB/s
Conc total: 168.29 GB/s

parallelArchitect · April 26, 2026, 8:22am

@pontostroy Thanks — clean uma_bw results. Still need:

nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
./uma_probe
./uma_atomic

Also, the CUDA 13.1 vs 13.0 difference is worth noting.

Could you confirm the driver version you’re running?

This helps document the CUDA 13.1 event timing issue for other GB10 users who may run into the same behavior.

parallelArchitect · April 27, 2026, 6:24am

Update — CUPTI UVM Event Collection: GB10 Confirmed

Following up on the diagnostic gaps noted in the
original post.

cupti-uma-probe has now been tested on two
independent GB10 machines, both CUDA 13.0,
Driver 580.142:

[4] cuptiActivityRegisterCallbacks: SUCCESS
[5] cuptiActivityEnable(UNIFIED_MEMORY_COUNTER): FAILED
CUPTI_ERROR_NOT_READY

Total CUPTI records : 0
UVM activity events : 0
STATUS: CUPTI_UMA_ENABLE_FAILED

This confirms three diagnostic gaps on GB10:

Nsight UVM trace — unsupported
CUPTI UVM events — CUPTI_ERROR_NOT_READY at
API level on both machines tested
NVML memory clock — nvidia-smi returns N/A
(root cause of Peak BW showing 0 in uma_bw output)

Note: GB10 (SM 12.1) and hardware-coherent UMA
configurations are not explicitly covered in the
CUPTI UVM activity documentation. The expected
behavior for this platform is unclear.

Full probe source and results:

pontostroy · April 27, 2026, 8:52am

nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
580.142, NVIDIA GB10

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2026 NVIDIA Corporation
Built on Thu_Mar_19_11:11:41_PM_PDT_2026
Cuda compilation tools, release 13.2, V13.2.78
Build cuda_13.2.r13.2/compiler.37668154_0

=== uma_probe ===
=== UMA Fault Latency Probe v1.2.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Clock : 2418 MHz
Buffer : 64 MB (16777216 elements)
Kernel : ld.global.cv + clock64 (inline PTX, nvcc native)
Note : HW_COHERENT_UMA: One physical pool. Hardware coherence active. Hardware coherence active.

COLD pass (CPU->GPU fault):
touching pages from CPU... done
running kernel... done
p50: 0.0 ns p90: 0.0 ns p99: 0.0 ns

WARM pass (GPU resident):
prefetching to GPU... done
running kernel... done
p50: 0.0 ns p90: 0.0 ns p99: 0.0 ns

PRESSURE pass (thrash):
mixed CPU/GPU residency... done
running kernel... done
p50: 0.0 ns p90: 0.0 ns p99: 0.0 ns

=== Summary ===
COLD p50: 0.0 ns ( 0 cycles)
WARM p50: 0.0 ns ( 0 cycles)
PRESS p50: 0.0 ns ( 0 cycles)
COLD/WARM ratio: 0.00x

Platform : HARDWARE_COHERENT_UMA
JSON : uma_probe_results.json
Done.
Cooling down (10s)...

=== uma_atomic ===
=== UMA Atomic Coherence Probe v1.1.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Clock : 2418 MHz
Elements : 65536
Warmup : 3 runs Measure: 5 runs
Kernel : inline PTX atomics, nvcc native
PTX gpu : atom.global.gpu.add.u32
PTX sys : atom.global.sys.add.u32

GPU-scope pass (atom.global.gpu):
p50: 0.0 ns p90: 0.0 ns p99: 0.0 ns

SYS-scope pass (atom.global.sys):
p50: 0.0 ns p90: 0.0 ns p99: 0.0 ns

CONTENTION pass (sys-scope + CPU concurrent):
p50: 0.0 ns p90: 0.0 ns p99: 0.0 ns

=== Summary ===
GPU-scope p50 : 0.0 ns ( 0 cycles) [atom.global.gpu]
SYS-scope p50 : 0.0 ns ( 0 cycles) [atom.global.sys]
CONTENTION p50: 0.0 ns ( 0 cycles) [sys + CPU concurrent]
SYS/GPU ratio : 0.00x
Coherence cost: 0.0 ns overhead

Platform : HARDWARE_COHERENT_UMA
JSON : uma_atomic_results.json
Done.
Cooling down (10s)...

=== uma_bw ===
=== UMA Bandwidth Test v2.0.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Peak : 0 GB/s theoretical
Buffer : 4 GB
Runs : 2 warmup + 5 measured
PTX read : ld.global.cg (L1 bypass)
PTX write: st.global.cs (L2 bypass, true DRAM)
Note : HW_COHERENT_UMA: One LPDDR5X pool. NVLink-C2C. Peak BW not reported (memory clock N/A on this platform).

Initializing... done

--- GPU (prefetched to GPU) ---
GPU read : 4664689.20 GB/s stddev 1206403.58
GPU write : 5205627.42 GB/s stddev 446497.65 [PTX .cs]
GPU copy... 53687092.56 GB/s [read+write]

--- CPU (prefetched to CPU) ---
CPU read : 7.64 GB/s stddev 0.00
CPU write : 66.29 GB/s

--- Concurrent CPU + GPU ---
measuring...
GPU concurrent: 2387721.23 GB/s
CPU concurrent: 7.73 GB/s
Total : 2387728.95 GB/s

=== Summary ===
GPU read : 4664689.20 GB/s ( 0.0% of 0 GB/s peak)
GPU write : 5205627.42 GB/s [PTX .cs — true DRAM]
GPU copy : 53687092.56 GB/s
CPU read : 7.64 GB/s
CPU write : 66.29 GB/s
Conc total: 2387728.95 GB/s

Platform : HARDWARE_COHERENT_UMA
JSON : uma_bw_results.json
Done.
Cooling down (30s)...

parallelArchitect · April 27, 2026, 10:36am

This confirms it across all three versions you tested.

Driver: 580.142 (all runs)

CUDA 13.0  — %clock64 correct, all probes valid     ✓
CUDA 13.1  — GPU timing broken, overflow results    ✗
CUDA 13.2  — %clock64 returns 0, uma_bw overflows   ✗

CPU read/write numbers are correct on all three versions
because CPU timing uses CLOCK_MONOTONIC (Linux wall clock)
— not %clock64. The failure is specific to PTX %clock64
compilation for SM 12.1 on CUDA 13.1 and 13.2.
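
Since the broken builds produce either zero cycle counts or impossible bandwidth, a cheap plausibility check catches both before anyone trusts the numbers. A minimal sketch: the function name and the 1000 GB/s ceiling are assumptions, chosen only to sit well above anything this platform could deliver.

```python
def check_gpu_result(cycles_p50: int, bandwidth_gbs: float,
                     ceiling_gbs: float = 1000.0) -> list:
    """Return a list of reasons the GPU-side numbers look invalid."""
    problems = []
    if cycles_p50 == 0:
        # A timed load or atomic cannot complete in zero cycles.
        problems.append("cycle delta is zero: GPU timer likely not working")
    if bandwidth_gbs > ceiling_gbs:
        # Far beyond any plausible memory bandwidth: elapsed time near zero.
        problems.append("bandwidth above ceiling: elapsed time likely wrong")
    return problems


# CUDA 13.0 build: 40 cycles, 161.31 GB/s.
print(check_gpu_result(40, 161.31))

# CUDA 13.2 build: 0 cycles, 4664689.20 GB/s.
for reason in check_gpu_result(0, 4664689.20):
    print("INVALID:", reason)
```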

Build requirement: CUDA 13.0 only.

/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 \
  probe_launcher.cu -o uma_probe -lcudart -lcuda -lpthread

/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 \
  uma_atomic_test.cu -o uma_atomic -lcudart -lcuda -lpthread

/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 \
  uma_bandwidth_test.cu -o uma_bw -lcudart -lcuda -lpthread
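
To avoid building with the wrong toolkit by accident, the release can be read from the `nvcc --version` text before compiling. This is a sketch with illustrative helper names; the sample strings are the version lines posted earlier in this thread, and the "known good" set reflects only what has been tested here.

```python
import re

# Releases confirmed to give valid GPU timing in this thread.
KNOWN_GOOD = {(13, 0)}


def nvcc_release(version_text: str):
    """Extract (major, minor) from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_known_good(version_text: str) -> bool:
    """True only when the release is in the tested-good set."""
    return nvcc_release(version_text) in KNOWN_GOOD


samples = [
    "Cuda compilation tools, release 13.1, V13.1.115",
    "Cuda compilation tools, release 13.2, V13.2.78",
]
for line in samples:
    print(nvcc_release(line), "ok" if is_known_good(line) else "do not use")
```

In practice the text would come from running the nvcc binary on the build machine, for example with subprocess.run and capture_output=True.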

Thank you for running all three versions — this is exactly
the systematic data the project needed to confirm the
CUDA version boundary on GB10.

parallelArchitect · April 30, 2026, 10:47am

Update — CUPTI Activity on GB10: post-7 finding revised

Post 7 confirmed CUPTI_ERROR_NOT_READY on UNIFIED_MEMORY_COUNTER and noted the broader CUPTI scope on GB10 was unclear.

@dustin1925 reviewed that output and built cupti_kind_sweep to answer the open question — is this a broad CUPTI failure or specific to UVM counters?

The finding: CUPTI Activity is largely functional on GB10. Three kinds are not usable on this platform:

UNIFIED_MEMORY_COUNTER — CUPTI_ERROR_NOT_READY
CONCURRENT_KERNEL — CUPTI_ERROR_NOT_COMPATIBLE
INSTRUCTION_EXECUTION — CUPTI_ERROR_LEGACY_PROFILER_NOT_SUPPORTED

All other kinds collect records normally. @azampatti validated independently on a second GB10 unit. Same kind map confirmed on both (CUDA 13.0, driver 580.142).

Kind map — two independent GB10 units:

KIND	dustin1925	azampatti
KERNEL	OK (1)	OK (1)
MEMCPY	OK (2)	OK (2)
RUNTIME	OK (7)	OK (9)
DRIVER	OK (4)	OK (3)
OVERHEAD	OK (11)	OK (14)
SYNCHRONIZATION	OK (1)	OK (2)
MEMORY2	OK (2)	OK (4)
NVLINK	OK (0)*	OK (0)*
UNIFIED_MEMORY_COUNTER	FAILED	SKIPPED
CONCURRENT_KERNEL	FAILED	SKIPPED
INSTRUCTION_EXECUTION	FAILED	SKIPPED

*NVLINK records 0 on synthetic workload — requires validation under real inference load.
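
The sweep logic behind a kind map like this can be sketched as an enable-or-skip loop. This is a Python stand-in, not the CUPTI C API and not the cupti_kind_sweep source: `fake_enable` is a hypothetical placeholder for the real enable call, filled with the statuses reported in this post.

```python
from typing import Callable, Dict, Iterable

KINDS = ["KERNEL", "MEMCPY", "RUNTIME", "DRIVER", "OVERHEAD",
         "SYNCHRONIZATION", "MEMORY2", "NVLINK",
         "UNIFIED_MEMORY_COUNTER", "CONCURRENT_KERNEL",
         "INSTRUCTION_EXECUTION"]

# Failure statuses as reported in this thread for GB10.
REPORTED_FAILURES = {
    "UNIFIED_MEMORY_COUNTER": "CUPTI_ERROR_NOT_READY",
    "CONCURRENT_KERNEL": "CUPTI_ERROR_NOT_COMPATIBLE",
    "INSTRUCTION_EXECUTION": "CUPTI_ERROR_LEGACY_PROFILER_NOT_SUPPORTED",
}


def fake_enable(kind: str) -> str:
    """Hypothetical stand-in for the real per-kind enable call."""
    return REPORTED_FAILURES.get(kind, "SUCCESS")


def sweep_kinds(kinds: Iterable[str],
                try_enable: Callable[[str], str]) -> Dict[str, list]:
    """Try each kind once; keep the ones that enable, record the rest."""
    result = {"enabled": [], "skipped": []}
    for kind in kinds:
        status = try_enable(kind)
        if status == "SUCCESS":
            result["enabled"].append(kind)
        else:
            result["skipped"].append((kind, status))
    return result


kind_map = sweep_kinds(KINDS, fake_enable)
print("enabled:", kind_map["enabled"])
for kind, status in kind_map["skipped"]:
    print(f"skipped: {kind} ({status})")
```

Probing at runtime instead of hard-coding the list means a future driver that starts supporting a kind is picked up without a code change.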

From dustin1925’s sweep we built cupti_collector — a GB10-aware CUPTI Activity collector that automatically enables supported kinds and skips unsupported ones at runtime. Now a standalone library:

The correct statement is no longer “CUPTI is limited on GB10.” It is:

CUPTI Activity works on GB10 for execution tracing. UNIFIED_MEMORY_COUNTER is unavailable on this platform.

Two independent measurements are consistent with this:

uma_atomic — SYS/GPU ratio 1.00x. No measurable additional cost for SYS-scope vs GPU-scope atomics, consistent with hardware coherence operating without software-mediated migration.

uma_bw contention sweep — 2.2% GPU throughput drop under cpu-write+gpu-read. On discrete GPUs this pattern typically incurs much larger drops due to UVM-managed movement. On GB10 the impact is minimal under the same access pattern.

Together these results are consistent with a model where coherence and data movement are handled in hardware rather than via fault-driven migration. In that case the UVM fault events that UNIFIED_MEMORY_COUNTER is designed to report are not produced, so the counter has no events to emit. This is a structural limitation of that counter on hardware-coherent UMA systems, not a general CUPTI failure.

All of this work has been built and validated through community runs without direct GB10 access. Direct hardware access would enable controlled experiments, faster iteration, and full validation of edge cases.

Additional tools are in progress, including continuous hardware performance monitoring for long-run stability and drift analysis.

mashie · April 30, 2026, 2:04pm

I have no idea what half of this means but I love reading it. And if the result gives us the ability to monitor anything under the sun on this platform, rock on!

azampatti · April 30, 2026, 3:06pm

I’m pretty much in the same boat as you in terms of knowledge, but I learnt a few things from him while helping.

Long story short, he’s measuring the capabilities of the GB10 directly on the hardware, skipping all the layers in the middle that might obscure or slow things down.

For instance, take the effective speed at which the GPU can read data from memory: he measured 166GB/s on my GB10, not the theoretical ~270GB/s. And when you run the math with that number, it explains a lot about the tok/s I’m getting from different models.

Basically, measuring directly from the hardware will help developers understand the GB10 better and troubleshoot/fix things in a more proper way :)
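
The "math" mentioned above is roughly the following. It is a back-of-the-envelope sketch, assuming decoding is purely bandwidth-bound and every weight is read once per generated token. The 8B-parameter, 4-bit model is a hypothetical example, not a model from this thread.

```python
def est_tokens_per_sec(bandwidth_gbs: float, params_billion: float,
                       bytes_per_param: float) -> float:
    """Upper-bound tok/s if each token requires one full pass over the weights."""
    weight_gb = params_billion * bytes_per_param  # billions of params * bytes = GB
    return bandwidth_gbs / weight_gb


# Hypothetical model: 8B parameters at 4 bits (0.5 bytes) per parameter.
measured = est_tokens_per_sec(166, 8, 0.5)      # measured GPU read bandwidth
theoretical = est_tokens_per_sec(270, 8, 0.5)   # approximate theoretical peak

print(f"measured BW    : ~{measured:.1f} tok/s ceiling")
print(f"theoretical BW : ~{theoretical:.1f} tok/s ceiling")
```

Real throughput lands below this ceiling because KV cache reads, activations, and kernel launch overhead are ignored here.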

(sorry @parallelArchitect if I butchered any of the technical bits here) :)

parallelArchitect · April 30, 2026, 7:35pm

@mashie — that’s exactly the right way to think about it. You don’t need to understand every measurement to benefit from it. When something goes wrong on your unit, the data exists to explain why.

@azampatti — you nailed it. That’s a perfect explanation.

This is exactly what the tooling is working toward — giving the community visibility into what their hardware is actually doing, whether it’s running a 122B model or sitting at idle. The gap right now is that the standard monitoring tools don’t expose the signals that matter on GB10. nvidia-smi can’t report memory clock. The vendor profiling stack has limited view into unified memory behavior on this platform — CUPTI UVM fault counters are structurally absent on hardware-coherent UMA, though the broader CUPTI Activity layer works and we now have a GB10-aware collector that uses the confirmed working kinds.

So we measure from the hardware directly — bandwidth, contention, atomic coherence, thermal response, power domains — and make that data available to anyone running a Spark. The goal is that when your model runs slower than expected, or the system throttles, or you hit OOM on a system that shows 90GB free, there’s a tool that tells you exactly what the hardware is doing and why.

The other piece is persistent logging. On GB10 there is no BMC, no out-of-band crash capture. When a hard reset happens the logs stop abruptly and whatever happened before the failure is gone. The tools log signals continuously so that if a failure does occur, the thermal state, power domain readings, PSI pressure, and clock behavior leading up to it are already on disk. The failure is not lost.
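
The key property of that kind of logging is that each sample reaches the disk before the next one is taken. A minimal sketch of the idea, not the actual tool code: the file name and the sample fields (temp_c, power_w, sm_clock_mhz) are hypothetical.

```python
import json
import os
import tempfile
import time


def append_sample(path: str, sample: dict) -> None:
    """Append one telemetry sample as a JSON line and force it to disk."""
    line = json.dumps({"ts": time.time(), **sample}) + "\n"
    with open(path, "a", encoding="utf-8") as log:
        log.write(line)
        log.flush()              # push Python's buffer to the OS
        os.fsync(log.fileno())   # ask the OS to commit it to storage


# Demo in a temporary directory; a real run would use a persistent path.
log_path = os.path.join(tempfile.mkdtemp(), "gb10_telemetry.jsonl")
append_sample(log_path, {"temp_c": 56, "power_w": 68.4, "sm_clock_mhz": 2294})
append_sample(log_path, {"temp_c": 57, "power_w": 70.1, "sm_clock_mhz": 2294})

with open(log_path, encoding="utf-8") as log:
    for record in log:
        print(record.strip())
```

One JSON object per line means a log cut off by a hard reset still parses up to the last complete line.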

parallelArchitect · May 1, 2026, 9:22pm

Follow-up to the baseline measurements and observability discussion in post #10 of this thread.

As a companion to the original CUPTI baseline work, we added a small probing tool to specifically test whether CUPTI_ACTIVITY_KIND_NVLINK emits Activity records on GB10 under different workload states.

What these numbers mean

cupti-activity-collector enables three CUPTI Activity kinds and counts the records returned by the API:

Activity Type	Meaning
`KERNEL`	GPU compute launch records
`MEMCPY`	Memory transfer Activity records
`NVLINK`	NVLink Activity records

We tested three system states:

State	KERNEL	MEMCPY	NVLINK
Model loaded & generating	336	5376	0
Model loaded, idle	848	13568	0
Model unloaded, idle	845	13520	0

Interpretation of the counts:

KERNEL — number of GPU compute Activity records emitted by CUPTI, not a utilization or performance metric
MEMCPY — number of memory transfer Activity records emitted by CUPTI, not total bytes transferred

Lower KERNEL counts during generation do not imply lower GPU utilization. The generation workload appears to run fewer, longer-lived kernels, while the idle states produce many smaller runtime and framework operations.

KERNEL and MEMCPY records appear normally in all three runs, which confirms the CUPTI Activity API itself is functioning on GB10.
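
One quick way to see that these are record counts rather than performance figures is to look at the ratio between them. The sketch below uses only the counts from the table above; the dictionary keys are illustrative labels.

```python
# (KERNEL records, MEMCPY records) per state, from the table above.
runs = {
    "loaded, generating": (336, 5376),
    "loaded, idle": (848, 13568),
    "unloaded, idle": (845, 13520),
}


def memcpy_per_kernel(kernel_records: int, memcpy_records: int) -> float:
    """MEMCPY records emitted per KERNEL record."""
    return memcpy_records / kernel_records


for state, (kernel, memcpy) in runs.items():
    print(f"{state:<20} {memcpy_per_kernel(kernel, memcpy):.1f} MEMCPY per KERNEL")
```

The ratio is 16 in all three states. That would fit counts driven by a fixed per-iteration pattern in whatever workload is being traced rather than by model activity, but these numbers alone do not show where the pattern comes from.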

NVLINK Activity records remained at 0 in every state tested:

unloaded idle system
model resident but idle
active token generation workload

Important distinction:

this is not an API initialization failure
CUPTI_ACTIVITY_KIND_NVLINK enables successfully
it simply does not emit records in these runs

At this point the result is best described as:

No NVLINK Activity records observed on GB10 across the tested workload states.

Still open:

whether this is expected behavior on NVLink-C2C systems
whether future drivers or toolkits expose additional NVLINK visibility

This now joins the other currently observed GB10 observability gaps:

UNIFIED_MEMORY_COUNTER — CUPTI_ERROR_NOT_READY
nvmlDeviceGetClockInfo(NVML_CLOCK_MEM) — returns N/A
CUPTI_ACTIVITY_KIND_NVLINK — enables successfully, emits zero records in the tested workloads

Data collected by @azampatti on GB10 (SM 12.1), CUDA 13.0, CUPTI 130001, driver 580.142. Three runs: model loaded & generating, model loaded idle, model unloaded idle. Tool: cupti-activity-collector — parallelArchitect and @dustin1925.

parallelArchitect · May 10, 2026, 9:49am

Over the past week, I’ve been building gb10-kernel-probe to address a gap in GB10 / SM121a characterization tooling.

The tool runs sustained CUTLASS GEMM sweeps across tile and cluster-topology configurations while collecting hardware telemetry throughout execution.

Current sweep axes include:

threadblock tile shape
warp tile shape
pipeline stage depth
cluster topology
datatype
alignment
matrix layout

Telemetry captured per config includes:

TFLOPS
shared memory usage
occupancy
GPU temperature
power draw
SM clocks
PTX/kernel metadata

The sweep data is now exposing scheduling, thermal, power, and topology behavior during sustained tensor-core GEMM execution on GB10 systems.

New comparison data from two GB10 platforms:

ASUS GX10 (azampatti)
DGX Spark (dustin1925)

Important context:

azampatti ran the 48-config fast sweep
dustin1925 ran the full 96-config sweep (--full, all cluster shapes enabled)

=== STARTING CONDITIONS ===

azampatti (GX10):

Start temp: 56°C
Warm-start condition

dustin1925 (DGX Spark):

Start temp: 42°C
Cool-start condition

Despite the 14°C difference at sweep start, both systems converged near the same sustained operating region during tensor-core GEMM execution.

=== THERMAL BEHAVIOR ===

azampatti (GX10):

Rapid thermal rise
Plateau behavior near ~62°C
~+6°C rise during 48-config sweep

dustin1925 (DGX Spark):

Gradual thermal accumulation
Stabilized near ~62-65°C
~+20°C rise during full 96-config sweep

=== POWER / CLOCK BEHAVIOR ===

GX10:

Avg Power: ~68.4 W
Peak Power: ~76.9 W

DGX Spark:

Avg Power: ~67.7 W
Peak Power: ~81.4 W

Both systems maintained stable sustained power behavior throughout execution.

=== PERFORMANCE OBSERVATIONS ===

No sustained thermal or clock throttling was observed on either system.

One interesting result:
the highest throughput configuration did NOT correspond to the highest SM clocks.

Best config:

13.35 TFLOPS @ 2294 MHz

Lowest config:

3.97 TFLOPS @ 2398 MHz

For these GEMM kernels on GB10 / SM121a, tile shape, cluster topology, and occupancy behavior appear more influential than raw SM frequency alone.
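
Normalizing throughput by clock makes the point concrete. A small sketch using the two configurations quoted above; the helper name is illustrative.

```python
def tflops_per_ghz(tflops: float, clock_mhz: float) -> float:
    """Throughput delivered per GHz of SM clock."""
    return tflops / (clock_mhz / 1000.0)


best = tflops_per_ghz(13.35, 2294)    # best config
lowest = tflops_per_ghz(3.97, 2398)   # lowest config

print(f"best   : {best:.2f} TFLOPS per GHz")
print(f"lowest : {lowest:.2f} TFLOPS per GHz")
print(f"ratio  : {best / lowest:.1f}x")
```

The slower-clocked configuration does several times more work per clock, so the gap comes from the kernel configuration and not from frequency.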

=== CLUSTER TOPOLOGY RESULTS ===

64x64x32:

1x1x1: 4.05 TFLOPS
2x1x1: 3.99 TFLOPS
2x2x1: 3.97 TFLOPS

The smaller tile regresses slightly as cluster size increases.

128x128x32:

1x1x1: 13.20 TFLOPS
2x1x1: 13.35 TFLOPS
2x2x1: 13.10 TFLOPS

The larger tile benefits modestly from 2x1x1, then regresses again at 2x2x1.

So larger cluster topology is not acting as a universal throughput gain on GB10:

smaller tiles regress slightly
larger tiles benefit modestly from 2x1x1
larger cluster grouping does not consistently improve throughput
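
The same conclusion can be read off as percentage deltas against the 1x1x1 baseline. A short sketch over the numbers listed above; the helper name is illustrative.

```python
# TFLOPS per cluster shape, from the results above.
results = {
    "64x64x32": {"1x1x1": 4.05, "2x1x1": 3.99, "2x2x1": 3.97},
    "128x128x32": {"1x1x1": 13.20, "2x1x1": 13.35, "2x2x1": 13.10},
}


def delta_vs_baseline(tile: dict, baseline: str = "1x1x1") -> dict:
    """Percent change of each cluster shape relative to the baseline shape."""
    base = tile[baseline]
    return {shape: (tflops - base) / base * 100.0
            for shape, tflops in tile.items()}


for tile_name, tile in results.items():
    for shape, delta in delta_vs_baseline(tile).items():
        print(f"{tile_name:<11} {shape}: {delta:+.2f}%")
```

All deltas stay within about two percent, so a single sweep per machine may not be enough to separate topology effects from run-to-run noise.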

The analyzer layer is now exposing:

thermal trajectory
sustained power behavior
topology sensitivity
clock stability
platform convergence behavior

rather than raw benchmark numbers alone.

Huge thanks to:

azampatti for the GX10 sweep data
dustin1925 for the full sustained DGX Spark runs and validation work

Community-contributed runs are making it possible to build real comparative SM121a characterization data instead of isolated single-system observations.

Tooling + methodology:
https://github.com/parallelArchitect/gb10-kernel-probe

parallelArchitect · May 12, 2026, 11:07am

Correction: both ran 48 configs each, from different thermal starting states.
