GB10 Hardware Baseline — First Direct Measurements and Findings

nvidia-uma-fault-probe v1.2.0 — built from community data

Three low-level probes that measure what profilers cannot on GB10:

uma_probe — cycle-accurate memory access latency
uma_atomic — NVLink-C2C coherence cost at atomic scope
uma_bw — raw LPDDR5X bandwidth under real workload


Data is essential. The community relies on it to build and
optimize model workflows, so visibility into what is happening
inside the hardware matters just as much.

Community data and feedback on the tools made this work
possible: sosreports, kernel logs, bug reports, and real field
data. Without them, it would all be speculation.

Thanks to my main contributor, @azampatti, who stayed with me
through it all, patient and committed. He had the hardware, I
had the data, and together we turned it into actionable
insights.

The motivation was the Nsight Systems UVM profiling gap. On GB10:
Nsight UVM trace — unsupported
CUPTI UVM events — limited, scope issues
NVML memory clock — not exposed by driver (that’s why peak BW shows 0)

The tools measure from inside the kernel using inline PTX
instructions compiled natively by nvcc. No API callbacks,
no driver hooks, no profiler overhead.

%clock64 — hardware cycle counter read directly from inside
the executing kernel. Not a timer API call — the actual GPU
clock register.

ld.global.cv — cache-volatile load that bypasses L1 and L2,
forcing the access to go to the memory controller. The load
that triggers the measurement IS the measurement.

atom.global.gpu / atom.global.sys — atomic operations at GPU
scope and system scope. The difference in latency between the
two scopes is the coherence cost, measured cycle-accurate from
inside the kernel.

No CUPTI. No NVML. No profiler running alongside. The kernel
measures itself. This means no observer effect — the
measurement does not change what is being measured. The load
latency you see is the actual load latency the GPU experiences
during real workloads.


In the uma_bw output on GB10:

Peak : 0 GB/s theoretical
Note : HW_COHERENT_UMA: One LPDDR5X pool. NVLink-C2C.
Peak BW not reported (memory clock N/A on this platform).

The tool tries to derive peak bandwidth from the memory clock.
On GB10 the driver does not expose that clock:
cudaDeviceGetAttribute returns 0 for it and NVML reports N/A.
Rather than fabricate a number, the tool reports 0 and
explains why.
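
The fallback described above can be sketched as follows. This is a minimal illustration, not the tool's source: the function name, the transfers-per-clock factor, and the 256-bit bus width in the usage line are assumptions made for the example.

```python
from typing import Optional

def theoretical_peak_gbs(mem_clock_mhz: float, bus_width_bits: int,
                         transfers_per_clock: int = 2) -> Optional[float]:
    """Return a theoretical peak in GB/s, or None when the clock is unknown.

    Hypothetical helper: peak = clock * transfers per clock * bus width in bytes.
    """
    if mem_clock_mhz <= 0:
        # The driver did not expose the memory clock (the GB10 case),
        # so no peak can be derived and none is invented.
        return None
    bytes_per_transfer = bus_width_bits / 8
    return mem_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9

# Usage: a memory clock of 0 mirrors what the driver reports on GB10.
peak = theoretical_peak_gbs(mem_clock_mhz=0, bus_width_bits=256)
print("Peak : not reported" if peak is None else f"Peak : {peak:.0f} GB/s theoretical")
```

Returning None instead of 0 keeps "unknown" separate from "measured as zero", which avoids summary lines such as "0.0% of 0 GB/s peak".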


uma_probe — UMA Fault Latency Probe
GB10 SM 12.1 | CUDA 13.0 | Driver 580.142

=== UMA Fault Latency Probe v1.2.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Clock : 2418 MHz

COLD p50: 16.5 ns (40 cycles)
WARM p50: 16.5 ns (40 cycles)
COLD/WARM ratio: 1.00x
Platform : HARDWARE_COHERENT_UMA
Done.

Hardware coherence resolves page state before the load
instruction executes. No measurable first-touch penalty.
LPDDR5X baseline access latency: 16.5ns / 40 cycles.
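
The cycle and nanosecond figures above are related by the reported clock. A small sketch of that conversion, assuming the %clock64 counter ticks at the 2418 MHz clock printed by the probe (the helper name is made up for this example):

```python
def cycles_to_ns(cycles: int, clock_mhz: float) -> float:
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    # 1 MHz = 1e6 cycles per second, so one cycle lasts 1000 / MHz nanoseconds.
    return cycles * 1000.0 / clock_mhz

# Usage with the cycle counts reported by uma_probe and uma_atomic.
for cycles in (40, 24, 25):
    print(f"{cycles} cycles = {cycles_to_ns(cycles, 2418):.1f} ns")
```

If the clock changes during a run, the same cycle count maps to a different time, so the nanosecond values are only as stable as the clock.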


uma_atomic — NVLink-C2C Coherence Probe
GB10 SM 12.1 | CUDA 13.0 | Driver 580.142

--- Idle (vLLM loaded, model not inferencing) ---

=== UMA Atomic Coherence Probe v1.1.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Clock : 2418 MHz
Elements : 65536
Warmup : 3 runs Measure: 5 runs
Kernel : inline PTX atomics, nvcc native
PTX gpu : atom.global.gpu.add.u32
PTX sys : atom.global.sys.add.u32

GPU-scope pass (atom.global.gpu):
p50: 9.9 ns p90: 16.5 ns p99: 36.8 ns
SYS-scope pass (atom.global.sys):
p50: 9.9 ns p90: 17.8 ns p99: 35.2 ns
CONTENTION pass (sys-scope + CPU concurrent):
p50: 9.9 ns p90: 17.8 ns p99: 34.3 ns

=== Summary ===
GPU-scope p50 : 9.9 ns (24 cycles) [atom.global.gpu]
SYS-scope p50 : 9.9 ns (24 cycles) [atom.global.sys]
CONTENTION p50: 9.9 ns (24 cycles) [sys + CPU concurrent]
SYS/GPU ratio : 1.00x
Coherence cost: 0.0 ns overhead
Platform : HARDWARE_COHERENT_UMA
Done.

--- Under inference load ---

=== UMA Atomic Coherence Probe v1.1.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Clock : 2418 MHz
Elements : 65536
Warmup : 3 runs Measure: 5 runs
Kernel : inline PTX atomics, nvcc native
PTX gpu : atom.global.gpu.add.u32
PTX sys : atom.global.sys.add.u32

GPU-scope pass (atom.global.gpu):
p50: 10.3 ns p90: 26.5 ns p99: 37.2 ns
SYS-scope pass (atom.global.sys):
p50: 10.3 ns p90: 26.9 ns p99: 37.2 ns
CONTENTION pass (sys-scope + CPU concurrent):
p50: 10.3 ns p90: 26.9 ns p99: 37.2 ns

=== Summary ===
GPU-scope p50 : 10.3 ns (25 cycles) [atom.global.gpu]
SYS-scope p50 : 10.3 ns (25 cycles) [atom.global.sys]
CONTENTION p50: 10.3 ns (25 cycles) [sys + CPU concurrent]
SYS/GPU ratio : 1.00x
Coherence cost: 0.0 ns overhead
Platform : HARDWARE_COHERENT_UMA
Done.

Median atomic latency stable under load (+0.4ns).
Tail latency increases (p90: 16.5 → 26.5 ns) but coherence
overhead remains zero. NVLink-C2C hardware coherence is
transparent at atomic level under both idle and active
inference conditions.
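
For readers who want to reproduce the summary lines from raw samples, here is a rough sketch of how percentiles and the coherence cost could be derived. The nearest-rank percentile method and the function names are assumptions, and the sample lists are synthetic values chosen for illustration, not measured data.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def coherence_cost(gpu_ns, sys_ns):
    """Return (overhead in ns, SYS/GPU ratio) based on the two p50 values."""
    gpu_p50 = percentile(gpu_ns, 50)
    sys_p50 = percentile(sys_ns, 50)
    return sys_p50 - gpu_p50, sys_p50 / gpu_p50

# Synthetic samples for illustration only (not probe output).
gpu_scope = [9.9] * 6 + [16.5] * 3 + [36.8]
sys_scope = [9.9] * 6 + [17.8] * 3 + [35.2]

overhead, ratio = coherence_cost(gpu_scope, sys_scope)
print(f"SYS/GPU ratio : {ratio:.2f}x")
print(f"Coherence cost: {overhead:.1f} ns overhead")
```

Comparing medians hides tail differences, which is why the p90 and p99 columns are worth reading alongside the ratio.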


uma_bw — LPDDR5X Bandwidth Test
GB10 SM 12.1 | CUDA 13.0 | Driver 580.142

--- Idle (vLLM loaded, model not inferencing) ---

=== UMA Bandwidth Test v2.0.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Peak : 0 GB/s theoretical
Buffer : 4 GB
Runs : 2 warmup + 5 measured
PTX read : ld.global.cg (L1 bypass)
PTX write: st.global.cs (L2 bypass, true DRAM)
Note : HW_COHERENT_UMA: One LPDDR5X pool. NVLink-C2C.
Peak BW not reported (memory clock N/A on this platform).

--- GPU (prefetched to GPU) ---
GPU read : 161.31 GB/s stddev 2.82
GPU write : 116.15 GB/s stddev 0.48 [PTX .cs]
GPU copy : 164.45 GB/s [read+write]

--- CPU (prefetched to CPU) ---
CPU read : 7.62 GB/s stddev 0.01
CPU write : 57.95 GB/s

--- Concurrent CPU + GPU ---
GPU concurrent: 155.25 GB/s
CPU concurrent: 7.64 GB/s
Total : 162.89 GB/s

=== Summary ===
GPU read : 161.31 GB/s
GPU write : 116.15 GB/s [PTX .cs — true DRAM]
GPU copy : 164.45 GB/s
CPU read : 7.62 GB/s
CPU write : 57.95 GB/s
Conc total: 162.89 GB/s
Platform : HARDWARE_COHERENT_UMA
Done.

--- Under inference load ---

=== UMA Bandwidth Test v2.0.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Peak : 0 GB/s theoretical
Buffer : 4 GB
Runs : 2 warmup + 5 measured
PTX read : ld.global.cg (L1 bypass)
PTX write: st.global.cs (L2 bypass, true DRAM)
Note : HW_COHERENT_UMA: One LPDDR5X pool. NVLink-C2C.
Peak BW not reported (memory clock N/A on this platform).

--- GPU (prefetched to GPU) ---
GPU read : 90.49 GB/s stddev 13.58
GPU write : 62.82 GB/s stddev 8.43 [PTX .cs]
GPU copy : 83.47 GB/s [read+write]

--- CPU (prefetched to CPU) ---
CPU read : 5.79 GB/s stddev 0.12
CPU write : 43.19 GB/s

--- Concurrent CPU + GPU ---
GPU concurrent: 96.10 GB/s
CPU concurrent: 5.83 GB/s
Total : 101.93 GB/s

=== Summary ===
GPU read : 90.49 GB/s (-44% vs idle)
GPU write : 62.82 GB/s [PTX .cs — true DRAM]
GPU copy : 83.47 GB/s
CPU read : 5.79 GB/s
CPU write : 43.19 GB/s
Conc total: 101.93 GB/s
Platform : HARDWARE_COHERENT_UMA
Done.

The 44% GPU read bandwidth drop under inference load directly
explains tok/s variability. Memory bandwidth is the bottleneck —
not coherence, not latency.
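
The idle versus loaded comparison can be recomputed for every row, not only GPU read. A short sketch using the numbers from the two summaries above (the helper name is an assumption):

```python
def pct_change(idle: float, loaded: float) -> float:
    """Percentage change from the idle value to the loaded value."""
    return (loaded - idle) / idle * 100.0

# GB/s values copied from the idle and under-load summaries above.
idle = {"GPU read": 161.31, "GPU write": 116.15, "GPU copy": 164.45,
        "CPU read": 7.62, "CPU write": 57.95}
load = {"GPU read": 90.49, "GPU write": 62.82, "GPU copy": 83.47,
        "CPU read": 5.79, "CPU write": 43.19}

for name in idle:
    print(f"{name:9s}: {pct_change(idle[name], load[name]):+.1f}% vs idle")
```

Running it shows that the GPU rows drop by a larger share than the CPU rows in this pair of runs.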


Companion tools — see full README for more information:

sparkview (live telemetry, monitoring, and event logging)

nvidia-uma-fault-probe (PTX-based latency, bandwidth, and coherence measurement)

Very interesting to read and explore this with you.

It was eye-opening: right after this, I could calculate the approximate tok/sec I would get for each model, even before loading it, based on real numbers!

Thanks!

-Aldo

I got strange results

./uma_bw
=== UMA Bandwidth Test v2.0.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Peak : 0 GB/s theoretical
Buffer : 4 GB
Runs : 2 warmup + 5 measured
PTX read : ld.global.cg (L1 bypass)
PTX write: st.global.cs (L2 bypass, true DRAM)
Note : HW_COHERENT_UMA: One LPDDR5X pool. NVLink-C2C. Peak BW not reported (memory clock N/A on this platform).

Initializing... done

--- GPU (prefetched to GPU) ---
GPU read : 4587571.38 GB/s stddev 870421.22
GPU write : 5248257.32 GB/s stddev 172073.91 [PTX .cs]
GPU copy... 51622203.66 GB/s [read+write]

--- CPU (prefetched to CPU) ---
CPU read : 7.62 GB/s stddev 0.00
CPU write : 62.81 GB/s

--- Concurrent CPU + GPU ---
measuring...
GPU concurrent: 2281368.36 GB/s
CPU concurrent: 7.72 GB/s
Total : 2281376.08 GB/s

=== Summary ===
GPU read : 4587571.38 GB/s ( 0.0% of 0 GB/s peak)
GPU write : 5248257.32 GB/s [PTX .cs — true DRAM]
GPU copy : 51622203.66 GB/s
CPU read : 7.62 GB/s
CPU write : 62.81 GB/s
Conc total: 2281376.08 GB/s

Platform : HARDWARE_COHERENT_UMA
JSON : uma_bw_results.json
Done.

@pontostroy Can you share:

nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
nvcc --version

Also run:

./uma_probe
./uma_atomic

And share those outputs.

azampatti got clean results on GB10 with driver 580.142, CUDA 13.0 (161 GB/s idle, 90 GB/s under load).

Also, did you run sparkview alongside the tools? It monitors clock state and thermals during the run and helps identify whether the system was throttled or in an unusual state (parallelArchitect/sparkview on GitHub).

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_Dec_16_07:27:17_PM_PST_2025
Cuda compilation tools, release 13.1, V13.1.115
Build cuda_13.1.r13.1/compiler.37061995_0

With the CUDA 13.0 nvcc:
/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 probe_launcher.cu -o uma_probe -lcudart -lcuda -lpthread
/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 -arch=sm_90 uma_atomic_test.cu -o uma_atomic -lcudart -lcuda -lpthread
/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 uma_bandwidth_test.cu -o uma_bw -lcudart -lcuda -lpthread
everything works now:

=== UMA Bandwidth Test v2.0.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Peak : 0 GB/s theoretical
Buffer : 4 GB
Runs : 2 warmup + 5 measured
PTX read : ld.global.cg (L1 bypass)
PTX write: st.global.cs (L2 bypass, true DRAM)
Note : HW_COHERENT_UMA: One LPDDR5X pool. NVLink-C2C. Peak BW not reported (memory clock N/A on this platform).

Initializing... done

--- GPU (prefetched to GPU) ---
GPU read : 165.68 GB/s stddev 2.41
GPU write : 115.89 GB/s stddev 1.30 [PTX .cs]
GPU copy... 167.27 GB/s [read+write]

--- CPU (prefetched to CPU) ---
CPU read : 7.63 GB/s stddev 0.00
CPU write : 64.97 GB/s

--- Concurrent CPU + GPU ---
measuring...
GPU concurrent: 160.56 GB/s
CPU concurrent: 7.73 GB/s
Total : 168.29 GB/s

=== Summary ===
GPU read : 165.68 GB/s ( 0.0% of 0 GB/s peak)
GPU write : 115.89 GB/s [PTX .cs — true DRAM]
GPU copy : 167.27 GB/s
CPU read : 7.63 GB/s
CPU write : 64.97 GB/s
Conc total: 168.29 GB/s

@pontostroy Thanks — clean uma_bw results. Still need:

nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
./uma_probe
./uma_atomic

Also, the CUDA 13.1 vs 13.0 difference is worth noting.

Could you confirm the driver version you’re running?

This helps document the CUDA 13.1 event timing issue for other GB10 users who may run into the same behavior.

Update — CUPTI UVM Event Collection: GB10 Confirmed

Following up on the diagnostic gaps noted in the
original post.

cupti-uma-probe has now been tested on two
independent GB10 machines, both CUDA 13.0,
Driver 580.142:

[4] cuptiActivityRegisterCallbacks: SUCCESS
[5] cuptiActivityEnable(UNIFIED_MEMORY_COUNTER): FAILED
CUPTI_ERROR_NOT_READY

Total CUPTI records : 0
UVM activity events : 0
STATUS: CUPTI_UMA_ENABLE_FAILED

This confirms three diagnostic gaps on GB10:

  1. Nsight UVM trace — unsupported
  2. CUPTI UVM events — CUPTI_ERROR_NOT_READY at
    API level on both machines tested
  3. NVML memory clock — nvidia-smi returns N/A
    (root cause of Peak BW showing 0 in uma_bw output)

Note: GB10 (SM 12.1) and hardware-coherent UMA
configurations are not explicitly covered in the
CUPTI UVM activity documentation. The expected
behavior for this platform is unclear.

Full probe source and results:

nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
580.142, NVIDIA GB10

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2026 NVIDIA Corporation
Built on Thu_Mar_19_11:11:41_PM_PDT_2026
Cuda compilation tools, release 13.2, V13.2.78
Build cuda_13.2.r13.2/compiler.37668154_0

=== uma_probe ===
=== UMA Fault Latency Probe v1.2.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Clock : 2418 MHz
Buffer : 64 MB (16777216 elements)
Kernel : ld.global.cv + clock64 (inline PTX, nvcc native)
Note : HW_COHERENT_UMA: One physical pool. Hardware coherence active. Hardware coherence active.

COLD pass (CPU->GPU fault):
touching pages from CPU... done
running kernel... done
p50: 0.0 ns p90: 0.0 ns p99: 0.0 ns

WARM pass (GPU resident):
prefetching to GPU... done
running kernel... done
p50: 0.0 ns p90: 0.0 ns p99: 0.0 ns

PRESSURE pass (thrash):
mixed CPU/GPU residency... done
running kernel... done
p50: 0.0 ns p90: 0.0 ns p99: 0.0 ns

=== Summary ===
COLD p50: 0.0 ns ( 0 cycles)
WARM p50: 0.0 ns ( 0 cycles)
PRESS p50: 0.0 ns ( 0 cycles)
COLD/WARM ratio: 0.00x

Platform : HARDWARE_COHERENT_UMA
JSON : uma_probe_results.json
Done.
Cooling down (10s)...

=== uma_atomic ===
=== UMA Atomic Coherence Probe v1.1.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Clock : 2418 MHz
Elements : 65536
Warmup : 3 runs Measure: 5 runs
Kernel : inline PTX atomics, nvcc native
PTX gpu : atom.global.gpu.add.u32
PTX sys : atom.global.sys.add.u32

GPU-scope pass (atom.global.gpu):
p50: 0.0 ns p90: 0.0 ns p99: 0.0 ns

SYS-scope pass (atom.global.sys):
p50: 0.0 ns p90: 0.0 ns p99: 0.0 ns

CONTENTION pass (sys-scope + CPU concurrent):
p50: 0.0 ns p90: 0.0 ns p99: 0.0 ns

=== Summary ===
GPU-scope p50 : 0.0 ns ( 0 cycles) [atom.global.gpu]
SYS-scope p50 : 0.0 ns ( 0 cycles) [atom.global.sys]
CONTENTION p50: 0.0 ns ( 0 cycles) [sys + CPU concurrent]
SYS/GPU ratio : 0.00x
Coherence cost: 0.0 ns overhead

Platform : HARDWARE_COHERENT_UMA
JSON : uma_atomic_results.json
Done.
Cooling down (10s)...

=== uma_bw ===
=== UMA Bandwidth Test v2.0.0 ===
GPU : NVIDIA GB10 (SM 12.1)
Platform : HARDWARE_COHERENT_UMA
Coherent : yes (hardware)
Peak : 0 GB/s theoretical
Buffer : 4 GB
Runs : 2 warmup + 5 measured
PTX read : ld.global.cg (L1 bypass)
PTX write: st.global.cs (L2 bypass, true DRAM)
Note : HW_COHERENT_UMA: One LPDDR5X pool. NVLink-C2C. Peak BW not reported (memory clock N/A on this platform).

Initializing... done

--- GPU (prefetched to GPU) ---
GPU read : 4664689.20 GB/s stddev 1206403.58
GPU write : 5205627.42 GB/s stddev 446497.65 [PTX .cs]
GPU copy... 53687092.56 GB/s [read+write]

--- CPU (prefetched to CPU) ---
CPU read : 7.64 GB/s stddev 0.00
CPU write : 66.29 GB/s

--- Concurrent CPU + GPU ---
measuring...
GPU concurrent: 2387721.23 GB/s
CPU concurrent: 7.73 GB/s
Total : 2387728.95 GB/s

=== Summary ===
GPU read : 4664689.20 GB/s ( 0.0% of 0 GB/s peak)
GPU write : 5205627.42 GB/s [PTX .cs — true DRAM]
GPU copy : 53687092.56 GB/s
CPU read : 7.64 GB/s
CPU write : 66.29 GB/s
Conc total: 2387728.95 GB/s

Platform : HARDWARE_COHERENT_UMA
JSON : uma_bw_results.json
Done.
Cooling down (30s)...

This confirms it across all three CUDA versions you tested.

Driver: 580.142 (all runs)

CUDA 13.0  — %clock64 correct, all probes valid     ✓
CUDA 13.1  — GPU timing broken, overflow results    ✗
CUDA 13.2  — %clock64 returns 0, uma_bw overflows   ✗

CPU read/write numbers are correct on all three versions
because CPU timing uses CLOCK_MONOTONIC (the Linux monotonic
clock), not %clock64. The failure appears specific to GPU-side
timing (PTX %clock64) when built for SM 12.1 with CUDA 13.1 or 13.2.
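
One way to see why the GPU numbers are implausible is to invert them. The sketch below assumes bandwidth is reported as bytes moved divided by elapsed time and that the "4 GB" buffer is 4 GiB. Both are assumptions about the tool, and the function name is made up.

```python
def implied_elapsed_us(bytes_moved: int, reported_gbs: float) -> float:
    """Elapsed time in microseconds implied by a reported bandwidth figure."""
    return bytes_moved / (reported_gbs * 1e9) * 1e6

BUFFER_BYTES = 4 * 1024**3  # assumption: "Buffer : 4 GB" means 4 GiB

good = implied_elapsed_us(BUFFER_BYTES, 165.68)      # CUDA 13.0 build, GPU read
bad = implied_elapsed_us(BUFFER_BYTES, 4664689.20)   # CUDA 13.2 build, GPU read

print(f"CUDA 13.0 build: {good:,.0f} us per pass")
print(f"CUDA 13.2 build: {bad:.2f} us per pass")
```

Under these assumptions the broken builds imply that a full pass over the buffer finished in under a microsecond, which points at the timer rather than the memory system.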

Build requirement: CUDA 13.0 only.

/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 \
  probe_launcher.cu -o uma_probe -lcudart -lcuda -lpthread

/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 \
  uma_atomic_test.cu -o uma_atomic -lcudart -lcuda -lpthread

/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 \
  uma_bandwidth_test.cu -o uma_bw -lcudart -lcuda -lpthread
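
To guard against building with the wrong toolkit by accident, a pre-build check could parse the output of nvcc --version. A minimal sketch, where the function names are hypothetical and the sample strings are the release lines pasted earlier in this thread:

```python
import re

def nvcc_release(version_text: str):
    """Extract (major, minor) from nvcc --version output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    return (int(match.group(1)), int(match.group(2))) if match else None

def is_known_good(version_text: str, good=(13, 0)) -> bool:
    """True only for the toolkit release the probes were validated with."""
    return nvcc_release(version_text) == good

# Release lines as they appear in the nvcc output posted above.
samples = [
    "Cuda compilation tools, release 13.1, V13.1.115",
    "Cuda compilation tools, release 13.2, V13.2.78",
]
for line in samples:
    status = "ok" if is_known_good(line) else "rebuild with CUDA 13.0"
    print(nvcc_release(line), status)
```

In practice the text would come from running the nvcc binary that the build actually uses, since several toolkits can be installed side by side.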

Thank you for running all three versions — this is exactly
the systematic data the project needed to confirm the
CUDA version boundary on GB10.

Update — CUPTI Activity on GB10: post-7 finding revised

Post 7 confirmed CUPTI_ERROR_NOT_READY on UNIFIED_MEMORY_COUNTER and noted the broader CUPTI scope on GB10 was unclear.

@dustin1925 reviewed that output and built cupti_kind_sweep to answer the open question — is this a broad CUPTI failure or specific to UVM counters?

The finding: CUPTI Activity is largely functional on GB10. Three kinds are not usable on this platform:

UNIFIED_MEMORY_COUNTER — CUPTI_ERROR_NOT_READY
CONCURRENT_KERNEL — CUPTI_ERROR_NOT_COMPATIBLE
INSTRUCTION_EXECUTION — CUPTI_ERROR_LEGACY_PROFILER_NOT_SUPPORTED

All other kinds collect records normally. @azampatti validated independently on a second GB10 unit. Same kind map confirmed on both (CUDA 13.0, driver 580.142).

Kind map — two independent GB10 units:

KIND                     dustin1925   azampatti
KERNEL                   OK (1)       OK (1)
MEMCPY                   OK (2)       OK (2)
RUNTIME                  OK (7)       OK (9)
DRIVER                   OK (4)       OK (3)
OVERHEAD                 OK (11)      OK (14)
SYNCHRONIZATION          OK (1)       OK (2)
MEMORY2                  OK (2)       OK (4)
NVLINK                   OK (0)*      OK (0)*
UNIFIED_MEMORY_COUNTER   FAILED       SKIPPED
CONCURRENT_KERNEL        FAILED       SKIPPED
INSTRUCTION_EXECUTION    FAILED       SKIPPED

*NVLINK records 0 on synthetic workload — requires validation under real inference load.

From dustin1925’s sweep we built cupti_collector — a GB10-aware CUPTI Activity collector that automatically enables supported kinds and skips unsupported ones at runtime. Now a standalone library:

The correct statement is no longer “CUPTI is limited on GB10.” It is:

CUPTI Activity works on GB10 for execution tracing. UNIFIED_MEMORY_COUNTER is unavailable on this platform.

Two independent measurements are consistent with this:

uma_atomic — SYS/GPU ratio 1.00x. No measurable additional cost for SYS-scope vs GPU-scope atomics, consistent with hardware coherence operating without software-mediated migration.

uma_bw contention sweep — 2.2% GPU throughput drop under cpu-write+gpu-read. On discrete GPUs this pattern typically incurs much larger drops due to UVM-managed movement. On GB10 the impact is minimal under the same access pattern.

Together these results are consistent with a model where coherence and data movement are handled in hardware rather than via fault-driven migration. In that case the UVM fault events that UNIFIED_MEMORY_COUNTER is designed to report are not produced, so the counter has no events to emit. This is a structural limitation of that counter on hardware-coherent UMA systems, not a general CUPTI failure.
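
The enable-or-skip behavior described earlier in this post for cupti_collector can be sketched in a few lines. This is not the collector's code (which calls the CUPTI C API). It only mirrors the control flow, with a stand-in function in place of cuptiActivityEnable and the three failure statuses taken from the kind map.

```python
def enable_supported(kinds, try_enable):
    """Try each activity kind; keep the ones that enable, record the rest."""
    enabled, skipped = [], {}
    for kind in kinds:
        status = try_enable(kind)
        if status == "SUCCESS":
            enabled.append(kind)
        else:
            skipped[kind] = status  # remember why it was skipped
    return enabled, skipped

# Stand-in for the real enable call, using the statuses observed on GB10.
GB10_STATUS = {
    "UNIFIED_MEMORY_COUNTER": "CUPTI_ERROR_NOT_READY",
    "CONCURRENT_KERNEL": "CUPTI_ERROR_NOT_COMPATIBLE",
    "INSTRUCTION_EXECUTION": "CUPTI_ERROR_LEGACY_PROFILER_NOT_SUPPORTED",
}

def fake_enable(kind):
    return GB10_STATUS.get(kind, "SUCCESS")

kinds = ["KERNEL", "MEMCPY", "RUNTIME", "UNIFIED_MEMORY_COUNTER", "CONCURRENT_KERNEL"]
enabled, skipped = enable_supported(kinds, fake_enable)
print("enabled:", enabled)
print("skipped:", skipped)
```

Probing at runtime rather than hard-coding a list means the same logic keeps working if a later driver or toolkit changes which kinds are available.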

All of this work has been built and validated through community runs without direct GB10 access. Direct hardware access would enable controlled experiments, faster iteration, and full validation of edge cases.

Additional tools are in progress, including continuous hardware performance monitoring for long-run stability and drift analysis.

I have no idea what half of this means but I love reading it. And if the result gives us the ability to monitor anything under the sun on this platform, rock on!

I’m pretty much in the same boat as you in terms of knowledge, but I learnt a few things from him while helping.

Long story short, he’s measuring the capabilities of the GB10 directly at the hardware level, skipping all the layers in the middle that might obfuscate or slow things down.

For instance, he measured the effective speed at which the GPU can read data from memory on my GB10 as 166 GB/s, not the theoretical ~270 GB/s. And when you run the math with that number, it explains a lot about the tok/s I’m getting from different models.
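
As a rough illustration of that math: if token generation is memory-bandwidth-bound and each generated token reads the active weights once (both are assumptions), an upper bound on tok/s is bandwidth divided by bytes read per token. The 8 GB model size below is hypothetical.

```python
def est_tokens_per_sec(bandwidth_gbs: float, gb_read_per_token: float) -> float:
    """Upper-bound tok/s for a bandwidth-bound decode step."""
    return bandwidth_gbs / gb_read_per_token

WEIGHTS_GB = 8.0  # hypothetical: weights read per generated token

for label, bw in (("measured idle read", 166.0), ("measured read under load", 90.49)):
    print(f"{label}: about {est_tokens_per_sec(bw, WEIGHTS_GB):.1f} tok/s at most")
```

Real throughput will sit below this bound because of KV-cache reads, kernel launch overhead, and anything else sharing the memory pool.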

Basically, measuring directly from the hardware will help developers understand the GB10 better and troubleshoot/fix things in a more proper way :)

(sorry @parallelArchitect if I butcher any of the technical bits here) :)

@mashie — that’s exactly the right way to think about it. You don’t need to understand every measurement to benefit from it. When something goes wrong on your unit, the data exists to explain why.

@azampatti — you nailed it. That’s a perfect explanation.

This is exactly what the tooling is working toward — giving the community visibility into what their hardware is actually doing, whether it’s running a 122B model or sitting at idle. The gap right now is that the standard monitoring tools don’t expose the signals that matter on GB10. nvidia-smi can’t report memory clock. The vendor profiling stack has limited view into unified memory behavior on this platform — CUPTI UVM fault counters are structurally absent on hardware-coherent UMA, though the broader CUPTI Activity layer works and we now have a GB10-aware collector that uses the confirmed working kinds.

So we measure from the hardware directly — bandwidth, contention, atomic coherence, thermal response, power domains — and make that data available to anyone running a Spark. The goal is that when your model runs slower than expected, or the system throttles, or you hit OOM on a system that shows 90GB free, there’s a tool that tells you exactly what the hardware is doing and why.

The other piece is persistent logging. On GB10 there is no BMC, no out-of-band crash capture. When a hard reset happens the logs stop abruptly and whatever happened before the failure is gone. The tools log signals continuously so that if a failure does occur, the thermal state, power domain readings, PSI pressure, and clock behavior leading up to it are already on disk. The failure is not lost.
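
The idea of getting each sample onto disk before a possible hard reset can be sketched like this. It is an illustration of the technique only, not the tools' logging code. The file name and the field names in the usage line are made up.

```python
import json
import os
import tempfile
import time

def append_sample(path: str, sample: dict) -> None:
    """Append one JSON line and force it to disk before returning."""
    record = {"ts": time.time(), **sample}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to commit it to storage

# Usage with hypothetical field names and example values.
log_path = os.path.join(tempfile.gettempdir(), "gb10_signals.jsonl")
append_sample(log_path, {"temp_c": 62, "power_w": 68.4})
```

Calling fsync on every sample costs some I/O, so a real logger might batch samples and trade a little durability for lower overhead.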

Follow-up to the baseline measurements and observability discussion in post #10.

As a companion to the original CUPTI baseline work, we added a small probing tool to specifically test whether CUPTI_ACTIVITY_KIND_NVLINK emits Activity records on GB10 under different workload states.

What these numbers mean

cupti-activity-collector enables three CUPTI Activity kinds and counts the records returned by the API:

Activity Type   Meaning
KERNEL          GPU compute launch records
MEMCPY          Memory transfer Activity records
NVLINK          NVLink Activity records

We tested three system states:

State                       KERNEL   MEMCPY   NVLINK
Model loaded & generating   336      5376     0
Model loaded, idle          848      13568    0
Model unloaded, idle        845      13520    0

Interpretation of the counts:

  • KERNEL — number of GPU compute Activity records emitted by CUPTI, not a utilization or performance metric

  • MEMCPY — number of memory transfer Activity records emitted by CUPTI, not total bytes transferred

Lower KERNEL counts during generation do not imply lower GPU utilization. The generation workload appears to run fewer, longer-lived kernels, while the idle states produce many smaller runtime and framework operations.

KERNEL and MEMCPY records appear normally in all three runs, which confirms the CUPTI Activity API itself is functioning on GB10.
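
A quick consistency check on the table is to compare record counts across kinds within each state. The snippet below only does arithmetic on the counts listed above. What the resulting ratio means for the workload is not established here.

```python
# Record counts copied from the three-state table above.
counts = {
    "Model loaded & generating": {"KERNEL": 336, "MEMCPY": 5376, "NVLINK": 0},
    "Model loaded, idle": {"KERNEL": 848, "MEMCPY": 13568, "NVLINK": 0},
    "Model unloaded, idle": {"KERNEL": 845, "MEMCPY": 13520, "NVLINK": 0},
}

for state, c in counts.items():
    ratio = c["MEMCPY"] / c["KERNEL"]
    print(f"{state}: {ratio:.1f} MEMCPY records per KERNEL record")
```

A constant ratio across states would suggest the counts track what the collector's own workload issues rather than the model's activity, which fits reading them as record counts and not utilization.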

NVLINK Activity records remained at 0 in every state tested:

  • unloaded idle system

  • model resident but idle

  • active token generation workload

Important distinction:

  • this is not an API initialization failure

  • CUPTI_ACTIVITY_KIND_NVLINK enables successfully

  • it simply does not emit records in these runs

At this point the result is best described as:

No NVLINK Activity records observed on GB10 across the tested workload states.

Still open:

  • whether this is expected behavior on NVLink-C2C systems

  • whether future drivers or toolkits expose additional NVLINK visibility

This now joins the other currently observed GB10 observability gaps:

  • UNIFIED_MEMORY_COUNTER: CUPTI_ERROR_NOT_READY

  • nvmlDeviceGetClockInfo(NVML_CLOCK_MEM): returns N/A

  • CUPTI_ACTIVITY_KIND_NVLINK: enables successfully, emits zero records in the tested workloads

Data collected by @azampatti on GB10 (SM 12.1), CUDA 13.0, CUPTI 130001, driver 580.142. Three runs: model loaded & generating, model loaded idle, model unloaded idle. Tool: cupti-activity-collector — parallelArchitect and @dustin1925.


Over the past week, I’ve been building gb10-kernel-probe to address a gap in GB10 / SM121a characterization tooling.

The tool runs sustained CUTLASS GEMM sweeps across tile and cluster-topology configurations while collecting hardware telemetry throughout execution.

Current sweep axes include:

  • threadblock tile shape

  • warp tile shape

  • pipeline stage depth

  • cluster topology

  • datatype

  • alignment

  • matrix layout

Telemetry captured per config includes:

  • TFLOPS

  • shared memory usage

  • occupancy

  • GPU temperature

  • power draw

  • SM clocks

  • PTX/kernel metadata

The sweep data is now exposing scheduling, thermal, power, and topology behavior during sustained tensor-core GEMM execution on GB10 systems.

New comparison data from two GB10 platforms:

  • ASUS GX10 (azampatti)

  • DGX Spark (dustin1925)

Important context:

  • azampatti ran the 48-config fast sweep

  • dustin1925 ran the full 96-config sweep (--full, all cluster shapes enabled)

=== STARTING CONDITIONS ===

azampatti (GX10):

  • Start temp: 56°C

  • Warm-start condition

dustin1925 (DGX Spark):

  • Start temp: 42°C

  • Cool-start condition

Despite the 14°C difference at sweep start, both systems converged near the same sustained operating region during tensor-core GEMM execution.

=== THERMAL BEHAVIOR ===

azampatti (GX10):

  • Rapid thermal rise

  • Plateau behavior near ~62°C

  • ~+6°C rise during 48-config sweep

dustin1925 (DGX Spark):

  • Gradual thermal accumulation

  • Stabilized near ~62-65°C

  • ~+20°C rise during full 96-config sweep

=== POWER / CLOCK BEHAVIOR ===

GX10:

  • Avg Power: ~68.4 W

  • Peak Power: ~76.9 W

DGX Spark:

  • Avg Power: ~67.7 W

  • Peak Power: ~81.4 W

Both systems maintained stable sustained power behavior throughout execution.

=== PERFORMANCE OBSERVATIONS ===

No sustained thermal or clock throttling was observed on either system.

One interesting result: the highest-throughput configuration did NOT correspond to the highest SM clocks.

Best config:

  • 13.35 TFLOPS @ 2294 MHz

Lowest config:

  • 3.97 TFLOPS @ 2398 MHz

For these GEMM kernels on GB10 / SM121a, tile shape, cluster topology, and occupancy behavior appear more influential than raw SM frequency alone.
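
To make the clock comparison concrete, throughput can be normalized by SM frequency. A small sketch using the two configurations quoted above (the helper name is an assumption):

```python
def tflops_per_ghz(tflops: float, sm_clock_mhz: float) -> float:
    """Throughput normalized by SM clock, in TFLOPS per GHz."""
    return tflops / (sm_clock_mhz / 1000.0)

best = (13.35, 2294)    # TFLOPS, MHz
lowest = (3.97, 2398)

print(f"best   : {tflops_per_ghz(*best):.2f} TFLOPS/GHz")
print(f"lowest : {tflops_per_ghz(*lowest):.2f} TFLOPS/GHz")
print(f"throughput ratio: {best[0] / lowest[0]:.2f}x at {best[1] / lowest[1]:.3f}x the clock")
```

The best configuration delivers more than three times the throughput while running at a slightly lower clock, so frequency alone cannot account for the gap.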

=== CLUSTER TOPOLOGY RESULTS ===

64x64x32:

  • 1x1x1: 4.05 TFLOPS

  • 2x1x1: 3.99 TFLOPS

  • 2x2x1: 3.97 TFLOPS

The smaller tile regresses slightly as cluster size increases.

128x128x32:

  • 1x1x1: 13.20 TFLOPS

  • 2x1x1: 13.35 TFLOPS

  • 2x2x1: 13.10 TFLOPS

The larger tile benefits modestly from 2x1x1, then regresses again at 2x2x1.

So larger cluster topology is not acting as a universal throughput gain on GB10:

  • smaller tiles regress slightly

  • larger tiles benefit modestly from 2x1x1

  • larger cluster grouping does not consistently improve throughput
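
The same comparison expressed as percentage change against the 1x1x1 baseline, using the TFLOPS values listed above (the helper name is made up for this sketch):

```python
# TFLOPS by tile shape and cluster topology, copied from the results above.
results = {
    "64x64x32": {"1x1x1": 4.05, "2x1x1": 3.99, "2x2x1": 3.97},
    "128x128x32": {"1x1x1": 13.20, "2x1x1": 13.35, "2x2x1": 13.10},
}

def relative_to_baseline(by_cluster: dict, baseline: str = "1x1x1") -> dict:
    """Percent change of each cluster shape versus the baseline shape."""
    base = by_cluster[baseline]
    return {shape: (value / base - 1.0) * 100.0 for shape, value in by_cluster.items()}

for tile, by_cluster in results.items():
    for shape, delta in relative_to_baseline(by_cluster).items():
        print(f"{tile} {shape}: {delta:+.2f}%")
```

All of the deltas are within a couple of percent, so run-to-run variance would need to be known before reading much into the smaller differences.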

The analyzer layer is now exposing:

  • thermal trajectory

  • sustained power behavior

  • topology sensitivity

  • clock stability

  • platform convergence behavior

rather than raw benchmark numbers alone.

Huge thanks to:

  • azampatti for the GX10 sweep data

  • dustin1925 for the full sustained DGX Spark runs and validation work

Community-contributed runs are making it possible to build real comparative SM121a characterization data instead of isolated single-system observations.

Tooling + methodology:
https://github.com/parallelArchitect/gb10-kernel-probe

Correction: both ran 48 configs each, at different thermal starting states.