This confirms it across all three versions you tested.
Driver: 580.142 (all runs)
CUDA 13.0 — %clock64 correct, all probes valid ✓
CUDA 13.1 — GPU timing broken, overflow results ✗
CUDA 13.2 — %clock64 returns 0, uma_bw overflows ✗
CPU read/write numbers are correct on all three versions
because CPU timing uses CLOCK_MONOTONIC (the Linux monotonic
clock), not %clock64. The failure is specific to PTX %clock64
compilation for SM 12.1 on CUDA 13.1 and 13.2.
Build requirement: CUDA 13.0 only.
/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 \
probe_launcher.cu -o uma_probe -lcudart -lcuda -lpthread
/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 \
uma_atomic_test.cu -o uma_atomic -lcudart -lcuda -lpthread
/usr/local/cuda-13.0/bin/nvcc -O2 -std=c++17 \
uma_bandwidth_test.cu -o uma_bw -lcudart -lcuda -lpthread
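Because the build only works on CUDA 13.0, a small guard can stop the build when a different toolkit is picked up. The following is a minimal sketch, not part of the project: the function names nvcc_release and build_all and the NVCC and REQUIRED variables are illustrative assumptions, and it relies on nvcc --version printing a "release X.Y" field.

```shell
#!/usr/bin/env bash
# Sketch: refuse to build the probes unless nvcc reports CUDA release 13.0.
# The default NVCC path mirrors the commands above; all names here are illustrative.
set -euo pipefail

NVCC="${NVCC:-/usr/local/cuda-13.0/bin/nvcc}"
REQUIRED="13.0"

# Read `nvcc --version` output on stdin and print the X.Y from "release X.Y".
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Check the toolkit version, then build the three binaries with the same flags.
build_all() {
  local found
  found="$("$NVCC" --version | nvcc_release)"
  if [ "$found" != "$REQUIRED" ]; then
    echo "error: need CUDA $REQUIRED, but $NVCC reports '$found'" >&2
    return 1
  fi
  "$NVCC" -O2 -std=c++17 probe_launcher.cu -o uma_probe -lcudart -lcuda -lpthread
  "$NVCC" -O2 -std=c++17 uma_atomic_test.cu -o uma_atomic -lcudart -lcuda -lpthread
  "$NVCC" -O2 -std=c++17 uma_bandwidth_test.cu -o uma_bw -lcudart -lcuda -lpthread
}
```

To use it, source the file and call build_all from the directory containing the three .cu files. It returns an error before compiling anything if the toolkit is not 13.0.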
Thank you for running all three versions — this is exactly
the systematic data the project needed to confirm the
CUDA version boundary on GB10.