Question on Reproducing DGX Spark (GB10) FP4 1 PFLOPS Performance Using CUTLASS Profiler

Dear NVIDIA Team,
I am currently evaluating the FP4 AI performance claim on the NVIDIA DGX Spark (GB10) product page, which states that the system can deliver up to 1 petaFLOP of FP4 AI compute.

To better understand and validate this claim, I would like to confirm whether the following CUTLASS-based methodology is an appropriate and NVIDIA-aligned way to reproduce or approximate the advertised FP4 peak performance, or whether NVIDIA recommends a different or more official approach.

Below is the specific approach I am currently using.


Method 1 (Primary approach): CUTLASS Profiler

1. Build CUTLASS with GB10 (SM121) support

CUTLASS is built in Release mode with explicit support for the GB10 compute capability and the full kernel library enabled:

cmake .. \
  -DCUTLASS_NVCC_ARCHS=121 \
  -DCUTLASS_LIBRARY_KERNELS=all \
  -DCMAKE_BUILD_TYPE=Release
make -j

2. Run large-scale FP4 GEMM to saturate Tensor Cores

The CUTLASS profiler is then used to execute a GEMM workload designed to fully utilize FP4 Tensor Core throughput:

  • Input matrices A and B use NVFP4 (block-scaled FP4) format
  • Accumulation is performed in FP32
  • Matrix dimensions are chosen to be sufficiently large (at least 8192, typically 16384) to avoid memory bottlenecks and maximize compute utilization
  • Multiple warm-up iterations are used before timed runs
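The sizing rationale above can be checked with a quick back-of-envelope arithmetic-intensity estimate. This is my own rough calculation (not an NVIDIA formula); it assumes 4-bit A/B operands, an FP32 C output, and ignores the NVFP4 block-scale factors, which add only a small fraction of the traffic:

```python
def gemm_arithmetic_intensity(m, n, k,
                              input_bytes_per_elem=0.5,   # FP4 operands: 4 bits
                              output_bytes_per_elem=4.0):  # FP32 C matrix
    """Rough FLOP-per-byte ratio of one m x n x k GEMM."""
    flops = 2 * m * n * k  # one multiply + one add per inner-product term
    bytes_moved = (m * k + k * n) * input_bytes_per_elem \
                  + m * n * output_bytes_per_elem  # read A and B, write C
    return flops / bytes_moved

# At m = n = k = 16384 the intensity is several thousand FLOP/byte,
# far above what is needed to stay compute-bound rather than
# memory-bound, which supports using large square shapes here.
print(gemm_arithmetic_intensity(16384, 16384, 16384))
```

Intensity grows linearly with the dimensions, which is why small shapes under-report peak Tensor Core throughput.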

Example profiler invocation:

./tools/profiler/cutlass_profiler \
  --operation=gemm \
  --A=nvfp4 --B=nvfp4 --C=f32 \
  --accum=f32 \
  --m=16384 --n=16384 --k=16384 \
  --warmup-iterations=10 --profiling-iterations=50

3. Performance evaluation

The achieved FLOP/s reported by the CUTLASS profiler is used to compute the effective FP4 Tensor Core throughput, which is then compared against the advertised ~1 PFLOPS FP4 peak performance for DGX Spark (GB10).
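For reference, the conversion from a profiler-reported runtime to effective throughput is straightforward. A minimal sketch of the arithmetic (the 0.01-second runtime below is a made-up placeholder, not a measurement):

```python
def effective_tflops(m, n, k, runtime_seconds):
    """Effective GEMM throughput in TFLOP/s, counting 2*m*n*k FLOPs per GEMM."""
    return (2 * m * n * k) / runtime_seconds / 1e12

# Hypothetical: a 16384^3 GEMM (~8.8 TFLOPs of work) finishing in 10 ms
# would correspond to ~880 TFLOP/s, i.e. ~88% of the advertised 1 PFLOPS.
print(effective_tflops(16384, 16384, 16384, 0.01))
```

Equivalently, hitting the full 1 PFLOPS on this shape would require the GEMM to complete in roughly 8.8 ms.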


Based on this setup, I would like to ask:

  • Does this CUTLASS profiler–based methodology align with how NVIDIA characterizes or validates the FP4 peak performance of the GB10 Grace Blackwell Superchip?
  • Are there specific CUTLASS kernels, profiler options, or GEMM shapes that NVIDIA recommends when evaluating FP4 performance on GB10?
  • Alternatively, are there other NVIDIA-provided benchmarks or reference measurements that are better suited for reproducing the FP4 performance numbers published on the DGX Spark product page?

Any clarification or official guidance would be greatly appreciated.

Thank you very much for your time and support.

If you want to validate DGX Spark performance, you may want to reference the inference section of this post, which details the performance of specific models with different backends.

Hi @aniculescu
I have verified the parts you mentioned. However, I would like a setup similar to the one used on Jetson Thor, where I can benchmark the performance of the corresponding data types using CUTLASS.

Could you please advise how to rebuild such an environment and what verification or benchmarking commands should be used? Thank you.

Verifying claimed TOPS performance on Jetson Thor – CUTLASS kernel for SM110 does not run, SM80 gives very low performance (~6.9 TFLOP/s) - Jetson Systems / Jetson Thor - NVIDIA Developer Forums