Question on Reproducing DGX Spark (GB10) FP4 1 PFLOPS Performance Using CUTLASS Profiler

Dear NVIDIA Team,
I am currently evaluating the FP4 AI performance claim on the NVIDIA DGX Spark (GB10) product page, which states that the system can deliver up to 1 petaFLOP of FP4 AI compute.

To better understand and validate this claim, I would like to confirm whether the following CUTLASS-based methodology is an appropriate and NVIDIA-aligned way to reproduce or approximate the advertised FP4 peak performance, or whether NVIDIA recommends a different or more official approach.

Below is the specific approach I am currently using.


Method 1 (Primary approach): CUTLASS Profiler

1. Build CUTLASS with GB10 (SM121) support

CUTLASS is built in Release mode with explicit support for the GB10 compute capability and the full kernel library enabled:

cmake .. \
  -DCUTLASS_NVCC_ARCHS=121 \
  -DCUTLASS_LIBRARY_KERNELS=all \
  -DCMAKE_BUILD_TYPE=Release
make -j

2. Run large-scale FP4 GEMM to saturate Tensor Cores

The CUTLASS profiler is then used to execute a GEMM workload designed to fully utilize FP4 Tensor Core throughput:

  • Input matrices A and B use NVFP4 (block-scaled FP4) format
  • Accumulation is performed in FP32
  • Matrix dimensions are chosen to be sufficiently large (at least 8192, typically 16384) to avoid memory bottlenecks and maximize compute utilization
  • Multiple warm-up iterations are used before timed runs
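The sizing rationale above can be checked with a quick back-of-envelope arithmetic-intensity estimate. This is my own rough calculation (not an NVIDIA formula); it assumes 4-bit A/B operands, an FP32 C output, and ignores the NVFP4 block-scale factors, which add only a small fraction of the traffic:

```python
def gemm_arithmetic_intensity(m, n, k,
                              input_bytes_per_elem=0.5,   # FP4 operands: 4 bits
                              output_bytes_per_elem=4.0):  # FP32 C matrix
    """Rough FLOP-per-byte ratio of one m x n x k GEMM."""
    flops = 2 * m * n * k  # one multiply + one add per inner-product term
    bytes_moved = (m * k + k * n) * input_bytes_per_elem \
                  + m * n * output_bytes_per_elem  # read A and B, write C
    return flops / bytes_moved

# At m = n = k = 16384 the intensity is several thousand FLOP/byte,
# far above what is needed to stay compute-bound rather than
# memory-bound, which supports using large square shapes here.
print(gemm_arithmetic_intensity(16384, 16384, 16384))
```

Intensity grows linearly with the dimensions, which is why small shapes under-report peak Tensor Core throughput.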

Example profiler invocation:

./tools/profiler/cutlass_profiler \
  --operation=gemm \
  --A=nvfp4 --B=nvfp4 --C=f32 \
  --accum=f32 \
  --m=16384 --n=16384 --k=16384 \
  --warmup-iterations=10 --profiling-iterations=50

3. Performance evaluation

The achieved FLOP/s reported by the CUTLASS profiler is used to compute the effective FP4 Tensor Core throughput, which is then compared against the advertised ~1 PFLOPS FP4 peak performance for DGX Spark (GB10).
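For reference, the conversion from a profiler-reported runtime to effective throughput is straightforward. A minimal sketch of the arithmetic (the 0.01-second runtime below is a made-up placeholder, not a measurement):

```python
def effective_tflops(m, n, k, runtime_seconds):
    """Effective GEMM throughput in TFLOP/s, counting 2*m*n*k FLOPs per GEMM."""
    return (2 * m * n * k) / runtime_seconds / 1e12

# Hypothetical: a 16384^3 GEMM (~8.8 TFLOPs of work) finishing in 10 ms
# would correspond to ~880 TFLOP/s, i.e. ~88% of the advertised 1 PFLOPS.
print(effective_tflops(16384, 16384, 16384, 0.01))
```

Equivalently, hitting the full 1 PFLOPS on this shape would require the GEMM to complete in roughly 8.8 ms.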


Based on this setup, I would like to ask:

  • Does this CUTLASS profiler–based methodology align with how NVIDIA characterizes or validates the FP4 peak performance of the GB10 Grace Blackwell Superchip?
  • Are there specific CUTLASS kernels, profiler options, or GEMM shapes that NVIDIA recommends when evaluating FP4 performance on GB10?
  • Alternatively, are there other NVIDIA-provided benchmarks or reference measurements that are better suited for reproducing the FP4 performance numbers published on the DGX Spark product page?

Any clarification or official guidance would be greatly appreciated.

Thank you very much for your time and support.

If you want to validate DGX Spark performance, you may want to reference the inference section of this post, which details the performance of specific models with different backends.

Hi @aniculescu
I have verified the parts you mentioned. However, I would like a setup similar to the one used on Jetson Thor, where I can benchmark the performance of the corresponding data types using CUTLASS.

Could you please advise how to rebuild such an environment and what verification or benchmarking commands should be used? Thank you.

Verifying claimed TOPS performance on Jetson Thor – CUTLASS kernel for SM110 does not run, SM80 gives very low performance (~6.9 TFLOP/s) - Jetson Systems / Jetson Thor - NVIDIA Developer Forums