Detailed Compute Performance Metrics for DGX Spark

I am looking for more technical details about the DGX Spark system, particularly its achievable compute performance such as:

– Peak FP32 TFLOPS (non-Tensor)
– Peak TF32 Tensor TFLOPS with FP32 accumulate (with and without sparsity)
– Peak BF16 Tensor TFLOPS with FP32 accumulate (with and without sparsity)
– Peak FP16 Tensor TFLOPS with FP32 accumulate (with and without sparsity)

NVIDIA has stated that the system can deliver around 1000 FP4 Tensor TFLOPS with sparsity, but I have not been able to find documentation or technical papers that list the performance figures above. So any pointers would be greatly appreciated.

PS: Alternatively, it would be helpful if someone with access to the hardware could run a quick test, for example, using mmapeak (https://github.com/ReinForce-II/mmapeak). The output of deviceQuery would be informative too.
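Until official figures surface, the classic peak-throughput formula can at least bound expectations. Below is a minimal sketch; the SM count, FP32 lanes per SM, and boost clock used in the example are illustrative assumptions, not confirmed GB10 specifications:

```python
# Back-of-envelope peak estimate from deviceQuery-style properties.
# All example inputs are illustrative assumptions, not confirmed GB10 specs.
def peak_fp32_tflops(sm_count, fp32_lanes_per_sm, boost_clock_ghz):
    """Peak non-Tensor FP32 = SMs x FP32 lanes x 2 FLOPs (FMA) x clock."""
    return sm_count * fp32_lanes_per_sm * 2 * boost_clock_ghz / 1e3

# Example: 48 SMs, 128 FP32 lanes/SM (common in recent architectures),
# and a hypothetical 1.7 GHz boost clock:
print(peak_fp32_tflops(48, 128, 1.7))  # ~20.9 TFLOPS
```

The same formula applies to Tensor Core peaks, with the per-SM lane count replaced by the per-SM Tensor Core FLOP rate for the given precision.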

Please check out this post about DGX Spark performance: How NVIDIA DGX Spark’s Performance Enables Intensive AI Tasks | NVIDIA Technical Blog

Thank you for sharing the link – the information is much appreciated.

Unfortunately, this blog post does not address the performance metrics I mentioned earlier. We are evaluating the DGX Spark as a development platform not only for large language models and image generation, but also for proprietary CUDA workloads that combine AI and HPC (e.g., hybrid AI-physics models for robotics). Our plan is to prototype on the DGX Spark before training and deploying models on DGX-class systems.

With that in mind, it would be very helpful to know not only the memory bandwidth (documented as 273 GB/s), but also the peak performance of the non-Tensor cores (FP32) and the Tensor cores (TF32, FP16, BF16) with FP32 accumulate. This information would help us better estimate the performance we can expect for our intended workloads.
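One reason the compute peaks matter alongside the documented bandwidth is the roofline balance point: the arithmetic intensity (FLOPs per byte) above which a kernel becomes compute-bound rather than bandwidth-bound. A quick sketch, where the peak-compute value is a hypothetical placeholder rather than a GB10 spec:

```python
# Roofline balance point: arithmetic intensity (FLOP/byte) at which a
# kernel shifts from bandwidth-bound to compute-bound.
bandwidth_gbs = 273.0   # documented DGX Spark memory bandwidth
peak_tflops = 100.0     # hypothetical peak compute, placeholder value
balance = peak_tflops * 1e12 / (bandwidth_gbs * 1e9)
print(round(balance, 1))  # ~366.3 FLOPs/byte under these assumed numbers
```

Kernels with lower arithmetic intensity than the balance point will be limited by the 273 GB/s bandwidth regardless of the compute peak.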

Again, many thanks in advance for any details you can share.

Running inside a Docker container:
Device 0: NVIDIA GB10
Compute capability: 12.1
Total global memory: 119.7 GiB
Multiprocessor count: 48
Running benchmarks with target time: 3.0 seconds
mma_s4s4s32_8_8_32
run: 2999.5 ms 26.9 T(fl)ops
mma_mxf4mxf4f32_16_8_64
run: 2998.7 ms 427.3 T(fl)ops
mma_nvf4nvf4f32_16_8_64
run: 2999.6 ms 427.3 T(fl)ops
mma_f4f4f16_16_8_32
run: 2999.2 ms 213.8 T(fl)ops
mma_f4f4f32_16_8_32
run: 2999.3 ms 213.7 T(fl)ops
mma_f6f6f16_16_8_32
run: 2999.2 ms 213.9 T(fl)ops
mma_f6f6f32_16_8_32
run: 2999.4 ms 213.7 T(fl)ops
mma_mxf6mxf6f32_16_8_32
run: 2996.3 ms 213.7 T(fl)ops
mma_mxf8mxf8f32_16_8_32
run: 2999.3 ms 213.7 T(fl)ops
mma_f8f8f16_16_8_32
run: 3000.0 ms 213.8 T(fl)ops
mma_f8f8f32_16_8_32
run: 2999.4 ms 213.7 T(fl)ops
mma_s8s8s32_16_16_16
run: 2997.7 ms 215.1 T(fl)ops
mma_s8s8s32_32_8_16
run: 3000.3 ms 215.1 T(fl)ops
mma_f16f16f16_16_16_16
run: 2996.8 ms 213.0 T(fl)ops
mma_f16f16f16_32_8_16
run: 2998.7 ms 213.1 T(fl)ops
mma_f16f16f32_16_16_16
run: 2997.6 ms 212.9 T(fl)ops
mma_f16f16f32_32_8_16
run: 2999.9 ms 212.9 T(fl)ops
mma_bf16bf16f32_16_16_16
run: 3000.3 ms 212.9 T(fl)ops
mma_bf16bf16f32_32_8_16
run: 3000.7 ms 212.9 T(fl)ops
mma_tf32tf32f32_16_16_8
run: 2997.9 ms 53.3 T(fl)ops


The numbers for:
mma_mxf4mxf4f32_16_8_64

mma_nvf4nvf4f32_16_8_64

need to be reviewed, because they do not reflect how NVFP4 is processed; there is a multiplying factor.

That is just the output of the script given by the OP, copied and pasted directly. I basically pulled the cuda13 container, compiled, and pasted the results back in.

The 1000 AI TOPS may be a combination of all the various cores working together, versus a difference between theoretical and true performance.

I used this AI-generated script, based on my preferences and my need for robust mirror retrying. I didn't check the original code provided by the OP or tweak the benchmark settings.

[code]
#!/usr/bin/env bash
set -euo pipefail

IMAGE="nvidia/cuda:13.0.2-devel-ubuntu24.04"
CONTAINER_NAME="mmapeak-cuda1302"

# Optional: sanity check
if ! command -v docker >/dev/null 2>&1; then
  echo "ERROR: docker not found in PATH." >&2
  exit 1
fi

echo ">>> Pulling CUDA image: ${IMAGE}"
docker pull "${IMAGE}"

echo ">>> Running container, building mmapeak, and executing it..."

# NOTE: no -t here, only -i, because we are piping a here-doc into stdin.
docker run --rm -i \
  --gpus all \
  --name "${CONTAINER_NAME}" \
  "${IMAGE}" \
  bash -s <<'IN_CONTAINER'
set -euo pipefail
export DEBIAN_FRONTEND=noninteractive

# --- APT helper functions with retries + Berkeley OCF fallback ---

switch_to_berkeley_mirror() {
  local ocf_ports="https://mirrors.ocf.berkeley.edu/ubuntu-ports"
  echo ">>> Switching apt sources to Berkeley OCF ubuntu-ports mirror: ${ocf_ports}"

  if [ -f /etc/apt/sources.list ]; then
    sed -i \
      -e 's|http://ports.ubuntu.com/ubuntu-ports|'"${ocf_ports}"'|g' \
      -e 's|https://ports.ubuntu.com/ubuntu-ports|'"${ocf_ports}"'|g' \
      /etc/apt/sources.list || true
  fi

  if ls /etc/apt/sources.list.d/*.list >/dev/null 2>&1; then
    sed -i \
      -e 's|http://ports.ubuntu.com/ubuntu-ports|'"${ocf_ports}"'|g' \
      -e 's|https://ports.ubuntu.com/ubuntu-ports|'"${ocf_ports}"'|g' \
      /etc/apt/sources.list.d/*.list || true
  fi

  if ls /etc/apt/sources.list.d/*.sources >/dev/null 2>&1; then
    sed -i \
      -e 's|http://ports.ubuntu.com/ubuntu-ports|'"${ocf_ports}"'|g' \
      -e 's|https://ports.ubuntu.com/ubuntu-ports|'"${ocf_ports}"'|g' \
      /etc/apt/sources.list.d/*.sources || true
  fi
}

apt_update_with_retry() {
  local max_tries=3
  local try

  for try in $(seq 1 "${max_tries}"); do
    echo ">>> apt-get update (attempt ${try}/${max_tries}, default mirrors)..."
    if apt-get update; then
      return 0
    fi
    echo ">>> apt-get update failed (attempt ${try}), retrying in 5s..."
    sleep 5
  done

  echo ">>> apt-get update failed with default mirrors. Switching to Berkeley OCF and retrying..."
  switch_to_berkeley_mirror

  for try in $(seq 1 "${max_tries}"); do
    echo ">>> apt-get update (Berkeley mirror, attempt ${try}/${max_tries})..."
    if apt-get update; then
      return 0
    fi
    echo ">>> apt-get update (Berkeley) failed (attempt ${try}), retrying in 5s..."
    sleep 5
  done

  echo ">>> ERROR: apt-get update failed even after switching to Berkeley mirror." >&2
  return 1
}

apt_install_with_retry() {
  local pkgs=("$@")
  local max_tries=3
  local try

  for try in $(seq 1 "${max_tries}"); do
    echo ">>> apt-get install (attempt ${try}/${max_tries}, default mirrors): ${pkgs[*]}"
    if apt-get install -y --no-install-recommends "${pkgs[@]}"; then
      return 0
    fi
    echo ">>> apt-get install failed (attempt ${try}), retrying in 5s..."
    sleep 5
  done

  echo ">>> apt-get install failed with default mirrors. Switching to Berkeley OCF and retrying..."
  switch_to_berkeley_mirror
  apt-get update

  for try in $(seq 1 "${max_tries}"); do
    echo ">>> apt-get install (Berkeley mirror, attempt ${try}/${max_tries}): ${pkgs[*]}"
    if apt-get install -y --no-install-recommends "${pkgs[@]}"; then
      return 0
    fi
    echo ">>> apt-get install (Berkeley) failed (attempt ${try}), retrying in 5s..."
    sleep 5
  done

  echo ">>> ERROR: apt-get install failed even after switching to Berkeley mirror." >&2
  return 1
}

# --- APT setup & dependencies ---

echo ">>> Updating package lists with retry/fallback..."
apt_update_with_retry

echo ">>> Installing build dependencies..."
apt_install_with_retry git build-essential cmake ca-certificates

# --- Clone, build, and run mmapeak ---

echo ">>> Cloning mmapeak..."
git clone https://github.com/ReinForce-II/mmapeak /opt/mmapeak

echo ">>> Building mmapeak with CMake..."
cd /opt/mmapeak
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j"$(nproc)"

echo ">>> Running ./mmapeak ..."
./mmapeak
IN_CONTAINER
[/code]

@alan.dang Thanks for running those raw compute performance measurements – much appreciated.

The results suggest that peak TF32 Tensor Core performance is twice that of FP32 on the non-Tensor/vector units, similar to the former Quadro (now RTX PRO) line and unlike the consumer RTX cards. That is encouraging. The same pattern appears to hold for FP16, BF16 and FP8 with FP32 accumulate.
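For what it's worth, the ratios among the mmapeak figures reported above can be checked directly; these are the thread's measured numbers, not official specifications:

```python
# Measured mmapeak throughputs quoted earlier in the thread (T(fl)ops):
tf32 = 53.3    # mma_tf32tf32f32_16_16_8
bf16 = 212.9   # mma_bf16bf16f32_16_16_16
fp4  = 427.3   # mma_nvf4nvf4f32_16_8_64
print(round(bf16 / tf32, 2))  # ~3.99: BF16 runs at ~4x TF32
print(round(fp4 / bf16, 2))   # ~2.01: FP4 runs at ~2x BF16
```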

Regarding the FP4 Tensor Core numbers: the ~430 TFLOPS you measured is close to the expected dense (non-sparsity) peak of roughly 500 TFLOPS. NVIDIA's advertised 1000 TFLOPS refers to the FP4 peak Tensor Core performance with sparsity.
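On paper, the sparsity relationship is a simple doubling: 2:4 structured sparsity gives Tensor Cores twice the dense peak. A minimal sketch using the approximate figures discussed in this thread (both values are estimates, not official specs):

```python
# 2:4 structured sparsity doubles the on-paper Tensor Core peak:
dense_fp4_tflops = 500.0            # approximate dense FP4 peak (estimate)
sparse_fp4_tflops = 2 * dense_fp4_tflops
print(sparse_fp4_tflops)  # 1000.0, matching the advertised "1000 AI TOPS"
```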

It is also worth keeping in mind that peak performance specifications typically assume boost clocks that can only be sustained briefly. Overall, the compute performance is impressive for such a small form factor (though the system would have been truly exceptional with a higher memory bandwidth).

All of this, of course, would benefit from confirmation. A dedicated Nvidia white paper on the GB10 GPU, similar to the RTX and RTX PRO documents linked below, would remove the guesswork.

https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf
https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/quadro-product-literature/pdf/NVIDIA-RTX-Blackwell-PRO-GPU-Architecture-v1_1.pdf
