Detailed Compute Performance Metrics for DGX Spark

I am looking for more technical details about the DGX Spark system, particularly its achievable compute performance such as:

– Peak FP32 TFLOPS (non-Tensor)
– Peak TF32 Tensor TFLOPS with FP32 accumulate (with and without sparsity)
– Peak BF16 Tensor TFLOPS with FP32 accumulate (with and without sparsity)
– Peak FP16 Tensor TFLOPS with FP32 accumulate (with and without sparsity)

NVIDIA has stated that the system can deliver around 1000 FP4 Tensor TFLOPS with sparsity, but I have not been able to find documentation or technical papers that list the performance figures above. So any pointers would be greatly appreciated.

PS: Alternatively, it would be helpful if someone with access to the hardware could run a quick test, for example, using mmapeak (https://github.com/ReinForce-II/mmapeak). The output of deviceQuery would be informative too.
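Until official figures surface, the classic peak-throughput formula can at least bound expectations. Below is a minimal sketch; the SM count, FP32 lanes per SM, and boost clock used in the example are illustrative assumptions, not confirmed GB10 specifications:

```python
# Back-of-envelope peak estimate from deviceQuery-style properties.
# All example inputs are illustrative assumptions, not confirmed GB10 specs.
def peak_fp32_tflops(sm_count, fp32_lanes_per_sm, boost_clock_ghz):
    """Peak non-Tensor FP32 = SMs x FP32 lanes x 2 FLOPs (FMA) x clock."""
    return sm_count * fp32_lanes_per_sm * 2 * boost_clock_ghz / 1e3

# Example: 48 SMs, 128 FP32 lanes/SM (common in recent architectures),
# and a hypothetical 1.7 GHz boost clock:
print(peak_fp32_tflops(48, 128, 1.7))  # ~20.9 TFLOPS
```

The same formula applies to Tensor Core peaks, with the per-SM lane count replaced by the per-SM Tensor Core FLOP rate for the given precision.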

Please check out this post about DGX Spark performance: How NVIDIA DGX Spark’s Performance Enables Intensive AI Tasks | NVIDIA Technical Blog

Thank you for sharing the link – the information is much appreciated.

Unfortunately, this blog post does not address the performance metrics I mentioned earlier. We are evaluating the DGX Spark as a development platform not only for large language models and image generation, but also for proprietary CUDA workloads that combine AI and HPC (e.g., hybrid AI-physics models for robotics). Our plan is to prototype on the DGX Spark before training and deploying models on DGX-class systems.

With that in mind, it would be very helpful to know not only the memory bandwidth (documented as 273 GB/s), but also the peak performance of the non-Tensor cores (FP32) and the Tensor cores (TF32, FP16, BF16) with FP32 accumulate. This information would help us better estimate the performance we can expect for our intended workloads.
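One reason the compute peaks matter alongside the documented bandwidth is the roofline balance point: the arithmetic intensity (FLOPs per byte) above which a kernel becomes compute-bound rather than bandwidth-bound. A quick sketch, where the peak-compute value is a hypothetical placeholder rather than a GB10 spec:

```python
# Roofline balance point: arithmetic intensity (FLOP/byte) at which a
# kernel shifts from bandwidth-bound to compute-bound.
bandwidth_gbs = 273.0   # documented DGX Spark memory bandwidth
peak_tflops = 100.0     # hypothetical peak compute, placeholder value
balance = peak_tflops * 1e12 / (bandwidth_gbs * 1e9)
print(round(balance, 1))  # ~366.3 FLOPs/byte under these assumed numbers
```

Kernels with lower arithmetic intensity than the balance point will be limited by the 273 GB/s bandwidth regardless of the compute peak.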

Again, many thanks in advance for any details you can share.

Running inside a Docker container:
Device 0: NVIDIA GB10
Compute capability: 12.1
Total global memory: 119.7 GiB
Multiprocessor count: 48
Running benchmarks with target time: 3.0 seconds
mma_s4s4s32_8_8_32
run: 2999.5 ms 26.9 T(fl)ops
mma_mxf4mxf4f32_16_8_64
run: 2998.7 ms 427.3 T(fl)ops
mma_nvf4nvf4f32_16_8_64
run: 2999.6 ms 427.3 T(fl)ops
mma_f4f4f16_16_8_32
run: 2999.2 ms 213.8 T(fl)ops
mma_f4f4f32_16_8_32
run: 2999.3 ms 213.7 T(fl)ops
mma_f6f6f16_16_8_32
run: 2999.2 ms 213.9 T(fl)ops
mma_f6f6f32_16_8_32
run: 2999.4 ms 213.7 T(fl)ops
mma_mxf6mxf6f32_16_8_32
run: 2996.3 ms 213.7 T(fl)ops
mma_mxf8mxf8f32_16_8_32
run: 2999.3 ms 213.7 T(fl)ops
mma_f8f8f16_16_8_32
run: 3000.0 ms 213.8 T(fl)ops
mma_f8f8f32_16_8_32
run: 2999.4 ms 213.7 T(fl)ops
mma_s8s8s32_16_16_16
run: 2997.7 ms 215.1 T(fl)ops
mma_s8s8s32_32_8_16
run: 3000.3 ms 215.1 T(fl)ops
mma_f16f16f16_16_16_16
run: 2996.8 ms 213.0 T(fl)ops
mma_f16f16f16_32_8_16
run: 2998.7 ms 213.1 T(fl)ops
mma_f16f16f32_16_16_16
run: 2997.6 ms 212.9 T(fl)ops
mma_f16f16f32_32_8_16
run: 2999.9 ms 212.9 T(fl)ops
mma_bf16bf16f32_16_16_16
run: 3000.3 ms 212.9 T(fl)ops
mma_bf16bf16f32_32_8_16
run: 3000.7 ms 212.9 T(fl)ops
mma_tf32tf32f32_16_16_8
run: 2997.9 ms 53.3 T(fl)ops


The numbers for:
mma_mxf4mxf4f32_16_8_64

mma_nvf4nvf4f32_16_8_64

need to be reviewed, because they do not reflect how NVFP4 is processed; there is a multiplying factor.

That is just the output of the script given by the OP, copied and pasted directly. I basically pulled the cuda13 container, compiled, and pasted the results back in.

The 1000 AI TOPS may be a combination of all the various cores working together, versus a difference between theoretical and true performance.

I used this AI-generated script, based on my preferences and my need for robust mirror retrying. I didn't check the original code provided by the OP or tweak the benchmark settings.

[code]
#!/usr/bin/env bash
set -euo pipefail

IMAGE="nvidia/cuda:13.0.2-devel-ubuntu24.04"
CONTAINER_NAME="mmapeak-cuda1302"

# Optional: sanity check
if ! command -v docker >/dev/null 2>&1; then
  echo "ERROR: docker not found in PATH." >&2
  exit 1
fi

echo ">>> Pulling CUDA image: ${IMAGE}"
docker pull "${IMAGE}"

echo ">>> Running container, building mmapeak, and executing it..."

# NOTE: no -t here, only -i, because we are piping a here-doc into stdin.
docker run --rm -i \
  --gpus all \
  --name "${CONTAINER_NAME}" \
  "${IMAGE}" \
  bash -s <<'IN_CONTAINER'
set -euo pipefail
export DEBIAN_FRONTEND=noninteractive

# --- APT helper functions with retries + Berkeley OCF fallback ---

switch_to_berkeley_mirror() {
  local ocf_ports="https://mirrors.ocf.berkeley.edu/ubuntu-ports"
  echo ">>> Switching apt sources to Berkeley OCF ubuntu-ports mirror: ${ocf_ports}"

  if [ -f /etc/apt/sources.list ]; then
    sed -i \
      -e 's|http://ports.ubuntu.com/ubuntu-ports|'"${ocf_ports}"'|g' \
      -e 's|https://ports.ubuntu.com/ubuntu-ports|'"${ocf_ports}"'|g' \
      /etc/apt/sources.list || true
  fi

  if ls /etc/apt/sources.list.d/*.list >/dev/null 2>&1; then
    sed -i \
      -e 's|http://ports.ubuntu.com/ubuntu-ports|'"${ocf_ports}"'|g' \
      -e 's|https://ports.ubuntu.com/ubuntu-ports|'"${ocf_ports}"'|g' \
      /etc/apt/sources.list.d/*.list || true
  fi

  if ls /etc/apt/sources.list.d/*.sources >/dev/null 2>&1; then
    sed -i \
      -e 's|http://ports.ubuntu.com/ubuntu-ports|'"${ocf_ports}"'|g' \
      -e 's|https://ports.ubuntu.com/ubuntu-ports|'"${ocf_ports}"'|g' \
      /etc/apt/sources.list.d/*.sources || true
  fi
}

apt_update_with_retry() {
  local max_tries=3
  local try

  for try in $(seq 1 "${max_tries}"); do
    echo ">>> apt-get update (attempt ${try}/${max_tries}, default mirrors)..."
    if apt-get update; then
      return 0
    fi
    echo ">>> apt-get update failed (attempt ${try}), retrying in 5s..."
    sleep 5
  done

  echo ">>> apt-get update failed with default mirrors. Switching to Berkeley OCF and retrying..."
  switch_to_berkeley_mirror

  for try in $(seq 1 "${max_tries}"); do
    echo ">>> apt-get update (Berkeley mirror, attempt ${try}/${max_tries})..."
    if apt-get update; then
      return 0
    fi
    echo ">>> apt-get update (Berkeley) failed (attempt ${try}), retrying in 5s..."
    sleep 5
  done

  echo ">>> ERROR: apt-get update failed even after switching to Berkeley mirror." >&2
  return 1
}

apt_install_with_retry() {
  local pkgs=("$@")
  local max_tries=3
  local try

  for try in $(seq 1 "${max_tries}"); do
    echo ">>> apt-get install (attempt ${try}/${max_tries}, default mirrors): ${pkgs[*]}"
    if apt-get install -y --no-install-recommends "${pkgs[@]}"; then
      return 0
    fi
    echo ">>> apt-get install failed (attempt ${try}), retrying in 5s..."
    sleep 5
  done

  echo ">>> apt-get install failed with default mirrors. Switching to Berkeley OCF and retrying..."
  switch_to_berkeley_mirror
  apt-get update

  for try in $(seq 1 "${max_tries}"); do
    echo ">>> apt-get install (Berkeley mirror, attempt ${try}/${max_tries}): ${pkgs[*]}"
    if apt-get install -y --no-install-recommends "${pkgs[@]}"; then
      return 0
    fi
    echo ">>> apt-get install (Berkeley) failed (attempt ${try}), retrying in 5s..."
    sleep 5
  done

  echo ">>> ERROR: apt-get install failed even after switching to Berkeley mirror." >&2
  return 1
}

# --- APT setup & dependencies ---

echo ">>> Updating package lists with retry/fallback..."
apt_update_with_retry

echo ">>> Installing build dependencies..."
apt_install_with_retry git build-essential cmake ca-certificates

# --- Clone, build, and run mmapeak ---

echo ">>> Cloning mmapeak..."
git clone https://github.com/ReinForce-II/mmapeak /opt/mmapeak

echo ">>> Building mmapeak with CMake..."
cd /opt/mmapeak
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j"$(nproc)"

echo ">>> Running ./mmapeak ..."
./mmapeak
IN_CONTAINER
[/code]

@alan.dang Thanks for running those raw compute performance measurements – much appreciated.

The results suggest that peak TF32 Tensor Core performance is twice that of FP32 on the non-Tensor/vector units, similar to the former Quadro (now RTX PRO) line and unlike the consumer RTX cards. That is encouraging. The same pattern appears to hold for FP16, BF16 and FP8 with FP32 accumulate.
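For what it's worth, the ratios among the mmapeak figures reported above can be checked directly; these are the thread's measured numbers, not official specifications:

```python
# Measured mmapeak throughputs quoted earlier in the thread (T(fl)ops):
tf32 = 53.3    # mma_tf32tf32f32_16_16_8
bf16 = 212.9   # mma_bf16bf16f32_16_16_16
fp4  = 427.3   # mma_nvf4nvf4f32_16_8_64
print(round(bf16 / tf32, 2))  # ~3.99: BF16 runs at ~4x TF32
print(round(fp4 / bf16, 2))   # ~2.01: FP4 runs at ~2x BF16
```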

Regarding the FP4 Tensor Core numbers: the ~430 TFLOPS you measured is close to the expected dense (non-sparsity) peak of roughly 500 TFLOPS. NVIDIA's advertised 1000 TFLOPS refers to the FP4 peak Tensor Core performance with sparsity.
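On paper, the sparsity relationship is a simple doubling: 2:4 structured sparsity gives Tensor Cores twice the dense peak. A minimal sketch using the approximate figures discussed in this thread (both values are estimates, not official specs):

```python
# 2:4 structured sparsity doubles the on-paper Tensor Core peak:
dense_fp4_tflops = 500.0            # approximate dense FP4 peak (estimate)
sparse_fp4_tflops = 2 * dense_fp4_tflops
print(sparse_fp4_tflops)  # 1000.0, matching the advertised "1000 AI TOPS"
```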

It is also worth keeping in mind that peak performance specifications typically assume boost clocks that can only be sustained briefly. Overall, the compute performance is impressive for such a small form factor (though the system would have been truly exceptional with a higher memory bandwidth).

All of this, of course, would benefit from confirmation. A dedicated Nvidia white paper on the GB10 GPU, similar to the RTX and RTX PRO documents linked below, would remove the guesswork.

https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf
https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/quadro-product-literature/pdf/NVIDIA-RTX-Blackwell-PRO-GPU-Architecture-v1_1.pdf
