Stockfish CUDA - GPU-Accelerated Chess Engine for DGX SPARK

Stockfish chess engine accelerated with CUDA on DGX SPARK, leveraging 128GB unified memory.

This project accelerates Stockfish 18, the world’s strongest open-source chess engine, using NVIDIA CUDA on DGX SPARK. It leverages the Blackwell GPU architecture and 128GB of unified memory to achieve a 3.3x speedup over the CPU baseline. Key innovations include GPU-accelerated NNUE neural network evaluation using Tensor Cores, unified memory transposition tables of up to 96GB, and batch evaluation of 256 positions at a time. The implementation uses cudaMallocManaged for seamless CPU/GPU memory sharing and WMMA for int8 matrix operations.
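To make the batched int8 evaluation a bit more concrete, here is a minimal CPU-side reference sketch in Python of what one quantized dense layer computes for a batch of 256 positions: int8 inputs times int8 weights, accumulated in wider integers, then shifted and clipped. It is not the project's CUDA/WMMA code. The layer widths, the shift of 6, the clipping range, and the name `dense_int8_batch` are assumptions chosen for illustration; only the batch size of 256 comes from the description above.

```python
import random

BATCH = 256    # batch size quoted in the post
IN_DIM = 32    # hypothetical layer input width
OUT_DIM = 16   # hypothetical layer output width


def dense_int8_batch(inputs, weights, biases, shift=6):
    """Reference int8 dense layer over a batch of positions.

    inputs:  list of rows, each with len(weights[0]) int8-range values
    weights: list of output rows, each with int8-range values
    biases:  one wide-integer bias per output
    shift:   right shift applied before clipping (assumed scaling)
    """
    outputs = []
    for row in inputs:
        out_row = []
        for w_row, bias in zip(weights, biases):
            # Products of int8 values are summed in a wide accumulator,
            # which is what int8 Tensor Core paths do with int32.
            acc = bias + sum(x * w for x, w in zip(row, w_row))
            # Clipped ReLU back into the non-negative int8 range.
            out_row.append(max(0, min(127, acc >> shift)))
        outputs.append(out_row)
    return outputs


rng = random.Random(0)
inputs = [[rng.randint(0, 127) for _ in range(IN_DIM)] for _ in range(BATCH)]
weights = [[rng.randint(-128, 127) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]
biases = [rng.randint(-1000, 1000) for _ in range(OUT_DIM)]

outputs = dense_int8_batch(inputs, weights, biases)
print(len(outputs), len(outputs[0]))
```

On a GPU the two inner loops would map to tiled matrix-multiply fragments instead of scalar loops; a reference like this is mainly useful for checking that a kernel's output matches bit for bit.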

Technologies Used:

  • CUDA 12.0+
  • Unified Memory (cudaMallocManaged)
  • Tensor Cores (WMMA)
  • Blackwell Architecture (SM 12.x)
  • DGX SPARK Platform

Industry/Application:
Artificial Intelligence, Gaming, High Performance Computing, Game Tree Search

Performance Metrics:

  • 400M+ nodes/second with 64 threads
  • 3.3x speedup vs CPU-only
  • 96GB unified memory utilization
  • 119GB total memory available

Links:
GitHub: EquaCoin/stockfish-dgx-spark, release stockfish-cuda-full (109MB, full version)

How did you solve the cache issue? NNUE is so big that it does not fit into L1.

And memory or L2 speed is the limiting factor when parallelizing NNUE evaluations.

On GitHub you write 420M nodes per second. Even if those are incremental updates (instead of complete refreshes of the accumulators) and only 100M nodes per second need an NNUE evaluation (with the others coming from the TT cache), on average you load 4 positional changes + 18 threat changes → 22 * 4 KiB = 88 KiB per evaluation, so you need a memory bandwidth of about 8.8 TB/s.

Or actually 26 KiB, nevertheless.
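The back-of-the-envelope estimate can be written out as a short Python calculation. This is only a sketch of the arithmetic in this post: the 100M evaluations per second, the 4 + 18 changes, and the 4 KiB per change are the figures used above, and reading "26 KiB" as the total bytes loaded per evaluation is an assumption. The function name `required_bandwidth_tb_s` is made up for the example.

```python
KIB = 1024  # bytes per KiB


def required_bandwidth_tb_s(evals_per_second, bytes_per_eval):
    """Memory traffic in TB/s (10**12 bytes/s) for a given evaluation rate."""
    return evals_per_second * bytes_per_eval / 1e12


evals_per_second = 100e6       # nodes/s that actually need an NNUE evaluation
changes_per_eval = 4 + 18      # positional changes + threat changes
bytes_per_change = 4 * KIB     # one accumulator column per change

bytes_88 = changes_per_eval * bytes_per_change  # 88 KiB per evaluation
bytes_26 = 26 * KIB                             # assumed: 26 KiB per evaluation

bw_88 = required_bandwidth_tb_s(evals_per_second, bytes_88)
bw_26 = required_bandwidth_tb_s(evals_per_second, bytes_26)

print(f"88 KiB per evaluation: {bw_88:.2f} TB/s")
print(f"26 KiB per evaluation: {bw_26:.2f} TB/s")
```

With binary KiB the first case comes out at roughly 9 TB/s (8.8 TB/s if a KiB is rounded to 1000 bytes), and the 26 KiB case is still around 2.7 TB/s.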

Or is the NNUE accumulation done on the CPU, with its larger L2 and L3 caches per core and cluster?