A new GPU-accelerated prime sieve using constant-cost structural elimination to overcome memory bandwidth limits at massive scales

Hi everyone,

I’ve developed a new hybrid GPU–CPU prime sieving framework called DAWS (Dual-Wheel Adaptive Wave Sieve).

Traditional GPU-based sieves often hit performance barriers due to memory bandwidth and host-device transfer overhead as $N$ scales. DAWS addresses this by shifting the paradigm from “prime generation” to “structural elimination” within a linear index space.

Key Technical Highlights:

  • Class S (Structural Elimination): Leverages dual-wheel binary masks ($W_1, W_2$) to eliminate a large fraction of composite candidates at a constant GPU cycle cost, regardless of the magnitude (effective even for $N$ up to $10^{1000}$ and beyond).

  • Adaptive Cutoff Control: A dynamic workload partitioning mechanism that shifts remaining work between GPU bulk processing and CPU phase-jump traversal based on real-time candidate density (live ratio).

  • Zero-Copy Strategy: Surviving primes are recovered and compressed (delta–varint encoding) entirely on the GPU. Only the compressed byte stream is transferred to the host, approaching the information-theoretic minimum for data movement.
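Since the paper itself isn't reproduced here, the following is only a minimal CPU-side sketch of two of the mechanisms named above: a pair of residue masks giving a constant-cost composite test, and delta–varint compression of the survivors. The moduli (30 and 7) and all names are my own illustrative choices, not necessarily the $W_1, W_2$ wheels of the DAWS paper.

```python
from math import gcd

# Minimal CPU-side sketch: two residue masks giving a constant-cost
# composite test, then delta-varint compression of the survivors.
# The moduli (30 and 7) and all names are illustrative assumptions,
# not necessarily the W1/W2 wheels of the DAWS paper.

def wheel_mask(modulus):
    """Boolean mask: True at residues coprime to the modulus."""
    return [gcd(r, modulus) == 1 for r in range(modulus)]

W1 = wheel_mask(30)  # eliminates multiples of 2, 3, 5
W2 = wheel_mask(7)   # eliminates multiples of 7

def survives(n):
    """Constant-cost structural test: two table lookups, no trial division.
    Note: the wheel's base primes (2, 3, 5, 7) are eliminated too and must
    be re-added separately, as in any wheel sieve."""
    return W1[n % 30] and W2[n % 7]

def delta_varint(values):
    """Delta-encode a sorted list, then LEB128-style varint-encode each gap."""
    out = bytearray()
    prev = 0
    for v in values:
        gap = v - prev
        prev = v
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

survivors = [n for n in range(2, 1000) if survives(n)]
stream = delta_varint(survivors)  # compact byte stream, as in the zero-copy step
```

Note that "survives" here only means "no factor in {2, 3, 5, 7}": a composite like $121 = 11^2$ still passes, so a later phase must handle such survivors.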

As $N$ increases, the dominant cost shifts from elimination to traversal, and DAWS begins to behave as an “Information Navigator”: it avoids unnecessary calculation by gliding through the “near-vacuum” of composite-free space.

I have published the full theoretical framework and complexity analysis on Zenodo. I would be very interested in hearing your thoughts on the dual-wheel implementation or discussing potential optimizations for large-scale sieving workloads.

Full Paper (Zenodo DOI): Dual-Wheel Adaptive Wave Sieve (DAWS): Structural Elimination and Adaptive Navigation in Large-Scale Prime Sieving

Looking forward to the discussion!


Howdy,

I considered the very same issue many years ago: splitting a prime-sieve job between CPU and GPU. What looked like a good idea at first didn’t work out in practice as far as performance is concerned. I wrote a prime sieve program, and after a lot of trial and error it turned out that the most performant setup is to do the complete sieving job on the GPU, with control and UI on the CPU.

I’m looking forward to the implementation of your concept and its performance.


Hi, thanks for sharing your experience — that matches what I’ve also seen with conventional CPU/GPU split sieves.

I agree that if the GPU is treated as a prime generator, keeping the full sieve on the GPU is usually optimal.

DAWS tries to shift the split point earlier: the GPU performs only constant-cost structural elimination while candidate density is high, and once the density collapses, the remaining phase is no longer “sieving” in the traditional sense but sparse navigation / phase jumps, which tend to map better to CPU control flow.
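For concreteness, here is one possible reading of "phase jumps," sketched under my own assumptions (a mod-30 wheel and a precomputed gap table; the paper's actual jump structure may differ): the CPU hops directly between wheel-coprime positions instead of scanning every integer.

```python
from math import gcd

# One possible reading of "phase-jump traversal", under illustrative
# assumptions (mod-30 wheel, precomputed gap table); the paper's actual
# jump structure may differ. The CPU hops directly between positions
# coprime to the wheel modulus instead of scanning every integer.

MOD = 30
RESIDUES = [r for r in range(1, MOD) if gcd(r, MOD) == 1]  # 1, 7, 11, ..., 29

# gap from each coprime residue to the next coprime position
GAPS = {}
for i, r in enumerate(RESIDUES):
    GAPS[r] = (RESIDUES[(i + 1) % len(RESIDUES)] - r) % MOD

def phase_jump_candidates(lo, hi):
    """Yield positions in [lo, hi) coprime to MOD using gap jumps only.
    These are candidates, not primes: e.g. 49 still appears and must be
    handled by a later phase."""
    n = lo
    while gcd(n, MOD) != 1:  # advance to the first coprime position
        n += 1
    while n < hi:
        yield n
        n += GAPS[n % MOD]
```

The appeal for CPU control flow is that each step is a table lookup and an add, with no per-integer branching over the eliminated space.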

I’m currently working toward a minimal implementation to test whether this regime change actually appears in practice. I’m very interested to see whether your past conclusions still hold under this different assumption.

One clarification: DAWS is intentionally written as a design-level framework rather than a hand-tuned implementation. The intended workflow is to let an AI read the paper and generate environment-specific kernels (GPU-only, hybrid, or phase-switched) based on the actual hardware and density regime.

In that sense, the CPU/GPU boundary is not fixed by the algorithm, but learned or derived per deployment.

I would consider different GPU approaches, which you could combine across the regimes.

E.g. one can store booleans in memory vs. the actual prime candidates as a list; there are GPU approaches for both cases.


Exactly — that’s very much in line with the DAWS view.

The idea is to switch representation with density: bitmask-style elimination while the space is dense, and sparse candidate lists once the regime collapses. One open question is how far this switching can stay GPU-resident before control flow overhead dominates. Thanks for pointing this out.
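A minimal sketch of that switch, assuming a simple live-ratio threshold as the trigger (the actual DAWS cutoff policy isn't specified here): sieve a segment with a dense boolean mask, and fall back to a sparse candidate list once the survivor ratio collapses.

```python
def sieve_segment(lo, hi, small_primes, switch_ratio=0.2):
    """Sieve [lo, hi) with primes from small_primes (must cover sqrt(hi)):
    dense boolean mask while the space is dense, switching to a sparse
    candidate list once the live ratio drops below switch_ratio.
    The threshold value is illustrative, not taken from DAWS."""
    mask = [True] * (hi - lo)                      # dense representation
    for idx, p in enumerate(small_primes):
        start = max(p * p, (lo + p - 1) // p * p)  # first multiple to mark
        for m in range(start, hi, p):
            mask[m - lo] = False
        live = sum(mask) / len(mask)               # "live ratio" after p
        if live < switch_ratio:
            # regime collapse: finish on a sparse candidate list
            survivors = [lo + i for i, alive in enumerate(mask) if alive]
            rest = small_primes[idx + 1:]
            return [n for n in survivors
                    if all(n % q != 0 or n == q for q in rest)]
    return [lo + i for i, alive in enumerate(mask) if alive]
```

Both branches agree on the result; e.g. `sieve_segment(2, 50, [2, 3, 5, 7])` yields the primes below 50 whether or not the switch fires. The open GPU-residency question above would then be whether the sparse branch can stay on-device without divergence cost dominating.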