New parallel PRNG passing full BigCrush (160/160) on CUDA + Metal – seeking cuRAND technical feedback

Hi all,

I’ve been working on a new parallel pseudo-random number generator (PRNG) designed for high-performance Monte Carlo workloads on GPU and CPU, and I would greatly appreciate feedback from cuRAND / HPC engineers.

Context / setup

- Generator name: `PRNG_MONTMORY_CTACM`

- Architectures tested:

• Apple Silicon GPU (Metal backend)

• NVIDIA A10 (CUDA backend – OVH cloud)

- Test suite: **TestU01 v1.2.3**

- Batteries: **SmallCrush, Crush, BigCrush**

- BigCrush configuration:

• 8192 threads × 16384 draws

• 160 / 160 tests passed

• No anomalies reported

• Runs fully reproducible for each seed tested
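
To make the configuration above concrete: 8192 threads × 16384 draws is 2^13 × 2^14 = 2^27 values per batch. BigCrush consumes far more than that (the TestU01 guide cites roughly 2^38 values), so a harness along these lines would refill the buffer many times. The sketch below is illustrative only and is not the actual harness: it assumes the per-thread outputs are merged round-robin into one stream, and `interleave` and `tag` are hypothetical names. In a C harness, a callback returning these values one at a time could be wrapped with `unif01_CreateExternGenBits` and passed to `bbattery_BigCrush`.

```python
from typing import Callable, Iterator

N_LANES = 8192    # threads, from the configuration above
N_DRAWS = 16384   # draws per thread, from the configuration above


def interleave(draw: Callable[[int, int], int],
               n_lanes: int, n_draws: int) -> Iterator[int]:
    """Yield draws round-robin: step 0 of every lane, then step 1, and so on.

    ASSUMPTION: this merge order is one plausible choice, not the documented one.
    """
    for index in range(n_draws):
        for lane in range(n_lanes):
            yield draw(lane, index)


def tag(lane: int, index: int) -> int:
    """Placeholder 'generator' that only encodes its coordinates,
    so the ordering of the merged stream is visible."""
    return lane * 100 + index


batch_size = N_LANES * N_DRAWS                       # 2**13 * 2**14 = 2**27
stream = list(interleave(tag, n_lanes=3, n_draws=2))

print(batch_size)  # 134217728
print(stream)      # [0, 100, 200, 1, 101, 201]
```

The merge order matters for testing: a round-robin stream exposes correlations between neighbouring lanes, whereas concatenating whole lanes mostly tests each lane in isolation.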

On both platforms (Metal and CUDA), the generator passes *all* SmallCrush / Crush / BigCrush tests, with consistent results across the two architectures.

The design is:

- massively parallel,

- deterministic per lane,

- bit-for-bit reproducible across CPU / Metal / CUDA.
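
For anyone who wants to see what these properties mean in practice, here is a minimal sketch of how they could be checked. It does not use or reveal the actual generator: `mix64` is the public SplitMix64 output function, used purely as a stand-in, and `draw`, `fill` and `digest` are hypothetical helper names. The idea is that each value is a pure function of (seed, lane, index), so the order in which lanes are scheduled cannot change the output, and a hash of the dumped buffer can be compared across CPU, Metal and CUDA runs.

```python
import hashlib
import struct

MASK64 = (1 << 64) - 1


def mix64(x: int) -> int:
    """SplitMix64 output function; a stand-in only, NOT the actual generator."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def draw(seed: int, lane: int, index: int) -> int:
    """Hypothetical per-lane draw: a pure function of (seed, lane, index)."""
    lane_key = mix64(seed ^ mix64(lane))
    return mix64((lane_key + index) & MASK64)


def fill(seed, n_lanes, n_draws, lane_order):
    """Fill a lane-major buffer, visiting lanes in the given order.

    The visiting order models an arbitrary GPU scheduling order.
    """
    buf = [0] * (n_lanes * n_draws)
    for lane in lane_order:
        for index in range(n_draws):
            buf[lane * n_draws + index] = draw(seed, lane, index)
    return buf


def digest(buf):
    """SHA-256 over little-endian 64-bit words, for bit-for-bit comparison."""
    return hashlib.sha256(struct.pack(f"<{len(buf)}Q", *buf)).hexdigest()


forward = fill(seed=42, n_lanes=8, n_draws=16, lane_order=range(8))
shuffled = fill(seed=42, n_lanes=8, n_draws=16,
                lane_order=[5, 2, 7, 0, 3, 6, 1, 4])

# Same digest regardless of the order in which lanes were computed.
print(digest(forward) == digest(shuffled))  # True
```

Comparing one digest per backend is enough to confirm bit-for-bit agreement without exchanging the raw output buffers.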

I’m *not* trying to replace cuRAND or share proprietary code here.

My only goals are to:

1. Get expert feedback on whether this class of generator could be of interest as a future optional engine for cuRAND (or for GPU Monte Carlo workflows), and

2. Find out whether there is a recommended technical contact or process at NVIDIA for discussing PRNG research.

I can provide (privately if needed):

- full BigCrush logs (CUDA + Metal),

- all SmallCrush / Crush logs,

- performance benchmarks vs Philox / XORWOW,

- a minimal reproducible harness.

If this is not the right place for such a topic, any redirection to the appropriate NVIDIA team or contact would be very helpful.

Thanks a lot in advance for your time and guidance.

Pascal