Modulo calculation without %

Hi community,
I’ve found following code in the cuRAND Library (MRG32 generator) that tries to avoid heavy % operation. It something like:

p1 = a12*s1[1] + a13n*(M1 - s1[0]);
p1 = (p1 & 0xffffffff) + (p1 >> 32) * 209;

Could you please help to understand what is the mathematical proof here? How it correlates to the theory?
I’d expect something like

p1 = (a12*s1[1] - a13n*s1[0]) % M1;

This is MRG32k3a in particular, from Pierre L’Ecuyer, “Good Parameters and Implementations for Combined Multiple Recursive Random Number Generators,” Operations Research 47(1):159-164. The original code in the paper specifies:

p1 = a12 * s11 - a13n * s10;
k = p1 / m1;
p1 -= k * m1;

which is basically the modulo computation OP is inquiring about.

The key observation here is that m1 = 4294967087 = 0xFFFFFF2F = (0x100000000 - 209). It’s been too long ago that I worked with MRG32k3a to reliably connect the rest of the dots off the top of my head. I think the basic idea is that

m1 = 232 - x
t0 = (y div 232) * x
t1 = y mod 232
t2 = if (t1 ≥ (m1 - t0)) then m1; else 0
y mod m1 = t0 + t1 - t2

where here we have x = 209. Voila, modulo computation without expensive division and using just one 32-bit multiplication. This works for m1, t0, t1, t2 all unsigned 32-bit integers and unsigned 64-bit y in [0, 0x013991c1ffffffcc]. Since this is a modulo computation, y ≥ 0 is easily assured despite the subtraction in the original code by adding a sufficiently large multiple of m1, which is what we see reflected in the code posted by OP. Depending on specific constraints, further simplifications might apply.

BTW, I wonder where you found this code, because the source code of CURAND is not publicly available to my knowledge. Is this reverse engineered from machine code disassembly?

1 Like

Just adding a few breadcrumbs.

Modulo by a constant (integers, all positive) can be replaced by a division operation, a multiplication, and a subtraction, as njuffa has already shown.

The “difficult” part of that is the division, but when dividing by a constant, there are optimizations possible. As njuffa has shown, if you replace the division by a constant with a multiplication by reciprocal, and choose the numerator and denominator of that reciprocal such that the denominator is a power of 2, then the division operation can be replaced by multiplication plus shift. To make this work, a carefully chosen approximation to the reciprocal may be necessary.

@Robert_Crovella The code provided by the OP suggests that this isn’t using the standard “division by multiplication with the reciprocal” trick. Because p1 is a 64-bit integer, that approach may still not be super cheap on a 32-bit architecture, and would likely require at least three multiply instructions to compute the modulo.

But one can take advantage of the facts that [1] p1, by the algorithm construction, doesn’t actually use the full 64-bit range (if I recall correctly L’Ecuyer made sure it can be represented in a double) [2] the divisor is very close to a convenient power of two. For a 32-bit architecture like the GPU, div 232 and mod 232 applied to a 64-bit integer are basically register moves (which may subsequently be optimized out).

Hi @njuffa,
Thanks for your comments! There were pretty helpful to understand. One thing I couldn’t realize is why y in [0, 0x013991c1ffffffcc]?

Regarding the code. I’ve seen it recently at cuRAND header files, it wasn’t in machine code, just in C++ headers. And I said “This is pretty interesting approach to get rid of heavy % operation. I’m curious how to mathematically prove it”. cuRAND is freely available as part of HPC SDK (cuRAND | NVIDIA Developer).
The same approach is in rocRAND MRG32 implementation (rocRAND/rocrand_mrg32k3a.h at develop · ROCmSoftwarePlatform/rocRAND · GitHub)

I didn’t realize the code was in a publicly visible header file. I helped the CURAND team with some optimizations including those for MRG32k3a, but it was too long ago to remember much about it. I may or may not have had something to do with this optimization. The basic underlying idea has been around in programming circles since at least the mid-1980s (in hashing algorithms and public-key cryptography, if I am not mistaken). Since computer science people are particularly forgetful about the history of their craft, probably way before that.

What I outlined above doesn’t work for arbitrary 64-bit input y because eventually one or several of the intermediate 32-bit variables are going to overflow. Very roughly, we would expect this to happen around 264 / x. The use of 32-bit intermediates is what makes this so useful for a 32-bit architecture like the GPU. In the modified version of the MRG32k3a algorithm shown in your code, p1 does not exceed 0x0021c9a3e454f660 since s1[1] can be at most m1 - 1 and s1[0] is at minimum 0, so the limitation on the magnitude of y is not a problem.

[Later:]

Actually, comparing curand_MRG32k3a in curandkernel.h through a few CUDA versions, this seems to be a fairly recent optimization, so I definitely had no involvement. The version I see in CUDA 11 has a simpler-looking correction step at the end compared to what I showed above. Whether it is actually is faster is unclear because it uses 64-bit arithmetic which is emulated on the GPU.

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.