This is MRG32k3a in particular, from Pierre L’Ecuyer, “Good Parameters and Implementations for Combined Multiple Recursive Random Number Generators,” Operations Research 47(1):159-164. The original code in the paper specifies:
p1 = a12 * s11 - a13n * s10;
k = p1 / m1;
p1 -= k * m1;
which is basically the modulo computation OP is inquiring about.
The key observation here is that m1 = 4294967087 = 0xFFFFFF2F = (0x100000000 - 209). It’s been too long ago that I worked with MRG32k3a to reliably connect the rest of the dots off the top of my head. I think the basic idea is that
m1 = 232 - x
t0 = (y div 232) * x
t1 = y mod 232
t2 = if (t1 ≥ (m1 - t0)) then m1; else 0
y mod m1 = t0 + t1 - t2
where here we have x = 209. Voila, modulo computation without expensive division and using just one 32-bit multiplication. This works for m1, t0, t1, t2 all unsigned 32-bit integers and unsigned 64-bit y in [0, 0x013991c1ffffffcc]. Since this is a modulo computation, y ≥ 0 is easily assured despite the subtraction in the original code by adding a sufficiently large multiple of m1, which is what we see reflected in the code posted by OP. Depending on specific constraints, further simplifications might apply.
BTW, I wonder where you found this code, because the source code of CURAND is not publicly available to my knowledge. Is this reverse engineered from machine code disassembly?
Modulo by a constant (integers, all positive) can be replaced by a division operation, a multiplication, and a subtraction, as njuffa has already shown.
The “difficult” part of that is the division, but when dividing by a constant, there are optimizations possible. As njuffa has shown, if you replace the division by a constant with a multiplication by reciprocal, and choose the numerator and denominator of that reciprocal such that the denominator is a power of 2, then the division operation can be replaced by multiplication plus shift. To make this work, a carefully chosen approximation to the reciprocal may be necessary.
@Robert_Crovella The code provided by the OP suggests that this isn’t using the standard “division by multiplication with the reciprocal” trick. Because p1 is a 64-bit integer, that approach may still not be super cheap on a 32-bit architecture, and would likely require at least three multiply instructions to compute the modulo.
But one can take advantage of the facts that [1] p1, by the algorithm construction, doesn’t actually use the full 64-bit range (if I recall correctly L’Ecuyer made sure it can be represented in a double) [2] the divisor is very close to a convenient power of two. For a 32-bit architecture like the GPU, div 232 and mod 232 applied to a 64-bit integer are basically register moves (which may subsequently be optimized out).
I didn’t realize the code was in a publicly visible header file. I helped the CURAND team with some optimizations including those for MRG32k3a, but it was too long ago to remember much about it. I may or may not have had something to do with this optimization. The basic underlying idea has been around in programming circles since at least the mid-1980s (in hashing algorithms and public-key cryptography, if I am not mistaken). Since computer science people are particularly forgetful about the history of their craft, probably way before that.
What I outlined above doesn’t work for arbitrary 64-bit input y because eventually one or several of the intermediate 32-bit variables are going to overflow. Very roughly, we would expect this to happen around 264 / x. The use of 32-bit intermediates is what makes this so useful for a 32-bit architecture like the GPU. In the modified version of the MRG32k3a algorithm shown in your code, p1 does not exceed 0x0021c9a3e454f660 since s1[1] can be at most m1 - 1 and s1[0] is at minimum 0, so the limitation on the magnitude of y is not a problem.
[Later:]
Actually, comparing curand_MRG32k3a in curandkernel.h through a few CUDA versions, this seems to be a fairly recent optimization, so I definitely had no involvement. The version I see in CUDA 11 has a simpler-looking correction step at the end compared to what I showed above. Whether it is actually is faster is unclear because it uses 64-bit arithmetic which is emulated on the GPU.