how to generate n unique random #s or swap atomically

I’m still working on my selection algorithm. I’m trying to choose 16 random samples from an array and swap them with the last 16 elements in the array. There’re 2 ways I see how to do it. But each has its challenge.

  1. Choose 16 unique random #s from 0 to N - 17. Then thread k swaps a random element with element N - 1 - k. None of the swaps share a memory location, hence the swaps don’t need to be atomic. This might result in a biased sample since the last 16 elements have 0 probabilty of being exchanged with another of the last 16.

The main issue is how to generate M unique random ints from 0 to N - 1 without doing rejection sampling. I know 2 unique random ints can be generated efficienty by this:

e0 = rand() % N
e1 = (e0 + 1 + rand() % (N - 1)) % N

This method can be visualized easily, but I can’t think how to extend it to more than 2 random #s.

  1. Choose 16 random #s (not necessarily unique) from 0 to N - 1. Then thread k swaps its random element with element N - 1 - k. Due to overlaps, the swaps have to be atomic.
    I don’t know how to do atomic swaps of memory locations. I was thinking of something like:

temp_a = atomicExc(a, garbage);
temp_b = atomicExch(b, temp_a);
garbage = atomicExch(a, temp_b);

Despite other threads reading garbage, it miraculously works in the end, at least for simple cases. Anyone know how to swap memory atomically without mutual exclusion?

In CUDA, the integer modulus operation is very expensive… much more so than the simple LCG that rand() uses. So be careful about saying even your example of two rands is “efficient.”
If you know N ahead of time you can do a lot of games… like if N is 1000, look at the top 10 bits of the rand (which gives a random number from 0 to 1023) and accepting them if they’re < 1000, and trying again with a new rand() if not.

As for your actual problem, you can model the selection repeatedly selecting from an integer hypergeometric distribution.
The hypergeometric sampling is efficient if N is large compared to the number of samples you draw. It’s a serial sampling algorithm unfortunately.

If the ratio of samples (S) /N is higher than say 0.1, then it’s easier to stream down the list linearly. Accept the first value with probability S/N. Decrement N. If the last sample was accepted, also decrement S. Take the next sample with probability S/N. Repeat until S=0. This gives unbiased selection.
It’s also completely serial, unfortunately.

But if you’re doing this for a parallel selection pivot, it’s possible you don’t need to make every sample independent.
Instead, it is both easier and arguably more useful to do a kind of stratified sampling. Imagine that you need 16 random numbers (no duplicates) from 0 to 255. One way to do it which is easy easy is to pick one value from 0 to 15, one from 16 to 32, one from 33 to 48, and so on.
The math is easier, the parallelism is easier (no dependence on previous samples), and the coding is simpler.
The bias introduced may not matter for your selection partition step… if it does, then use the real sampling methods above.

Right, good call. I don’t need to choose all possible sets of samples with same probability. I see that you’re into physically based rendering. I remember from a course that stratified sampling is the preferred method (for PBRT) for choosing samples for each output pixel when ray tracing to reduce varriance from undersampling. I guess the unintended benefit would be easier ||ization since each pixel has equal # samples.

I’ve tried it by taking 64 samples, sorting them, and choosing elements 27, 32, 37 as my pivots, which results in 6 stages of the selection algorithm for 2^19 elements. This is better than the expected ceil(log4(2^19)) = 10 because I choose the partitions closer to where the median will be, hence discarding more unlikely elements.

Right, but I didn’t intend to do it on the GPU. When I said efficient, it was only 1.25x faster on a CPU compared to rejection sampling when the chance of collision was minimal. The code I mentioned has quite a long dependence chain, which will indeed be slow due to the ~24 cycle delay between dependent instructions.