Array Shuffle in CUDA?

Hey all,

Right now I’m struggling to come up with an algorithm, or any idea at all, for randomly shuffling an array of known size.

__global__ void shuffle( int *a, int n ) {

	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	if (idx >= n) return;

	int random = n * BLAH; // I'm using a Mersenne Twister to generate a random number between 0 and 1

	int old = atomicExch( &a[random], a[idx] );

	a[idx] = old;
}


int main() {

	const int n = 1000;

	int numThreadsPerBlock = 512;

	int numBlocks = (n + numThreadsPerBlock - 1) / numThreadsPerBlock;

	dim3 dimGrid(numBlocks);
	dim3 dimBlock(numThreadsPerBlock);

	int a[n];
	for (int i = 0; i < n; i++) {
		a[i] = i;
	}

	int *a_d;
	cudaMalloc( (void **) &a_d, n * sizeof(int));
	cudaMemcpy( a_d, a, n * sizeof(int), cudaMemcpyHostToDevice);

	shuffle <<< dimGrid, dimBlock >>> ( a_d, n );

	cudaMemcpy( a, a_d, n * sizeof(int), cudaMemcpyDeviceToHost);

	for (int i = 0; i < n; i++) {
		printf("%d\n", a[i]);
	}

	cudaFree(a_d);
	return 0;
}

This, of course, does not work: the array ends up with duplicate values, since the same value of random will inevitably be generated across at least two threads, creating a sort of race condition (I think?).

Are there any implementations anyone can think of? Or am I doing something wrong?


There are tons of ways to do it; which one to use depends a lot on how many items are in your array. With N=1000 like your example, you’re never going to do better than a CPU, and even on the GPU you’ll likely not benefit from more than one block.

But that said, and assuming N is much larger, some strategies:

  1. Assign a PRN per array item, using that as a key. Sort the items by key, using the fast radix sort from Thrust, CUDPP, or the SDK.

  2. Load, say, 1024 values into each block — block 0 gets items 0-1023, and so on — and swap those values randomly inside each block.
    (That swap is a sub-problem to solve, of course.) Finish the kernel, then start again, but this time take a “transpose” slice through the data: block 0 might read items 0-31, 10000-10031, 20000-20031, etc. (you’d do this in binary and make sure to cover all ranges, of course), swap among those in shared memory (same sub-problem routine), then write back. Repeat with a third kernel, using yet another slicing of the space. Depending on the array size, you’d get an effectively uniform permutation after 3 passes through the data, perhaps 4. (The number of iterations needed is logarithmic in the array length.)

  3. Compute a pseudorandom permutation function (a bijective hash function) mapping integers 0-N back to the same integers. Each thread reads its item, computes the new destination via the hash, and writes it into a second array. (Most efficient by far, and easiest to write if you know the math)

If this is a real app (not a programming exercise), I can help you more with #3… it’s something I do a lot of research in, in the context of fast PRNG generation in CUDA. (I am a Monte Carlo math geek.)

Thanks for the reply! Between #1 and #2, I think #1 should suffice and is probably the easier to accomplish, just cuz it’s already implemented in the SDK.

I’m interested in #3, though. I’m fairly new to CUDA, so constructing a bijective hash function seems a little challenging. This project is for research I’m doing in computational biology, so I would say efficiency is very important haha. How would I make the hash bijective? If I understand the definition correctly, making a hash function onto is trivial, but I can’t think of an efficient way to make it one-to-one. Help is needed! haha


EDIT: BTW, the reason I want to implement a CUDA array shuffle over a CPU-based one is not the efficiency of the shuffle per se, but that the time spent cudaMemcpy’ing takes too long for my liking.

Making one isn’t hard… making one that’s random enough is trickier.

Think of a bijective hash as a pseudorandom permutation of the integers.

The simplest useful hash for max size N is something like H(i) = (A*i+B) mod N, where A is chosen so that gcd(A,N)==1.

That’s not the best mixing hash, and it’s far from random, but as a trivial example you can see how it can transform a sequence to mix it up quite a bit.

For N=10, A=3, B=1, you get a sequence of 1 4 7 0 3 6 9 2 5 8 .

Good pseudorandom bijective hashes will do several mixes (using different techniques), often chaining one hash into another into another. This is not necessarily slow since each hash is often trivial.

So really simple hashes like H(i)=i^1 or H(i) = (i<7 ? (i+3)%7 : i) are useless by themselves but great when part of a chain to break up any mathematical groups. If N is fixed, you can pick very efficient mixes. If N is known only at compile time, you need to parameterize them. And if N is known only at runtime, you have to parameterize them and do it EFFICIENTLY.

Dear Steve,
I have implemented a hash(). It takes an unsigned 32-bit int. The inputs tend to be sparse, which puts them near large powers of two. I am trying to hash them into 0…396 (or 0…799). It xors the four bytes of the input together to collapse them; this gives a number 0…255. The second part adds log2 of the input multiplied by the desired range (397) divided by 32. Finally the sum is reduced modulo 397. This works but seems to have a higher than expected overhead.

Log2 is computed via nvcc’s logf(). Surprisingly, nvcc’s log2 appears to be more expensive than logf, but then the whole hash() function’s performance seems to vary a lot with slight changes to the code — perhaps due to the compiler enabling/disabling optimisations.

While inelegant, the sum of a term proportional to log2 and the xor of the four bytes does give few hash clashes. I worried about divergence, but CUDA_PROFILE_LOG shows about 1% divergence per “instruction”, so I guess that is not the problem.

Comments and thoughts most welcome

Your discussion doesn’t seem to relate to the original permutation topic.

I don’t understand what you’re trying to do… you’re trying to build a perfect hash of your sparse data, such that each value maps to one value from 0-396 with no holes?

Well no, sorry, but we did start talking about hashing.

Thanks for the pointer. Apologies again; the hash need not be perfect, just fast enough.

If I have understood “perfect hash” correctly, mine cannot be perfect, since the input range (<2**32) is bigger than the space I have for the hash table. However, I typically don’t have collisions (median = 0, mean = 1.9%), though the worst observed is 48%. I am hoping the worst case is not important, as it occurs in a small dataset (<1/6th the size of the largest).

What I was worrying about is that what I took to be a simple hash function seems to be expensive. I.e., I estimate 32% of the time taken by my kernel is spent in hash() alone. However, it may be overlapping effectively with reads from global memory, so 32% is an underestimate of the time taken.

Running hash() many times in a tight loop on half of a GTX 295, I estimate about 153 million hashes per second. Worryingly, CUDA_PROFILE_LOG suggests only 54 instructions/microsecond.

__device__ int hash(const unsigned int mask, const int Nmask) {

  const int maxhash = Nmask;

  const float s = Nmask / logf(4294967296.0f);  // scales ln() to log2()/32 times the range

  int h = (( mask & 0xff)              ^
	   ((mask & 0xff00)     >>  8) ^
	   ((mask & 0xff0000)   >> 16) ^
	   ((mask & 0xff000000) >> 24)) +
	  (int)(s * logf((float)mask));  // the term proportional to log2 of the input

  while (h >= maxhash) h -= maxhash; // fast modulo, loops 0 or 1 time only

  return h;
}


Let me summarize to make sure I understand your need:

you want a hash function that deterministically and pseudo-randomly maps an unsigned integer into a new range of 0-396. (Why that number? Perhaps that’s your free space in shared memory?)

You want this mapping to be fast. Collisions are OK but you still want to minimize them.

Then try this. The main idea is that integer multiplication is fast and is quite a good mixer by itself… but the best mixing is in the high bits of the product.

__device__ int hash(unsigned int v) {

    v *= 0x27253271u;  // any ODD number with a decent mix of 0 and 1 bits in every byte. OK to try a few different values.

    v >>= 23;  // just the top 9 bits. v is 0 to 511

    if (v >= 397) v -= 397;  // modulo 397 (maxhash)

    return (int)v;
}

That hash is going to be very fast… enough that it will be hard to beat. It’s not the best hash in terms of mixing quality, but for most applications it will do just fine and you don’t want the expense of better mixing.

Shared memory limits me to about 400, so I chose a prime number close to 400. In the second pass I have more memory but bigger datasets, so I chose 800.

I am using the code and constants you suggest, except that with the bigger datasets I shift v by 22 bits rather than 23.

With maxhash=397 I estimate a median of 0.16% clashes and a worst case of 15.9%.

With maxhash=800 I estimate a median of 0.014% clashes and a worst case of 17.6%.

These are about the same as mine.

CUDA_PROFILE_LOG suggests the instructions/microsecond figure has improved, but only to 105, so I guess the kernel is still bogged down with I/O.

I estimate half of a GTX 295 generates 4478 million of your hashes per second — 30 times faster than my original code! (About an 11% reduction in the whole application’s run time.)