Shufle a vector inside a kernel


my question is similar to this one but I don’t really understand how to do it in a real application:

I have a vector with 99 elements and I want to shuffle the elements to produce 1000 to 8000 different permutations.

Given that this process is done as part of another kernel is it more efficient to do it in the host or the device?

And even if it is less efficient,how would I do it in the GPU?

Here is the host code:

void generate_rand_perm(int* array_of_perm,int maxN,int N_matrices)


  srand ( time(NULL) );

  int j;

  for (int matrix=0;matrix<N_matrices;matrix++){

      int* p=array_of_perm+matrix*maxN;

     for (int i = 0; i < maxN; i++) {

    j = rand() % (i + 1);

     p[i] = p[j];

     p[j] = i;


   for (int j=0;j<maxN;j++)




n_matrices is 1000 to 8000 (depending on the kernel)

maxN is 99

Thanks in advance,


If the data is on the GPU, then do the permute on the GPU. This will be three or four orders of magnitude faster than sending it to the CPU.

As I mention in that old thread you linked to, you could make a random-ish function for bijective remapping the integers 0 to 98 to a permutation of 0 to 98.

It depends on how truly random you want that permutation, but here’s one I’m just typing this into the forum without testing, but it should work OK.

/* index is 0 to 98 inclusive, output is the permute for that index */

__device__ int remap99(int index)


  int m=17; // other values may work, but must be relatively prime to 99!

  int c=11; // any value OK

  return (m*index+c)%99; // % mod operator is SLLLOWWW but this 99 mod value is known at compile time. Compiler should change to fast integer ops


__device__ void permute99array(int *array)


  int tid=threadIndex.x;

  int v=array[remap99(tid)]; // scattered reads, but cache will hide this well


  array[tid]=v; // coherent write



Give this strategy a try. Its quality can be strengthened by replacing the cheesy LCG mod 99 with something more robust, but this is a place to start and may be enough for your app.

To get 1000+ different permutations, you need to start parameterizing that remap() function, and probably move past the cheesy LCG.

I don’t know if I were clear on that,I probably wasn’t.
The contents of the vector are numbers 1 to 99 in increasing array[0] is 1,array[1]=2 etc.
So there is no data transfer needed compute the permutations,since they are only permuted indexes.
The algorithm is fisher-yates.
So the question that is left is the following:
Is it more efficient to produce them at the host and transferring them to the GPU or just create them there.I think the second solution is better although I am not sure how much better.
Also,given that the algorithm is part of a bigger machine learning application good quality shuffling is important.
Could you point me anywhere for further reading?
Thanks a lot for your code BTW,i will try it.