Shufle a vector inside a kernel

apostglen46 · April 30, 2012, 10:47pm

Hello,

my question is similar to this one but I don’t really understand how to do it in a real application:The Official NVIDIA Forums | NVIDIA

I have a vector with 99 elements and I want to shuffle the elements to produce 1000 to 8000 different permutations.

Given that this process is done as part of another kernel is it more efficient to do it in the host or the device?

And even if it is less efficient,how would I do it in the GPU?

Here is the host code:

void generate_rand_perm(int* array_of_perm,int maxN,int N_matrices)

{

  srand ( time(NULL) );

  int j;

  for (int matrix=0;matrix<N_matrices;matrix++){

      int* p=array_of_perm+matrix*maxN;

     for (int i = 0; i < maxN; i++) {

    j = rand() % (i + 1);

     p[i] = p[j];

     p[j] = i;

  }

   for (int j=0;j<maxN;j++)

           p[j]=p[j]+1;

}

}

n_matrices is 1000 to 8000 (depending on the kernel)

maxN is 99

Thanks in advance,

Apostolis

SPWorley · May 1, 2012, 1:17pm

If the data is on the GPU, then do the permute on the GPU. This will be three or four orders of magnitude faster than sending it to the CPU.

As I mention in that old thread you linked to, you could make a random-ish function for bijective remapping the integers 0 to 98 to a permutation of 0 to 98.

It depends on how truly random you want that permutation, but here’s one I’m just typing this into the forum without testing, but it should work OK.

/* index is 0 to 98 inclusive, output is the permute for that index */

__device__ int remap99(int index)

{

  int m=17; // other values may work, but must be relatively prime to 99!

  int c=11; // any value OK

  return (m*index+c)%99; // % mod operator is SLLLOWWW but this 99 mod value is known at compile time. Compiler should change to fast integer ops

}

__device__ void permute99array(int *array)

{

  int tid=threadIndex.x;

  int v=array[remap99(tid)]; // scattered reads, but cache will hide this well

  __syncthreads();

  array[tid]=v; // coherent write

  __syncthreads();

}

Give this strategy a try. Its quality can be strengthened by replacing the cheesy LCG mod 99 with something more robust, but this is a place to start and may be enough for your app.

To get 1000+ different permutations, you need to start parameterizing that remap() function, and probably move past the cheesy LCG.

apostglen46 · May 1, 2012, 4:09pm

I don’t know if I were clear on that,I probably wasn’t.
The contents of the vector are numbers 1 to 99 in increasing order.ie array[0] is 1,array[1]=2 etc.
So there is no data transfer needed compute the permutations,since they are only permuted indexes.
The algorithm is fisher-yates.
So the question that is left is the following:
Is it more efficient to produce them at the host and transferring them to the GPU or just create them there.I think the second solution is better although I am not sure how much better.
Also,given that the algorithm is part of a bigger machine learning application good quality shuffling is important.
Could you point me anywhere for further reading?
Thanks a lot for your code BTW,i will try it.

Topic		Replies	Views
Permuting a vector CUDA Programming and Performance	1	964	June 13, 2010
Array Shuffle in CUDA? CUDA Programming and Performance	8	8854	March 31, 2011
Accelerating permutations CUDA Programming and Performance	11	4490	January 16, 2016
Can CUDA do permutations CUDA Programming and Performance	11	12703	March 30, 2012
non recursive permute CUDA Programming and Performance	4	1115	October 13, 2014
Does there exist a shuffle algorithm which enables shuffling elements of a vector on GPU/device? GPU-Accelerated Libraries	1	591	September 8, 2017
Random shuffling an array in parallel Other Tools cuda , kernel	0	555	April 9, 2020
best practise for permutations and combination CUDA Programming and Performance	0	720	March 23, 2011
In place cube reshuffling CUDA Programming and Performance	8	12592	November 15, 2007
CUDA version of STL next_permutation() CUDA Programming and Performance	4	2049	March 4, 2014

Shufle a vector inside a kernel

Related topics