If the data is on the GPU, then do the permute on the GPU. This will be three or four orders of magnitude faster than sending it to the CPU.
As I mention in that old thread you linked to, you could make a random-ish function for bijective remapping the integers 0 to 98 to a permutation of 0 to 98.
It depends on how truly random you want that permutation, but here’s one I’m just typing this into the forum without testing, but it should work OK.
/* index is 0 to 98 inclusive, output is the permute for that index */
__device__ int remap99(int index)
{
int m=17; // other values may work, but must be relatively prime to 99!
int c=11; // any value OK
return (m*index+c)%99; // % mod operator is SLLLOWWW but this 99 mod value is known at compile time. Compiler should change to fast integer ops
}
__device__ void permute99array(int *array)
{
int tid=threadIndex.x;
int v=array[remap99(tid)]; // scattered reads, but cache will hide this well
__syncthreads();
array[tid]=v; // coherent write
__syncthreads();
}
Give this strategy a try. Its quality can be strengthened by replacing the cheesy LCG mod 99 with something more robust, but this is a place to start and may be enough for your app.
To get 1000+ different permutations, you need to start parameterizing that remap() function, and probably move past the cheesy LCG.
I don’t know if I were clear on that,I probably wasn’t.
The contents of the vector are numbers 1 to 99 in increasing order.ie array[0] is 1,array[1]=2 etc.
So there is no data transfer needed compute the permutations,since they are only permuted indexes.
The algorithm is fisher-yates.
So the question that is left is the following:
Is it more efficient to produce them at the host and transferring them to the GPU or just create them there.I think the second solution is better although I am not sure how much better.
Also,given that the algorithm is part of a bigger machine learning application good quality shuffling is important.
Could you point me anywhere for further reading?
Thanks a lot for your code BTW,i will try it.