I have two long arrays in global GPU memory, A and B
A represents a permutation,
and B some data.
I want to permute B according to A.
What is the best way to parallelize and accelerate this on a GPU?
Are there specific CUDA libraries for this?
CLARIFICATION: The data vector is something like [2.1, -3, 4, 5, 6, 2.7, 1, 2, 5.1, 6.7] and the permutation vector something like [1, 5, 2, 7, 3, 9, 0, 4, 8, 6], so the output should be [-3, 2.7, 4, 2, 5, 6.7, 2.1, 6, 5.1, 1]. All vectors, in the end, should reside in global memory. They are very long. I am FLEXIBLE regarding the representation of the permutation. Currently, A[i] directly gives the index in B of the element that ends up at position i of the output. And I am OK with OFFLINE precomputation before the permutation is applied. In fact, the SAME permutation will be applied repeatedly to the same input vector (which in the meanwhile will be changed by some other code) many times.
CLARIFICATION 2: The data vector and the permutation vector can have 100k entries or more, up to millions.
Each element in the data vector can be a single number or, say, a block of 10 numbers that needs to be moved, as a block, to some other location. I do not need to perform the permutation in place; I can write the result to some other array. Either way is fine. I am familiar with the paper http://hgpu.org/?p=10937 . They basically decompose a permutation into a composition of permutations that can be performed efficiently on the GPU. Their method is rather complex and I have not implemented it yet.
But given the number of GPU libraries out there, maybe there is a simpler way to do this? Maybe express the permutation as a sparse permutation matrix and then use cuSPARSE or cuBLAS? Or maybe something else?