Accelerating permutations

Hello,

I have two long arrays in global GPU memory, A and B.

A represents a permutation.

B holds some data.

I want to permute B according to A.

What is the best way to parallelize and accelerate this using a GPU?

Are there specific CUDA libraries for this?

Thanks,

J

What is the context of this operation? Where is the permutation vector coming from? Use of a permutation vector implies serial dependencies, while the use of the equivalent index vector allows for copious parallelism (but may still be a performance bottleneck due to “random” memory access). Could you create an index vector instead of a permutation vector? Can you convert the permutation vector to an index vector offline?

Since GPUs can perform only a limited number of random memory accesses per cycle, it greatly depends on the element size. For short elements it is more efficient to sort the array using the permutation index as the key. Use the radix sort from Thrust or CUB.
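The sort-by-key idea can be sketched on the host; on the GPU the same operation would be `thrust::sort_by_key` (or CUB's radix sort) over a copy of the index array and the data. A minimal CPU sketch, assuming scatter-style indices (`dest[i]` is the position where element `i` should land); the function name `permuteBySort` is mine, not from any library:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Permute data by sorting element positions by their destination index.
// After the sort, reading data through the sorted order yields the
// permuted array. On the GPU this maps to
//   thrust::sort_by_key(destCopy.begin(), destCopy.end(), data.begin());
// (sort the keys on a copy, since the same permutation is reused later).
std::vector<double> permuteBySort(std::vector<std::size_t> dest,
                                  std::vector<double> data) {
    std::vector<std::size_t> order(data.size());
    std::iota(order.begin(), order.end(), 0);
    // Sort element positions by where each element should go.
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return dest[a] < dest[b]; });
    std::vector<double> out(data.size());
    for (std::size_t i = 0; i < data.size(); ++i)
        out[i] = data[order[i]];
    return out;
}
```

Whether this beats a plain gather depends on element size: the radix sort does more work overall, but its memory traffic is largely sequential, which can win for small elements.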

I think it would help to specify exactly what kind of permutation vector we are considering here. For example, LAPACK’s DGETRF uses a permutation vector which specifies swaps between rows in a matrix as follows:

(5, 6, 3, 4) =>

swap row 1 with row 5
swap row 2 with row 6
swap row 3 with row 3 // no action
swap row 4 with row 4 // no action

This is the kind of permutation vector that creates a serial dependency between row swaps, which can be removed by converting the permutation vector into an index vector that directly gives the final assignment of each source row to each target row.
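That conversion can be done once, offline, on the host. A minimal sketch (the function name `pivotsToIndexVector` is mine, not LAPACK's), assuming 1-based pivots as in DGETRF and producing a 0-based gather index vector:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Convert a LAPACK-style pivot vector (1-based, "swap row i with row
// ipiv[i]", applied sequentially) into a direct index vector idx such
// that out[i] = in[idx[i]] reproduces the same row order in one
// fully parallel gather.
std::vector<std::size_t> pivotsToIndexVector(const std::vector<std::size_t>& ipiv,
                                             std::size_t n) {
    std::vector<std::size_t> idx(n);
    for (std::size_t i = 0; i < n; ++i) idx[i] = i;   // start from identity
    for (std::size_t i = 0; i < ipiv.size(); ++i)
        std::swap(idx[i], idx[ipiv[i] - 1]);          // replay swaps offline
    return idx;
}
```

For the (5, 6, 3, 4) example above on 6 rows this yields the 0-based index vector [4, 5, 2, 3, 0, 1]: output row 0 is source row 4, and so on, with no serial dependency left at apply time.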

CLARIFICATION: The data vector is something like [2.1, -3, 4, 5, 6, 2.7, 1, 2, 5.1, 6.7], the permutation vector something like [1, 5, 2, 7, 3, 9, 0, 4, 8], and the output should be [-3, 2.7, 4, 2, 5, 6.7, 2.1, 6, 5.1]. All vectors, in the end, should reside in global memory. They are very long. I am FLEXIBLE regarding the representation of the permutation. Currently, A directly gives the final assignment of each entry of A. And I am OK with OFFLINE precomputation before the permutation is applied. In fact, the SAME permutation operation will be applied repeatedly to the same input vector (which in the meantime will be changed by some other code) many times.
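For what it's worth, the worked example above follows gather semantics, out[i] = data[perm[i]]: each entry of the permutation vector names the source position of the corresponding output entry. A quick host-side sketch that reproduces the example (the function name `gather` is mine):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gather: output entry i is taken from data[perm[i]].
std::vector<double> gather(const std::vector<double>& data,
                           const std::vector<std::size_t>& perm) {
    std::vector<double> out(perm.size());
    for (std::size_t i = 0; i < perm.size(); ++i)
        out[i] = data[perm[i]];
    return out;
}
```

Running it on the vectors from the clarification reproduces the stated output, so the representation is already a plain gather index vector, which is the parallelism-friendly form.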

What is "very long", and what is the element size? What GPU are you on? How much temporary memory can you use for these operations? Do you need to permute in place, or can you write the results to another array?

Permutation is a very low arithmetic intensity operation. Moreover, it potentially has an irregular memory access pattern on the write side.
There may be special cases, though: for example, if the permutation is segmented, one can look at using intra-warp shuffles or shared memory.

This link may help:

An Optimal Offline Permutation Algorithm on the Hierarchical Memory Machine, with the GPU implementation | hgpu.org

This is different from actually generating and evaluating all N! permutations of an array, as that number gets very large when N >= 15.

When you say a 'large' array, that is interpreted as a size much greater than in your examples.

Neither the discussion here nor the text in the link you provided mentions "generating and evaluating all N! permutations of an array".

CLARIFICATION 2: The data vector and the permutation vector can have 100k entries or even more, up to millions.
Each element in the data vector can be just one number or, say, a block of 10 numbers that need to be moved, as a block, to some other location. I do not need to perform the permutation in place; I can write the result to some other array. Either way is fine. I am familiar with the paper An Optimal Offline Permutation Algorithm on the Hierarchical Memory Machine, with the GPU implementation | hgpu.org. They basically transform a permutation into a composition of permutations that can be performed efficiently on the GPU. Their method is rather complex and I have not implemented it yet.
But given the number of GPU libraries out there, maybe there is a simple way to do this? Maybe express the permutation as a sparse permutation matrix and then use cuSPARSE or cuBLAS?! Or maybe something else?
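On the blocked case: if each logical element is a fixed-size block of, say, 10 numbers, the gather just moves contiguous runs. A host-side sketch of the semantics (`blockGather` is my name for it); on the GPU you would typically assign consecutive threads to consecutive words of a block so that reads and writes stay coalesced within each block:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Blocked gather: logical element i is the contiguous run
// data[perm[i]*blockSize .. perm[i]*blockSize + blockSize - 1],
// copied whole into out[i*blockSize ..].
std::vector<double> blockGather(const std::vector<double>& data,
                                const std::vector<std::size_t>& perm,
                                std::size_t blockSize) {
    std::vector<double> out(perm.size() * blockSize);
    for (std::size_t i = 0; i < perm.size(); ++i)
        for (std::size_t j = 0; j < blockSize; ++j)
            out[i * blockSize + j] = data[perm[i] * blockSize + j];
    return out;
}
```

Larger blocks are actually the friendlier case for the GPU: the accesses are random only at block granularity, so with 10 contiguous numbers per block a good fraction of each memory transaction is useful.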

Then what about the easiest method:

out[i] = input[index[i]]

??
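On the GPU that one-liner is a single kernel, or equivalently a `thrust::gather` call, which implements exactly this. A sketch with the CUDA form shown in the comment (kernel name and launch details are illustrative, not tuned) and a compilable host reference with the same semantics below it:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CUDA version, one output element per thread:
//
//   __global__ void gatherKernel(const float* in, const int* idx,
//                                float* out, int n) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       if (i < n) out[i] = in[idx[i]];
//   }
//
// or, with Thrust:
//   thrust::gather(idx.begin(), idx.end(), in.begin(), out.begin());
//
// Host reference with the same semantics:
std::vector<float> gatherReference(const std::vector<float>& in,
                                   const std::vector<std::size_t>& idx) {
    std::vector<float> out(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i)
        out[i] = in[idx[i]];
    return out;
}
```

The writes are fully coalesced; only the reads are random, which is the better side to have irregular on most GPUs. The profiler can confirm whether those loads are the bottleneck.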

Why not run a quick experiment to see what happens? You can then use the CUDA profiler to see what limits your first-cut code. If your index sequence has a specific structure, you could then contemplate how you might be able to take advantage of that.