Accelerating permutations

Hello,

I have two long arrays in global GPU memory, A and B.

A represents a permutation.

B holds some data.

I want to permute B according to A.

What is the best way to parallelize and accelerate this using a GPU?

Are there specific CUDA libraries for this?

Thanks,

J

What is the context of this operation? Where is the permutation vector coming from? Use of a permutation vector implies serial dependencies, while the use of the equivalent index vector allows for copious parallelism (but may still be a performance bottleneck due to “random” memory access). Could you create an index vector instead of a permutation vector? Can you convert the permutation vector to an index vector offline?

Since GPUs can perform only a limited number of random memory accesses per cycle, it greatly depends on the element size. For short elements it is more efficient to sort the array using the permutation index as the key. Use the radix sort from Thrust or CUB.
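The sort-by-key idea can be sketched on the host; on the GPU the same operation would be `thrust::sort_by_key` (or CUB's radix sort) over a copy of the index array and the data. A minimal CPU sketch, assuming scatter-style indices (`dest[i]` is the position where element `i` should land); the function name `permuteBySort` is mine, not from any library:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Permute data by sorting element positions by their destination index.
// After the sort, reading data through the sorted order yields the
// permuted array. On the GPU this maps to
//   thrust::sort_by_key(destCopy.begin(), destCopy.end(), data.begin());
// (sort the keys on a copy, since the same permutation is reused later).
std::vector<double> permuteBySort(std::vector<std::size_t> dest,
                                  std::vector<double> data) {
    std::vector<std::size_t> order(data.size());
    std::iota(order.begin(), order.end(), 0);
    // Sort element positions by where each element should go.
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return dest[a] < dest[b]; });
    std::vector<double> out(data.size());
    for (std::size_t i = 0; i < data.size(); ++i)
        out[i] = data[order[i]];
    return out;
}
```

Whether this beats a plain gather depends on element size: the radix sort does more work overall, but its memory traffic is largely sequential, which can win for small elements.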

I think it would help to specify exactly what kind of permutation vector we are considering here. For example, LAPACK’s DGETRF uses a permutation vector which specifies swaps between rows in a matrix as follows:

(5, 6, 3, 4) =>

swap row 1 with row 5
swap row 2 with row 6
swap row 3 with row 3 // no action
swap row 4 with row 4 // no action

This is the kind of permutation vector that creates a serial dependency between row swaps, which can be removed by converting the permutation vector into an index vector that directly gives the final assignment of each source row to each target row.
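That conversion can be done once, offline, on the host. A minimal sketch (the function name `pivotsToIndexVector` is mine, not LAPACK's), assuming 1-based pivots as in DGETRF and producing a 0-based gather index vector:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Convert a LAPACK-style pivot vector (1-based, "swap row i with row
// ipiv[i]", applied sequentially) into a direct index vector idx such
// that out[i] = in[idx[i]] reproduces the same row order in one
// fully parallel gather.
std::vector<std::size_t> pivotsToIndexVector(const std::vector<std::size_t>& ipiv,
                                             std::size_t n) {
    std::vector<std::size_t> idx(n);
    for (std::size_t i = 0; i < n; ++i) idx[i] = i;   // start from identity
    for (std::size_t i = 0; i < ipiv.size(); ++i)
        std::swap(idx[i], idx[ipiv[i] - 1]);          // replay swaps offline
    return idx;
}
```

For the (5, 6, 3, 4) example above on 6 rows this yields the 0-based index vector [4, 5, 2, 3, 0, 1]: output row 0 is source row 4, and so on, with no serial dependency left at apply time.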

CLARIFICATION: The data vector is something like [2.1, -3, 4, 5, 6, 2.7, 1, 2, 5.1, 6.7], the permutation vector something like [1, 5, 2, 7, 3, 9, 0, 4, 8], and the output should be [-3, 2.7, 4, 2, 5, 6.7, 2.1, 6, 5.1]. All vectors, in the end, should reside in global memory. They are very long. I am FLEXIBLE regarding the representation of the permutation. Currently, A directly gives the final assignment of each entry of A. And I am OK with OFFLINE precomputation before the permutation is applied. In fact, the SAME permutation operation will be applied repeatedly to the same input vector (which in the meantime will be changed by some other code) many times.
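For what it's worth, the worked example above follows gather semantics, out[i] = data[perm[i]]: each entry of the permutation vector names the source position of the corresponding output entry. A quick host-side sketch that reproduces the example (the function name `gather` is mine):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gather: output entry i is taken from data[perm[i]].
std::vector<double> gather(const std::vector<double>& data,
                           const std::vector<std::size_t>& perm) {
    std::vector<double> out(perm.size());
    for (std::size_t i = 0; i < perm.size(); ++i)
        out[i] = data[perm[i]];
    return out;
}
```

Running it on the vectors from the clarification reproduces the stated output, so the representation is already a plain gather index vector, which is the parallelism-friendly form.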

What is "very long", and what is the element size? What GPU are you on? How much temporary memory can you use for these operations? Do you need to permute in place, or can you write the results to another array?

Permutation is a very low arithmetic intensity operation. Moreover, it potentially has an irregular memory access pattern on the write side.
There may be special cases, though: for example, if the permutation is segmented, one can look at using intra-warp shuffles or shared memory.

This link may help:

An Optimal Offline Permutation Algorithm on the Hierarchical Memory Machine, with the GPU implementation | hgpu.org

This is different from actually generating and evaluating all N! permutations of an array, as that number gets very large when N >= 15.

When you say a 'large' array, that is interpreted as a size much greater than in your examples.

Neither the discussion here nor the text in the link you provided mentions "generating and evaluating all N! permutations of an array".

CLARIFICATION 2: The data vector and the permutation vector can have 100k entries or even more, up to millions.
Each element in the data vector can be just one number or, say, a block of 10 numbers that need to be moved, as a block, to some other location. I do not need to perform the permutation in place; I can write the result to some other array. Either way is fine. I am familiar with the paper An Optimal Offline Permutation Algorithm on the Hierarchical Memory Machine, with the GPU implementation | hgpu.org. They basically transform a permutation into a composition of permutations that can be performed efficiently on the GPU. Their method is rather complex and I have not implemented it yet.
But given the number of GPU libraries out there, maybe there is a simple way to do this? Maybe express the permutation as a sparse permutation matrix and then use cuSPARSE or cuBLAS?! Or maybe something else?
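On the blocked case: if each logical element is a fixed-size block of, say, 10 numbers, the gather just moves contiguous runs. A host-side sketch of the semantics (`blockGather` is my name for it); on the GPU you would typically assign consecutive threads to consecutive words of a block so that reads and writes stay coalesced within each block:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Blocked gather: logical element i is the contiguous run
// data[perm[i]*blockSize .. perm[i]*blockSize + blockSize - 1],
// copied whole into out[i*blockSize ..].
std::vector<double> blockGather(const std::vector<double>& data,
                                const std::vector<std::size_t>& perm,
                                std::size_t blockSize) {
    std::vector<double> out(perm.size() * blockSize);
    for (std::size_t i = 0; i < perm.size(); ++i)
        for (std::size_t j = 0; j < blockSize; ++j)
            out[i * blockSize + j] = data[perm[i] * blockSize + j];
    return out;
}
```

Larger blocks are actually the friendlier case for the GPU: the accesses are random only at block granularity, so with 10 contiguous numbers per block a good fraction of each memory transaction is useful.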

Then what about the easiest method:

out[i] = input[index[i]]

??
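On the GPU that one-liner is a single kernel, or equivalently a `thrust::gather` call, which implements exactly this. A sketch with the CUDA form shown in the comment (kernel name and launch details are illustrative, not tuned) and a compilable host reference with the same semantics below it:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CUDA version, one output element per thread:
//
//   __global__ void gatherKernel(const float* in, const int* idx,
//                                float* out, int n) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       if (i < n) out[i] = in[idx[i]];
//   }
//
// or, with Thrust:
//   thrust::gather(idx.begin(), idx.end(), in.begin(), out.begin());
//
// Host reference with the same semantics:
std::vector<float> gatherReference(const std::vector<float>& in,
                                   const std::vector<std::size_t>& idx) {
    std::vector<float> out(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i)
        out[i] = in[idx[i]];
    return out;
}
```

The writes are fully coalesced; only the reads are random, which is the better side to have irregular on most GPUs. The profiler can confirm whether those loads are the bottleneck.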

Why not run a quick experiment to see what happens? You can then use the CUDA profiler to see what limits your first-cut code. If your index sequence has a specific structure, you could then contemplate how you might be able to take advantage of that.