Can CUDA do permutations

KenSeehart · June 10, 2007, 6:27pm

Hi. I’m a newbie.

Suppose I have a floating point vector A and an integer vector B with the same dimension as A.

I would like to do the permutation operation C = A[B]

e.g.:

A = [3.14, 1.618, 2.718]
B = [1, 0, 2] (assuming zero-based indexes)

→ C = [1.618, 3.14, 2.718]

Can this be done (as a high speed vector operation) using CUDA?

Ken Seehart

MisterAnderson42 · June 11, 2007, 12:48am

It depends on how your blocks/threads are set up and who is calculating what. If you can give more details on where you need to apply this operation, we can be more helpful. I will offer the following general suggestions, though.

If the vector is stored per thread, there is no “super instruction” to do the permuting for you. G80 is a scalar architecture. The best way to do this depends highly on the way the rest of your code is organized. I will say this though, doing it in global memory will be very, very slow.

If, however, you have multiple threads working on a vector, then there is a very efficient way to do it. Put the vector in shared memory, then copy it to another location in shared memory based on the index (scatter). Since this is a permutation, every thread reads from (and then writes to) a distinct location in shared memory, so no two threads ever touch the same address. Distinct addresses can still map to the same bank, so some index patterns will cost a few bank conflicts, but the operation stays about as fast as ordinary shared memory reads and writes.
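A minimal sketch of the shared-memory scatter described above, assuming a single block whose thread count equals the vector length. The kernel name `permute_scatter_shared` and the host wrapper `permute_scatter` are hypothetical names chosen for illustration, and the limit of 1024 threads per block applies to current GPUs (G80 allowed 512).

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// One block permutes one vector; thread i owns element i.
// Scatter form: out[B[i]] = in[i], done entirely in shared memory.
__global__ void permute_scatter_shared(const float* A, const int* B,
                                       float* C, int n)
{
    extern __shared__ float smem[];  // 2 * n floats, sized at launch
    float* in  = smem;
    float* out = smem + n;

    int i = threadIdx.x;
    if (i < n) in[i] = A[i];         // coalesced load from global memory
    __syncthreads();

    if (i < n) out[B[i]] = in[i];    // scatter within shared memory
    __syncthreads();                 // all writes to out[] are now visible

    if (i < n) C[i] = out[i];        // coalesced store to global memory
}

// Hypothetical host wrapper (error checking omitted for brevity).
std::vector<float> permute_scatter(const std::vector<float>& A,
                                   const std::vector<int>& B)
{
    int n = static_cast<int>(A.size());
    assert(n > 0 && n <= 1024 && B.size() == A.size());

    float *dA, *dC;
    int* dB;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(int));
    cudaMalloc(&dC, n * sizeof(float));
    cudaMemcpy(dA, A.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    permute_scatter_shared<<<1, n, 2 * n * sizeof(float)>>>(dA, dB, dC, n);

    std::vector<float> C(n);
    cudaMemcpy(C.data(), dC, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return C;
}
```

Because B is a permutation, no two threads write the same element of `out`, so no atomics are needed. Threads of one warp may still hit the same bank at different addresses, in which case the hardware serializes those accesses.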

KenSeehart · June 11, 2007, 4:20am

I’m not sure how much detail you want, but here it goes…

My intent is to implement a neural network trainer/evaluator. The neural networks involved will sometimes be quite large, and many of these connections will be heavily shared. A conventional feed-forward neural network can be implemented entirely with flat dot products and sigmoid functions. However, to implement other network structures, I need something more general.

In the general case, I want to multiply a weight vector by a vector that is an arbitrarily ordered subset of the input vector.

I can take advantage of as much parallelism as is available because there are generally a large number of nodes that can be calculated in parallel.

Anyway I think what you are saying is that permutations would have to be implemented as sequential loops like

for (i = 0; i < n; i++)
    C[i] = A[B[i]];

though in multiple threads?

Assuming that I learn enough about memory handling and threading in a GPU environment, is it reasonable to expect the same kind of speed improvements relative to CPUs that I would get in the case of, say, dot products?

KenSeehart · June 11, 2007, 4:24am

When you say that “G80 is a scalar architecture”, are you saying that the GPU is just like an array of specialized CPUs working in parallel, but with each one actually iterating over vectors sequentially?

wumpus · June 11, 2007, 7:55am

Exactly!

MisterAnderson42 · June 11, 2007, 3:14pm

I can take advantage of as much parallelism as is available because there are generally a large number of nodes that can be calculated in parallel.

Anyway I think what you are saying is that permutations would have to be implemented as sequential loops like
for (i = 0; i < n; i++)
    C[i] = A[B[i]];
though in multiple threads?

With small 3-element vectors, a loop of that type would need to be done in each thread, barring a really clever way to use shared memory. But if you are processing large sets of 3-element vectors, who cares? Just do each vector in parallel.

But you say that you’ll be working on larger vectors, so the problem becomes nicely parallel: set up each thread of the block to run on one element of the vector. Then you skip the for loop and put in threadIdx.x in place of i. If C, A, and B are all in shared memory, this will be a very fast operation. It may require some clever sliding-window technique if your vectors are too large to fit in shared memory all at once, but it shouldn’t be too bad to implement.

i.e. the for loop is replaced with C[threadIdx.x] = A[B[threadIdx.x]]. Sometimes, people keep the for loop there and write “for i in parallel”.

Sure thing. As long as you keep your global memory access patterns fully coalesced, you should get some pretty insane speedups with a neural network code. From what I remember about how neural networks are implemented, this should be doable.
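As an illustration of replacing the for loop with threadIdx.x, here is a sketch that applies one index vector B to a batch of vectors, one block per vector. The names `gather_batched` and `gather_batched_host`, the packed row-major layout of A and C, and the choice to share a single B across all vectors are assumptions made for this example.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Block b handles vector b; thread i computes c[i] = a[B[i]].
__global__ void gather_batched(const float* A, const int* B, float* C, int n)
{
    extern __shared__ float a_s[];               // n floats, sized at launch
    const float* a = A + (size_t)blockIdx.x * n; // this block's input vector
    float*       c = C + (size_t)blockIdx.x * n; // this block's output vector

    int i = threadIdx.x;
    if (i < n) a_s[i] = a[i];                    // coalesced load
    __syncthreads();                             // whole vector is staged

    if (i < n) c[i] = a_s[B[i]];                 // the loop body, with i = threadIdx.x
}

// Hypothetical host wrapper (error checking omitted for brevity).
// A holds count vectors of length n back to back.
std::vector<float> gather_batched_host(const std::vector<float>& A,
                                       const std::vector<int>& B)
{
    int n = static_cast<int>(B.size());
    assert(n > 0 && n <= 1024 && A.size() % n == 0 && !A.empty());
    int count = static_cast<int>(A.size() / n);

    float *dA, *dC;
    int* dB;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dC, A.size() * sizeof(float));
    cudaMalloc(&dB, n * sizeof(int));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    gather_batched<<<count, n, n * sizeof(float)>>>(dA, dB, dC, n);

    std::vector<float> C(A.size());
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return C;
}
```

The global loads and stores are contiguous per block, so only the shared-memory read `a_s[B[i]]` follows the irregular index pattern.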

prkipfer · June 11, 2007, 3:27pm

Right.

@KenSeehart: Be aware, however, of the implications of executing this code in parallel. The permutation has to be an injective function, for example. Some neuronal propagation might not always satisfy this.

Peter
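One way to test the injectivity requirement before launching a permutation kernel is sketched below. It matters most for the scatter form C[B[i]] = A[i], where a repeated index makes two threads race on the same output element; in the gather form C[i] = A[B[i]] a repeated index only reads the same input twice. The names `mark_targets` and `is_permutation` are hypothetical.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread claims slot B[i]. A slot claimed twice, or an index
// outside [0, n), sets the shared error flag.
__global__ void mark_targets(const int* B, int* hits, int* bad, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int j = B[i];
    if (j < 0 || j >= n) {
        atomicExch(bad, 1);                 // out-of-range index
        return;
    }
    if (atomicAdd(&hits[j], 1) != 0)        // previous count was nonzero:
        atomicExch(bad, 1);                 // slot j was already claimed
}

// Hypothetical host wrapper (error checking omitted for brevity).
// n in-range indices with no repeats form a permutation of 0..n-1.
bool is_permutation(const std::vector<int>& B)
{
    int n = static_cast<int>(B.size());
    if (n == 0) return true;

    int *dB, *dHits, *dBad;
    cudaMalloc(&dB, n * sizeof(int));
    cudaMalloc(&dHits, n * sizeof(int));
    cudaMalloc(&dBad, sizeof(int));
    cudaMemcpy(dB, B.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dHits, 0, n * sizeof(int));
    cudaMemset(dBad, 0, sizeof(int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    mark_targets<<<blocks, threads>>>(dB, dHits, dBad, n);

    int bad = 0;
    cudaMemcpy(&bad, dBad, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dB); cudaFree(dHits); cudaFree(dBad);
    return bad == 0;
}
```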

e.ping · May 30, 2008, 12:49am

I am also stuck on a similar problem: trying to implement a permutation on a very large array. C[threadIdx.x] = A[B[threadIdx.x]] only works as long as all the elements of A indexed by B[threadIdx.x] reside in shared memory, which will often not be the case. I was thinking of using a key-value pair sorting technique to do the permutation, by recasting my problem as C[B[i]] = A[i] (instead of the original problem C[i] = A[B’[i]]). B[i] defines the keys and A[i] the corresponding values, so sorting the key-value pairs (B, A) by key automatically gives the required permuted vector. Can anyone share their thoughts on this? Or is there any other way to do permutations efficiently using CUDA? Is there already an efficient CUDA implementation of key-value pair sorting that I can use for implementing permutations?
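The sort-based idea can be sketched with Thrust, which ships with current CUDA toolkits (it did not exist when this question was asked, so treat it as a present-day stand-in). The wrapper name `permute_by_sort` is hypothetical; `thrust::sort_by_key` is the library's key-value sort.

```cuda
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

// Computes C[B[i]] = A[i] by sorting the pairs (B[i], A[i]) by key.
// Assumes B is a permutation of 0..n-1, so after the sort the value
// stored at position k is the A[i] whose key B[i] equals k.
std::vector<float> permute_by_sort(const std::vector<float>& A,
                                   const std::vector<int>& B)
{
    thrust::device_vector<int>   keys(B.begin(), B.end());  // copy B to the GPU
    thrust::device_vector<float> vals(A.begin(), A.end());  // copy A to the GPU

    // Sorts keys in place and applies the same reordering to vals.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    std::vector<float> C(A.size());
    thrust::copy(vals.begin(), vals.end(), C.begin());      // copy back to host
    return C;
}
```

A sort does more work than the permutation strictly requires, since each element's destination is already known. Thrust also provides `thrust::scatter` and `thrust::gather`, which apply an index map directly and are likely the simpler choice when B is available up front.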

halyavin · May 30, 2008, 7:29am

I am also interested in which is faster: C[i] = A[B[i]] or C[B’[i]] = A[i]. Graph algorithms use this kind of operation a lot.

MisterAnderson42 · May 30, 2008, 12:50pm

For large vectors that don’t fit in shared memory, my guess is C[i] = A[B[i]] would be fastest. You can coalesce the writes and perform the reads with a texture:

int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per output element
C[idx] = tex1Dfetch(A_tex, B[idx]);               // coalesced write, cached read through the texture

If values in B are somewhat spatially local, this will operate at the full device memory bandwidth of ~70 GiB/s (8800 GTX/Tesla). If the values in B are completely random, I would still expect ~20-30 GiB/s. Any fancy graph algorithm is going to require reading more bytes and will just slow things down.
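A fuller sketch of the same gather for vectors of any length is shown below. It swaps the texture fetch for `__ldg`, which reads through the read-only data cache on devices of compute capability 3.5 and later; that substitution is an assumption made here because texture references such as A_tex are no longer available in recent CUDA toolkits. The names `gather_ldg` and `gather_large` and the block size of 256 are illustrative choices.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// One thread per output element: coalesced reads of B and writes of C,
// with the irregular reads of A routed through the read-only cache.
__global__ void gather_ldg(const float* __restrict__ A,
                           const int* __restrict__ B,
                           float* __restrict__ C, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                       // the last block may be partly idle
        C[idx] = __ldg(&A[B[idx]]);
}

// Hypothetical host wrapper (error checking omitted for brevity).
std::vector<float> gather_large(const std::vector<float>& A,
                                const std::vector<int>& B)
{
    int n = static_cast<int>(B.size());
    assert(n > 0 && !A.empty());

    float *dA, *dC;
    int* dB;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, n * sizeof(int));
    cudaMalloc(&dC, n * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;   // round up to cover all n
    gather_ldg<<<blocks, threads>>>(dA, dB, dC, n);

    std::vector<float> C(n);
    cudaMemcpy(C.data(), dC, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return C;
}
```

The kernel does not check that the values in B are valid indices into A, so the caller is responsible for that.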

William_Wang · March 30, 2012, 4:20am

Treat your input as a vector, and the permutation then corresponds to a permutation matrix. You can get more information from the Wikipedia article on permutation matrices. The problem then becomes simply a linear algebra transformation.

DrAnderson42 · March 30, 2012, 12:17pm

Did you realize that you revived a 4-year-old thread?
Using a permutation matrix is going to result in a massive number of additional memory accesses compared with simple indirect indexing, and will run much slower as a consequence.
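To make the cost comparison concrete, the permutation can be written out as a matrix product. The symbols P (the permutation matrix) and n (the vector length) are introduced here for illustration, with zero-based indexes as in the original question.

```latex
C = P A, \qquad
P_{ij} =
\begin{cases}
1 & \text{if } j = B_i \\
0 & \text{otherwise}
\end{cases}
\qquad\Longrightarrow\qquad
C_i = \sum_{j=0}^{n-1} P_{ij} A_j = A_{B_i}
```

A dense matrix-vector product reads all n^2 entries of P plus the n entries of A, while indirect indexing reads only the n entries of B and n entries of A. Storing P in a sparse format brings the count back down, but at that point the column indexes of its nonzeros are just B again.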
