It depends on how your blocks/threads are setup and who is calculating what. If you can give more details on where you need to apply this operation, we can be more helpful. I will offer the following general suggestions, though.
If the vector is stored per thread, there is no “super instruction” to do the permuting for you. G80 is a scalar architecture. The best way to do this depends highly on the way the rest of your code is organized. I will say this though, doing it in global memory will be very, very slow.
If however, you have multiple threads working on a vector, then there is a very efficient way to do it. Put the vector in shared memory, then copy it to another location in shared memory based on the index (scatter). Since this is a permutation, every thread will be reading (and then writing) to distinct locations in shared memory and thus there will be no bank conflicts. It will be as fast and efficient as shared memory reads and writes.
I’m not sure how much detail you want, but here it goes…
My intent is to implement a neural network trainer/evaluator. The neural networks involved will sometimes be quite large, and many of these connections will be heavily shared. A conventional feed-forward neural network can be implemented entirely with flat dot products and sigmoid functions. However, to implement other network structures, I need something more general.
In the general case, I want to multiply a weight vector by a vector that is an arbitrarily ordered subset of the input vector.
I can take advantage of as much parallelism as is available because there are generally a large number of nodes that can be calculated in parallel.
Anyway I think what you are saying is that permutations would have to be implemented as sequential loops like
for (i=0; i<n; i++)
C[i] = A[B[i]]
though in multiple threads?
Assuming that I learn enough about memory handling and threading in a GPU environment, is it reasonable to expect the same kind of speed improvements relative to CPUs that I would get in the case of, say, dot products?
With small 3-element vectors, a loop of that type would need to be done in each thread, barring a really clever way to use shared memory. But if you are processing large sets of 3 element vectors, who cares? Just do each vector in parallel.
But you say that you’ll be working on larger vectors, so the problem becomes nicely parallel: Setup each thread of the block to run on an element of the vector. Then you skip the for loop and put in threadIdx.x in place of i. If C, A, and B are all in shared memory this will be a very fast operation. It may require some clever sliding window technique if your vectors are too large to fit in shared mem all at once, but it shouldn’t be too bad to implement.
i.e. the for loop is replaced with C[threadIdx.x] = A[B[threadIdx.x]]. Sometimes, people keep the for loop there and write “for i in parallel”.
Sure thing. As long as you keep your global memory access patterns fully coalesced, you should get some pretty insane speedups with a neural network code. From what I remember about how neural networks are implemented, this should be do-able.
@KenSeehart: Be aware however of the implications of the parallel execution of this code. The permutation has to be an injective function for example. Some neuronal propagation might no always satisfy this.
I am also stuck with the similar kind of problem: trying to implement permutation on a very large array. C[threadIdx.x] = A[B[threadIdx.x]] will only work until all the indexes of A (computed by B[threadIdx.x]) reside in the shared memory; which will often not be the case. I was thinking of using some key-value pair sorting technique
to do the permutation
by modifying my problem as C[B[i]] = A[i]( instead of the original problem C[i] = A[B’[i]]). B[i] will define the keys and A[i] will be the corresponding values. Hence,
sorting key-value pair (B, A) by keys will automatically give the required permuted vector. Can anyone share their thought on this? Or, is there any other way to do the permuations efficiently using CUDA ? Does their already an efficient cuda implementation of key-value pair sorting technique exist which I can use for implementing permutations?
If values in B are somewhat spatially local, this will operate at the full device memory bandwidth of ~70 GiB/s (8800 GTX/Tesla). If the values in B are completely random, I would still expect ~20-30 GiB/s. Any fancy graph algorithm is going to require reading more bytes and will just slow things down.
Treat your input as a vector and the permutation shall correspond to a permutation matrix. You can get more information from http://en.wikipedia.org/wiki/Permutation_matrix. And then, this problem is a simply a Linear algebra transformation problem.