Cuda generating combinations.


Could CUDA be capable of calculating every combination of a set of numbers that add up to a given number then performing a calculation on each one? I.e:

Target num: 20
Num Buckets: 5

0, 0, 0, 0, 20
0, 0, 0, 1, 19
0, 0, 0, 2, 18
etc… up to:
19, 1, 0, 0, 0
20, 0, 0, 0, 0

I’m imaginging each thread somehow generates a row, but I’m unsure how I could generate each row in parallel?

Thanks alot.

Well, there is the brute force method. What you have there is a 5-dimensional space with coordinates ranging from 0 to 20. With a little imagination, you should be able to figure out a way to map each thread to a single point in that space. Each thread would essentially “check” to see if the sum totals the proper value and decides a 1 or a 0 and writes that out to global memory. If you need a list of JUST the rows that match sequentially, you can perform a stream compaction on that.

I just woke up, so I don’t currently see any options other than brute force. How would you do this efficiently on the CPU? That sometimes gives insights on what to do on the GPU, though sometimes the mapping to make it data-parallel is challenging.

Can you find a way to enumerate elements of your set? If possible you may be able to then map from an index value to the matching row configuration.

Perhaps it might be more efficient to have a single thread handle a single point in N-1 dimensional space. It could then recompute the sum of those values and find with the value of the Nth dimension is, since only one (if it exists) can actually complete the sum to the desired target.

This smells of a dynamic programming problem, though I don’t see the solution (I think that is the right term, the one that uses a table to turn calculating binomial coefficients from O(2^N) to O(N^2)). Since that is an iterative process, it probably will not map well to the GPU.

Why not something like :

n0 = 20
r0 = tid % n0
t1 = tid/n0

n1 = 20 - r0
r1 = t1%n1
t2 = t1/n1

n2 = 20-r0-r1
r2 = t2%n2
t3 = t2/n2

n3 = 20-r0-r1-r2
r3 = t3%n3

r4 = 20-r0-r1-r2-r3

r0…r4 are the numbers you are looking for.

Ooops, I think I made some errors :

first replace 20 by 21 to get the ri in the range 0…20

And even then , it does not work : you have duplicates (take tid = 20 and tid = 41, you get 20,0,0,0,0 both times.

The answer is more complex :

let P(i, j) be the number of permutation of i elements in j bins (you are looking for all the P(21, 5) permutations in your example)

let S(k, j) be the sum P(0, j)+P(1,j) +… + P(k, j) (and with S(-1, j) = 0 by convention)

r0 is the number that verifies S(r0-1, 5) <= tid < S(r0, 5)

then let t1 = tid-S(r0-1, 5) , and choose r1 : the number that verifies

S(r1-1, 4) <= t1 < S(r1, 4)

and so on with t2 = t1-S(r1-1, 4) …

I think this should work this time, correct me if I am still wrong.

PS : to fasten your computation, you can precompute all the P(i, j) and S(i, j), you have 5*21 = 105 of them.

Hmm, many interesting ideas… I wasn’t sure if it was doable, but perhaps working on a few of these ideas will help. Thanks for your time. :)