CUDA version of visiting all subset combinations

I have not yet seen a GPU version of the standard subset algorithm, so here is my version on Github;

The README with table of output;

The main .cu file;

It ends up being a simple scan, but is much faster than the single-threaded CPU version. I know there are some on here who are scan/reduction pros, so any input on how to speed it up would be appreciated.

I also made a version which uses Atomics, but it was not faster. For larger subsets (n>30 && n<63) I will have to use 64-bit numbers, and I will post that version soon. Also applied this same idea to go through all permutations, and will post that soon(someone else has already done that, but I think my version is faster).

If anybody actually look at this, please let me know if I used too many(or too few) __syncthreads().

Also wondering why for Kepler using 256 threads per block seems to be fastest. Am guessing that is the ‘sweet spot’ in terms of occupancy.

Also updated this ‘visit all distinct combinations,evaluate each based on some criteria, and cache the optimal value WITH the optimal combination information’ to handle the 64 bit numbers.

Like with the permutations I had to make some adjustments to the main kernel and reduction/scan steps, size of thread blocks etc…

for N=31, number of subsets=2147483648, number of iterations per thread=31,reduction steps in kernel per thread(conservative estimate)=6,iterations in thrust::max_element=4194304

for a total of 2147483648316 + 4194304 = 399,436,152,832 iterations/operations

time for N=31, 0.79 seconds on the GPU for complete time (including all memory copies,allocations etc)

for N=35, number of subsets=34359738368, number of iterations per thread=35, reduction steps in kernel per thread(conservative estimate)=6, iterations in thrust::max_element=67108864

for a total of 34359738368356 + 67108864= 7,215,612,166,144 iterations/operations

time for N=35, 14.4 seconds on the GPU for complete time (including all memory copies,allocations etc)

Remember that the running time is not only the time taken to generate each distinct combination, but also to evaluate each combination against characteristics/dependencies in constant memory, then cache the optimal result and the unique combination which resulted in that optimal result. Then the final reduction step is done to return the single best value and the list of indices which resulted in that value.

I am sure there are a few more optimizations that can be done to get a speed increase, but this is not bad I think… As far as I can tell no other person/group has made a GPU version of this algorithm, but if anyone here can best this in terms of times let me know.

Verified the results to be accurate but did not compare against the CPU version because I just do not have that much free time to wait.

One last update when running correctly for N=36 (2^36 possibilities)

for N=36, number of subsets= 68719476736 , number of iterations per thread=36, reduction steps in kernel per thread(conservative estimate)=6, iterations in thrust::max_element= 134217728

for a total of 68719476736366 + 134217728= 14,843,541,192,704 operations

time for N=36, 29.67 seconds on the GPU for complete time(including all memory copies,allocations, etc)

@CudaaduC, you might find this book useful.

It’s full of extreme bit-wizardry algorithms. It’s rather awesome.

[ The author offers the PDF free on his site but I really like the first comment on Amazon. :) ]

Great, thank you!

The hard part is mapping them to the parallel model. I think there will be many ‘hybrid’ solutions in the future which break up the appropriate workloads between the CPU and GPU.

CudaaduC, from what I can tell this is a constrained version of the set cover problem, which is known to be NP-Hard. This means that the only known algorithms that can produce the best answer run in worst-case exponential time O(c^N). In your example, it looks like c is about 2, so doubling the problems size by scaling up from N=35 to N=70 should take 2^35 (34359738368) times longer than 14.4 seconds.

However, heuristics do exist for most of these problems that run much faster than exponential time for many common problems. It’s probable that one of these heuristics would be able to solve a problem with N=200 in less than a second.

In general, people have not had a lot of success running these problems on parallel hardware because the 100x speedup you get over a CPU gets wiped out by the exponential increase in runtime with larger problems. The difference in performance that you get from moving between different heuristics can be exponential (e.g. 2^100 times faster isn’t out of the question), so people typically spend their efforts focusing on the heuristic design rather than looking for the fastest HW.

There is also the problem that the best heuristics typically maintain a shared global database that is updated more or less randomly, making them highly irregular. They also typically involve searching subsets of the problem space in a very deliberate order, making them highly sequential.

I would love to see more work in this area, but make sure that you are familiar with related work before you try. Here be dragons.

You may want to check out some of the video lectures here: (week 5 and later) (everything)

Or if you don’t have a lot of time, at least read these:


You are right about that, but I was just excited about creating a CUDA version of a brute-force subset evaluation algorithm which I had not seen done before. Since I am relatively new to CUDA the best way for me to learn how to convert serial implementations into a parallel was to take ‘top-coder’ like problems and write a GPU version.

This obviously will not scale well, but if N<=36 it is fast enough to be useful. 2^36 is large enough where most CPUs will not be fast enough, but a good GPU will.

As far as dynamic programming goes, I did update my version of the Floyd-Warshall APSP(with path reconstruction) algorithm which is about 48x faster than a good serial CPU version. For highly connected directed graphs this one of the most useful algorithms, but unfortunately is N^3 complexity. Still was able to run on a graph (in adjacency matrix form) of 11,111 vertices in 108 seconds on GPU, which gives you the all the shortest paths, and enumerates the entire start-to-finish path with each partial distance.

Dynamic programming algorithms are my favorite, but in the past I always favored the recursive(with memoization) implementations which are not really practical on GPUs, and that adds to the challenge.

As long as no one objects I will continue to map these practical algorithms to the CUDA architecture. Even if the data set size is limited, they still can be useful.

Don’t let me stop you if you are interested in learning about algorithms and CUDA. I wish that other people would spend time looking at these problems.

I just wanted to make sure that you know that using an existing optimization library will probably perform better.

Edit: If you are really looking for a challenge, try implementing some of the algorithms described in this class in CUDA: