I have not yet seen a GPU version of the standard all-subsets algorithm, so here is my version on GitHub.
The README, with a table of output:
[url]https://github.com/OlegKonings/CUDA_ALL_SUBSETS[/url]
The main .cu file:
[url]https://github.com/OlegKonings/CUDA_ALL_SUBSETS/blob/master/Combo/Combo/combo_main.cu[/url]
It ends up being a simple scan, but is much faster than the single-threaded CPU version. I know there are some on here who are scan/reduction pros, so any input on how to speed it up would be appreciated.
I also made a version which uses atomics, but it was not faster. For larger sets (n > 30 && n < 63) I will have to use 64-bit numbers, and I will post that version soon. I also applied the same idea to go through all permutations, and will post that soon (someone else has already done that, but I think my version is faster).
If anybody actually looks at this, please let me know if I used too many (or too few) __syncthreads() calls.
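For anyone who would rather not open the repo, here is a minimal sketch of the general idea. It is not the code from combo_main.cu. Everything in it is an assumption made for illustration: the names (`bestSubsetPerBlock`, `bestSubsetSumGpu`, `kThreads`), the objective (largest subset sum that does not exceed `limit`), non-negative input values, and 1 <= n <= 30. It also uses a shared-memory tree reduction rather than a scan, since that is the simplest way to show where the barriers go. Error checking on the CUDA calls is left out to keep it short.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // threads per block; must be a power of two

// Each thread treats its global index as a bitmask: bit b set means vals[b]
// is in the subset. Hypothetical objective: largest sum that is <= limit.
__global__ void bestSubsetPerBlock(const int* vals, int n, int limit,
                                   int* blockBest, unsigned* blockMask)
{
    __shared__ int      sBest[kThreads];
    __shared__ unsigned sMask[kThreads];

    const unsigned total = 1u << n;  // number of subsets, assumes n <= 30
    const unsigned mask  = blockIdx.x * blockDim.x + threadIdx.x;

    int sum = -1;                    // -1 marks "no valid subset here"
    if (mask < total) {
        sum = 0;
        for (int b = 0; b < n; ++b)
            if (mask & (1u << b)) sum += vals[b];
        if (sum > limit) sum = -1;
    }
    sBest[threadIdx.x] = sum;
    sMask[threadIdx.x] = mask;
    __syncthreads();  // every slot must be written before anyone reads a neighbour

    // Tree reduction (max). One barrier per step, placed outside the branch
    // so that every thread in the block reaches it.
    for (int stride = kThreads / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride &&
            sBest[threadIdx.x + stride] > sBest[threadIdx.x]) {
            sBest[threadIdx.x] = sBest[threadIdx.x + stride];
            sMask[threadIdx.x] = sMask[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        blockBest[blockIdx.x] = sBest[0];
        blockMask[blockIdx.x] = sMask[0];
    }
}

// Host wrapper: launches one thread per subset, then finishes the reduction
// over the per-block results on the CPU. Returns the best sum and writes the
// winning bitmask to *bestMask.
int bestSubsetSumGpu(const std::vector<int>& vals, int limit, unsigned* bestMask)
{
    const int      n      = static_cast<int>(vals.size());
    const unsigned total  = 1u << n;
    const int      blocks = static_cast<int>((total + kThreads - 1) / kThreads);

    int* dVals = nullptr; int* dBest = nullptr; unsigned* dMask = nullptr;
    cudaMalloc(&dVals, n * sizeof(int));
    cudaMalloc(&dBest, blocks * sizeof(int));
    cudaMalloc(&dMask, blocks * sizeof(unsigned));
    cudaMemcpy(dVals, vals.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    bestSubsetPerBlock<<<blocks, kThreads>>>(dVals, n, limit, dBest, dMask);

    std::vector<int>      hBest(blocks);
    std::vector<unsigned> hMask(blocks);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(hBest.data(), dBest, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(hMask.data(), dMask, blocks * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(dVals); cudaFree(dBest); cudaFree(dMask);

    int best = -1;
    *bestMask = 0;
    for (int i = 0; i < blocks; ++i)
        if (hBest[i] > best) { best = hBest[i]; *bestMask = hMask[i]; }
    return best;
}
```

In this sketch the block must be launched with exactly `kThreads` threads, and the barrier count is one after the initial shared-memory write plus one per reduction step. For n above 30, the mask and the shifts would have to become 64-bit (`unsigned long long` and `1ULL << b`), with the global index computed in 64-bit as well.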