Also updated this ‘visit all distinct combinations,evaluate each based on some criteria, and cache the optimal value WITH the optimal combination information’ to handle the 64 bit numbers.
Like with the permutations I had to make some adjustments to the main kernel and reduction/scan steps, size of thread blocks etc…
for N=31, number of subsets=2147483648, number of iterations per thread=31,reduction steps in kernel per thread(conservative estimate)=6,iterations in thrust::max_element=4194304
for a total of 2147483648316 + 4194304 = 399,436,152,832 iterations/operations
time for N=31, 0.79 seconds on the GPU for complete time (including all memory copies,allocations etc)
for N=35, number of subsets=34359738368, number of iterations per thread=35, reduction steps in kernel per thread(conservative estimate)=6, iterations in thrust::max_element=67108864
for a total of 34359738368356 + 67108864= 7,215,612,166,144 iterations/operations
time for N=35, 14.4 seconds on the GPU for complete time (including all memory copies,allocations etc)
Remember that the running time is not only the time taken to generate each distinct combination, but also to evaluate each combination against characteristics/dependencies in constant memory, then cache the optimal result and the unique combination which resulted in that optimal result. Then the final reduction step is done to return the single best value and the list of indices which resulted in that value.
I am sure there are a few more optimizations that can be done to get a speed increase, but this is not bad I think… As far as I can tell no other person/group has made a GPU version of this algorithm, but if anyone here can best this in terms of times let me know.
Verified the results to be accurate but did not compare against the CPU version because I just do not have that much free time to wait.