Also updated this ‘visit all distinct combinations,evaluate each based on some criteria, and cache the optimal value WITH the optimal combination information’ to handle the 64 bit numbers.

Like with the permutations I had to make some adjustments to the main kernel and reduction/scan steps, size of thread blocks etc…

for N=31, number of subsets=2147483648, number of iterations per thread=31,reduction steps in kernel per thread(conservative estimate)=6,iterations in thrust::max_element=4194304

for a total of 2147483648*31*6 + 4194304 = 399,436,152,832 iterations/operations

time for N=31, 0.79 seconds on the GPU for complete time (including all memory copies,allocations etc)

for N=35, number of subsets=34359738368, number of iterations per thread=35, reduction steps in kernel per thread(conservative estimate)=6, iterations in thrust::max_element=67108864

for a total of 34359738368*35*6 + 67108864= 7,215,612,166,144 iterations/operations

time for N=35, 14.4 seconds on the GPU for complete time (including all memory copies,allocations etc)

Remember that the running time is not only the time taken to generate each distinct combination, but also to evaluate each combination against characteristics/dependencies in **constant** memory, then cache the optimal result and the unique combination which resulted in that optimal result. Then the final reduction step is done to return the single best value and the list of indices which resulted in that value.

I am sure there are a few more optimizations that can be done to get a speed increase, but this is not bad I think… As far as I can tell no other person/group has made a GPU version of this algorithm, but if anyone here can best this in terms of times let me know.

Verified the results to be accurate but did not compare against the CPU version because I just do not have that much free time to wait.