CUDA version of STL next_permutation()

Here is a link to my CUDA version of the C++ Standard Template Library’s next_permutation() function.

There are two kernels: one that generates each permutation of n elements, and one that also evaluates each permutation and caches the optimal result (and the permutation which gave that result). The evaluation path finishes with a reduction/scan, which certainly can be improved.
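The post doesn't show how a thread decides which permutation it owns. A common approach for this kind of kernel (an assumption here, not necessarily the method used in the actual code) is to decode each thread's linear index through the factorial number system, so thread k independently produces the k-th lexicographic permutation. A host-side C++ sketch of that per-thread logic:

```cpp
#include <cstdint>
#include <vector>

// Decode permutation number k (0 <= k < n!) into the k-th lexicographic
// permutation of the indices 0..n-1, via the factorial number system.
// In a CUDA kernel, k would be the thread's global linear index and the
// result would be written to that thread's slice of device memory.
std::vector<int> kth_permutation(uint64_t k, int n) {
    std::vector<int> pool(n);               // indices not yet placed
    for (int i = 0; i < n; ++i) pool[i] = i;

    uint64_t fact = 1;                      // starts at (n-1)!
    for (int i = 2; i <= n - 1; ++i) fact *= i;

    std::vector<int> perm;
    perm.reserve(n);
    for (int pos = 0; pos < n; ++pos) {
        uint64_t digit = k / fact;          // which remaining index to take
        k %= fact;
        perm.push_back(pool[digit]);
        pool.erase(pool.begin() + digit);   // O(n) removal; fine for small n
        if (pos < n - 1) fact /= (n - 1 - pos);
    }
    return perm;
}
```

Because every thread computes its permutation from scratch, no thread depends on its neighbor's output, which is what makes the problem embarrassingly parallel, unlike the inherently sequential STL next_permutation().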

I have continued testing with larger permutation counts, from 13! on up. Just generating an array in device memory holding all 13! permutations of indices takes 6,227,020,800 threads * (13 + (13 * 13)) = 1.1333178e+12 steps (iterations).
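The arithmetic above checks out; here is the total-work figure as a quick computation, with the n + n*n per-thread cost taken directly from the numbers quoted:

```cpp
#include <cstdint>

// Total iterations when n! threads each perform n + n*n steps
// (the per-thread cost quoted above for n = 13).
uint64_t total_steps(int n) {
    uint64_t fact = 1;
    for (int i = 2; i <= n; ++i) fact *= i;   // n! threads
    return fact * (uint64_t)(n + n * n);      // steps per thread
}
```

For n = 13 this gives 6,227,020,800 * 182 = 1,133,317,785,600 steps, matching the 1.1333178e+12 figure.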

Generating, in device memory, each unique permutation array holding the indices for all 13! possible permutations took my CUDA kernel 2 seconds. I do not have the time to wait for the CPU STL next_permutation() version to finish, but I know it will take quite a bit longer, at least 100x.
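For reference, the CPU baseline being compared against is just std::next_permutation in a loop. A minimal sketch (with the per-permutation evaluation left as a comment, since that part is problem-specific):

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Visit (and, in a real run, evaluate) every permutation of 0..n-1 on the
// CPU using std::next_permutation -- the serial baseline the GPU kernel
// is being compared against.
uint64_t visit_all_permutations(int n) {
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);   // must start from sorted order
    uint64_t count = 0;
    do {
        ++count;                            // evaluation of idx would go here
    } while (std::next_permutation(idx.begin(), idx.end()));
    return count;                           // n! total visits
}
```

Note that std::next_permutation must walk the permutations one at a time in lexicographic order, so this loop cannot be parallelized directly, which is the motivation for the per-thread decoding approach on the GPU.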

UPDATE: Got a 30-40% speedup; check the latest code.

Using the nvvp profiling output, I managed to get a surprising boost in performance for the generation/evaluation of permuted arrays at larger sizes (13 and up).

I am currently in the process of deleting my GitHub repositories, so if anyone is interested in this problem, now would be the time to take a look.

The new version was able to generate all 16! permutations of a local array, evaluate each permutation, scan/reduce, and return the optimal answer and a permutation responsible for that answer in 17.49 hours.

Not great, given that a similar problem with a larger problem space only took 25 minutes to generate, evaluate, scan, etc. Granted, that problem's evaluation step is easier, but still…

That problem has 33,232,930,569,601 possible arrangements, while 16! is only 20,922,789,888,000.

When I profile the smaller incarnations of the code in nvvp, it comes out very positive (‘no issues’), so I am mystified by the huge time difference. It will take some time to figure this out, so I am going to pull the repository soon.