I am currently in the process of deleting my GitHub repositories, so if anyone is interested in this problem, now would be the time to take a look.

The new version was able to generate all 16! permutations of a local array, evaluate each permutation, scan/reduce, and return the optimal answer along with a permutation responsible for that answer in 17.49 hours.
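For anyone wondering how a thread can produce "its" permutation without any shared state, one common approach is to decode a linear index with the factorial number system. This is only a sketch of that general technique, not necessarily what my repository does; the name `kth_permutation` and the lexicographic ordering are assumptions made for illustration. It is written as plain host C++, on the assumption that the same loop could be moved into a device function.

```cpp
#include <array>
#include <cstdint>

constexpr int N = 16;  // array length, so there are 16! permutations

// Hypothetical helper: decode the k-th permutation (0 <= k < 16!) of
// {0, 1, ..., 15} in lexicographic order using the factorial number system.
std::array<int, N> kth_permutation(std::uint64_t k) {
    // fact[i] = i!; 15! = 1,307,674,368,000 fits easily in 64 bits.
    std::uint64_t fact[N];
    fact[0] = 1;
    for (int i = 1; i < N; ++i) fact[i] = fact[i - 1] * i;

    // Pool of values not yet placed, kept in ascending order.
    std::array<int, N> pool;
    for (int i = 0; i < N; ++i) pool[i] = i;

    std::array<int, N> perm;
    int remaining = N;
    for (int pos = 0; pos < N; ++pos) {
        // The leading factorial digit selects which unused value goes here.
        std::uint64_t f = fact[N - 1 - pos];
        int idx = static_cast<int>(k / f);
        k %= f;
        perm[pos] = pool[idx];
        // Remove the chosen value, preserving the order of the rest.
        for (int j = idx; j < remaining - 1; ++j) pool[j] = pool[j + 1];
        --remaining;
    }
    return perm;
}
```

With this scheme index 0 decodes to the identity ordering and index 16! - 1 to the fully reversed one, so each thread only needs its own 64-bit index to know which arrangement to evaluate.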

Not great given that a similar problem with a larger problem space:

https://github.com/OlegKonings/CUDA_Matrix_Sum_Game

only took 25 minutes to generate, evaluate, scan, etc. Granted, the evaluation step there is easier, but still…

That problem has 33,232,930,569,601 possible arrangements, while 16! is only 20,922,789,888,000.
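To put the gap in per-second terms, here is a small sketch of the arithmetic using only the figures quoted above. The function and constant names are made up for illustration, and the 25 minute figure is approximate, so treat the result as a rough estimate.

```cpp
#include <cstdint>

// Candidates evaluated per second over a full brute-force run.
double throughput_per_second(std::uint64_t candidates, double seconds) {
    return static_cast<double>(candidates) / seconds;
}

// Figures quoted in this post.
constexpr std::uint64_t kPermutationCount = 20922789888000ULL;  // 16!
constexpr std::uint64_t kMatrixGameCount  = 33232930569601ULL;  // other problem
constexpr double kPermutationSeconds = 17.49 * 3600.0;          // 17.49 hours
constexpr double kMatrixGameSeconds  = 25.0 * 60.0;             // about 25 minutes

// How many times faster the matrix sum game run went through its candidates.
double throughput_ratio() {
    return throughput_per_second(kMatrixGameCount, kMatrixGameSeconds) /
           throughput_per_second(kPermutationCount, kPermutationSeconds);
}
```

That works out to roughly 330 million permutations per second against roughly 22 billion arrangements per second, a gap of about 67x.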

When I profile the smaller incarnations of the code in nvvp, the results come out very positive (‘no issues’), so I am mystified by the huge time difference. It will take some time to figure this out, so I am going to pull the repository soon.