Puzzle Solver Algorithms

Hi All

I’m learning CUDA and I thought that a fun project to help me learn along the way would be to create an Eternity II puzzle solver (for those of you who don’t know about Eternity II, wiki link -> http://en.wikipedia.org/wiki/Eternity_II_puzzle)

What I’m looking for from the board are some ideas for algorithms that would best utilise my GPU.

I have a benchmark to beat: a single-core CPU solver written in C++ that can solve a 6x6 test file in 32 ms and find all 160 solutions in less than 4 seconds.

Hope to see some good ideas.

Post that CPU code then, and maybe someone will adapt it to CUDA.

This type of algorithm is not the best starting point for CUDA. Start with the standard scans/reductions first, as any brute-force solution will involve those steps (to store optimal results).

Just do not expect others to do your work.
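To make the scan/reduction suggestion concrete, here is a minimal CPU sketch of the pairwise tree reduction pattern; the same shape maps directly onto a CUDA block reduction, where at each step half the threads combine two partial results (the function name is just for illustration):

```cpp
#include <vector>
#include <algorithm>

// Pairwise tree reduction over a vector. In CUDA the inner loop body is
// what each thread does at a given stride; here we do it sequentially.
int reduce_max(std::vector<int> v) {
    for (size_t stride = 1; stride < v.size(); stride *= 2)
        for (size_t i = 0; i + stride < v.size(); i += 2 * stride)
            v[i] = std::max(v[i], v[i + stride]);  // "thread" i combines i and i+stride
    return v.empty() ? 0 : v[0];
}
```

The log2(n) stride doubling is exactly the structure a shared-memory block reduction uses to store an optimal result per block.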

Hi CudaaduC

Link to the code and tile files https://copy.com/bPha1Prmxd0mF4Cj

My initial idea is a tree traversal as inspired by --> http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-ii-tree-traversal-gpu/ I just have to get my head around on how to adapt it to this problem.

I’m not looking for others to do the work, just for ideas on how a problem like this could be approached in CUDA. As you have pointed out, the backtracking algorithms that brute-force this type of problem are not really suited to the GPU.

My other idea was to maybe compute all possible permutations of the tiles on the GPU, if possible, then validate the combinations, again on the GPU, but I think I would run into memory issues here, as even a small 6x6 has around ~68719 million (Edit: don’t know where my maths was last night, but that should be ~3.72 × 10^41, i.e. 36!) combinations.
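As a back-of-envelope check of that figure: 36 tiles can be ordered in 36! ≈ 3.72 × 10^41 ways, and each tile also has 4 rotations, which multiplies in another 4^36. A quick sketch, working in log10 to avoid overflow (the helper name is hypothetical):

```cpp
#include <cmath>

// Rough size of the raw search space for an n x n puzzle:
// (n^2)! orderings times 4^(n^2) rotations, returned as log10.
double log10_arrangements(int n) {
    int tiles = n * n;
    double lg = 0.0;
    for (int k = 2; k <= tiles; ++k)
        lg += std::log10((double)k);      // accumulates log10((n^2)!)
    lg += tiles * std::log10(4.0);        // times 4 rotations per tile
    return lg;
}
```

For n = 6 this gives about 10^63 raw arrangements, which is why storing or enumerating them all is out of the question and pruning is essential.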


Why is all the code in a .h file?

The general procedure for a brute force problem on a GPU is to have each thread in a block explore some section of the problem space. Then there must be an update in the global space of an optimal answer and the arrangement responsible for that optimal answer. This can be done via atomic operations or via a block-wise scan-reduction.
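A rough CPU analogy of that pattern, assuming a scoring function over candidate indices: each thread walks a strided slice of the space and publishes its best via a compare-exchange loop, much like an atomicMax on a global score in CUDA (the function and its signature are illustrative, not from the thread):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each "thread" scores a grid-stride slice of [0, space_size) and
// races to publish the best score via an atomic compare-exchange.
int parallel_best(int space_size, int num_threads, int (*score)(int)) {
    std::atomic<int> best{-1};
    std::vector<std::thread> pool;
    for (int t = 0; t < num_threads; ++t)
        pool.emplace_back([&, t] {
            for (int i = t; i < space_size; i += num_threads) {
                int s = score(i);
                int cur = best.load();
                // Retry until our score is published or beaten.
                while (s > cur && !best.compare_exchange_weak(cur, s)) {}
            }
        });
    for (auto& th : pool) th.join();
    return best.load();
}
```

On the GPU one would usually reduce within a block first and only do one atomic per block, to keep contention on the global result low.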

For example, I have some permutation code which generates all distinct permutations of array indices, evaluates them, and caches the best answer(s).

Using a GTX 780ti it takes about 8 seconds total to generate all 13! permutations of an array, and it takes another 2 seconds to evaluate all permutations and cache/scan/reduce the optimal answer and permutation responsible for that answer (total 10 seconds).

For 14! it takes 126 seconds for just the generation and 163 seconds total for adding the cache/scan/reduce portion. Once you get to about 16! you are looking at about 10 hours, so anything beyond that will need multiple GPUs.
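One common way to generate permutations independently per thread (not necessarily what the code above does) is the factorial number system: each thread decodes its global index straight into the corresponding permutation, with no communication. A sketch:

```cpp
#include <vector>

// Decode permutation #idx (lexicographic order) of {0..n-1} using the
// factorial number system, so each GPU thread can derive its own
// permutation directly from its global index.
std::vector<int> kth_permutation(int n, long long idx) {
    std::vector<int> pool, out;
    for (int i = 0; i < n; ++i) pool.push_back(i);
    long long fact = 1;
    for (int i = 2; i < n; ++i) fact *= i;         // (n-1)!
    for (int i = n - 1; i >= 1; --i) {
        int d = (int)(idx / fact);                  // pick among remaining elements
        idx %= fact;
        fact /= i;
        out.push_back(pool[d]);
        pool.erase(pool.begin() + d);
    }
    out.push_back(pool[0]);
    return out;
}
```

With 13! ≈ 6.2 × 10^9 indices this fits a 64-bit counter easily, which matches the timings quoted above.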

It is important to understand the difference between permutations, combinations (subsets), and combinations with repetitions.

If you are ‘brute forcing’ a password, for example, that is NOT a permutation problem, because there can be repeated values in each location. That type actually takes less time to generate on a GPU (than permutations) because it is easier to derive the n-length candidate: you do not need to enforce the ‘no repetition’ aspect. For example, one can generate, evaluate, scan, and reduce 7^16 = 33,232,930,569,601 possibilities of a 16-character password (with only 7 possible values for each spot) in about 25 minutes on a Tesla K40, and probably under 20 minutes on a GTX 780ti (not tested yet).
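Deriving a with-repetition candidate really is just writing the thread's index in base `alphabet.size()`, a few divisions and modulos per thread. A sketch (names are illustrative):

```cpp
#include <string>

// Candidate #idx is idx written in base alphabet.size(), padded to
// `length` digits -- far cheaper per thread than a no-repeat permutation.
std::string kth_candidate(const std::string& alphabet, int length,
                          unsigned long long idx) {
    std::string out(length, alphabet[0]);
    for (int pos = length - 1; pos >= 0; --pos) {
        out[pos] = alphabet[idx % alphabet.size()]; // least significant digit last
        idx /= alphabet.size();
    }
    return out;
}
```

Note 7^16 ≈ 3.3 × 10^13 still fits comfortably in a 64-bit index.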

For ~3.72 × 10^41 you are going to need at least 8 GPUs to exhaustively generate/evaluate that many possibilities in a reasonable time. On a multicore CPU it could take years.

So ultimately you are going to need some heuristic to reduce the problem space, and I am sure that is how most people approached the problem.
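The usual heuristic is the one the CPU benchmark presumably already uses: place tiles in order and prune any placement whose shared edges do not match, so whole subtrees vanish early. A minimal sketch (rotations and borders omitted for brevity; `Tile` and `count_solutions` are hypothetical names):

```cpp
#include <vector>

// tile = {top, right, bottom, left} edge colours. Tiles are placed
// row-major; a tile is tried only if its left and top edges match the
// neighbours already on the board -- this pruning is what makes the
// sequential solver fast, and what is hard to keep on a wide GPU.
struct Tile { int t, r, b, l; };

static void solve(int n, const std::vector<Tile>& tiles,
                  std::vector<int>& board, std::vector<bool>& used,
                  long long& count) {
    int pos = (int)board.size();
    if (pos == n * n) { ++count; return; }
    for (int i = 0; i < (int)tiles.size(); ++i) {
        if (used[i]) continue;
        if (pos % n != 0 && tiles[board[pos - 1]].r != tiles[i].l) continue; // left
        if (pos >= n && tiles[board[pos - n]].b != tiles[i].t) continue;     // top
        used[i] = true; board.push_back(i);
        solve(n, tiles, board, used, count);
        board.pop_back(); used[i] = false;
    }
}

long long count_solutions(int n, const std::vector<Tile>& tiles) {
    std::vector<int> board;
    std::vector<bool> used(tiles.size(), false);
    long long count = 0;
    solve(n, tiles, board, used, count);
    return count;
}
```

One GPU-friendly compromise is to enumerate short prefixes (say, the first row) breadth-first on the host and hand each surviving prefix to a block.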

As CudaaduC also infers, I would like to think that an algorithm implemented serially (CPU) and the same algorithm implemented in parallel (GPU) would more or less still share the same core premises/functionality.

Thus, you have the option of taking the existing CPU algorithm and rewriting it for a parallel framework (GPU).
Or, you could come up with a completely new/different algorithm, and implement that, either for a GPU or a CPU.
It seems you really wish to attempt the latter.
As CudaaduC also suggests, perhaps the former is the better choice for now.
Distill the pseudo code/functionality of the CPU algorithm already at your disposal, and then rewrite that for the GPU.
By the time you are done, you would not only be more familiar with CUDA, but also with the problem itself and how it is solved, such that further improving/advancing the algorithm would come naturally.
You would also find it much easier to get help with implementing individual parts of the functionality of the existing CPU algorithm.
For now, if speed is important to you, perhaps you should run with the premise that anything a CPU can do, a GPU can do faster; in many cases that is not far fetched, implying that the same algorithm rewritten for a GPU can already be faster than its CPU equivalent.

Otherwise, exploit the fact that a) any puzzle solution is eventually made up of sub-solutions (each sub-block within the puzzle constitutes a sub-solution on its own), and that b) any 2 puzzle pieces must match to be put together.
You would then build a solution up by noting all the pieces that can go together, which paired pieces put together can constitute a sub-block solution, and which sub-block solutions together can form the overall solution.
I think you can have an enormous number of puzzle piece combinations, but not that many puzzle pieces can be paired, and repeatedly paired.
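One way to start on "noting all the pieces that can go together" is to index pieces by edge colour, so that finding every piece that can sit to the right of a given piece becomes a table lookup instead of a scan (a sketch; the function name and edge ordering {top, right, bottom, left} are assumptions):

```cpp
#include <array>
#include <map>
#include <vector>

// Map each left-edge colour to the list of piece indices carrying it.
// Pieces that can attach to the right of piece p are then
// by_left[pieces[p][1]] (those whose left colour equals p's right colour).
std::map<int, std::vector<int>>
index_by_left_edge(const std::vector<std::array<int, 4>>& pieces) {
    std::map<int, std::vector<int>> by_left;
    for (int i = 0; i < (int)pieces.size(); ++i)
        by_left[pieces[i][3]].push_back(i);   // key: the piece's left colour
    return by_left;
}
```

The same index, built per rotation, is also a natural candidate list for each cell in a backtracking search.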

“but not that many puzzle pieces can be paired, and repeatedly paired”

In fact, in this puzzle, yes: every piece has about 40 possible matches on the right, 40 on the left, and 40 above.
So a 4x4 has something like 768 × 40 × 40 × 40 combinations.

I was under the impression that:

“Thus, each piece can be used in only 4 orientations. There are 22 colours, not including the gray edges. Five of those can only be found on border and corner pieces and 17 only on so called inner pieces and the side of the border piece across from the gray colour.”

If a piece only has 4 colours (1 per edge) and there are 17 possible colours, how can that many pieces be paired?
Furthermore, you would probably need the statistics on how the possible colours are distributed among the pieces to truly know how many pieces can be paired.
If I merely look at the image of the puzzle, I can see that not all pieces can be paired.
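A rough sanity check of the "~40 matches" figure is possible without the full colour statistics: Eternity II has 256 pieces, of which 60 are border/corner pieces, leaving 196 inner pieces with 4 edges each, spread over 17 inner colours. Assuming the colours are roughly uniformly distributed (an assumption, not a fact from the thread):

```cpp
// Average number of edges carrying a given colour, assuming a uniform
// distribution of colours over edges: inner_pieces * edges_per_piece / colours.
int expected_matches_per_colour(int inner_pieces, int edges_per_piece, int colours) {
    return inner_pieces * edges_per_piece / colours;
}
```

For 196 pieces, 4 edges, and 17 colours this gives about 46 edges per colour, which is at least in the same ballpark as cricri1's ~40, so both views can be right: most pieces pair with some pieces, but no piece pairs with most pieces.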

cricri1 just brought to my attention that you also have the border colour criterion and centre piece criterion to exploit further.

Also, do you think the CPU algorithm can be multi-threaded?
A CPU can really only run 8 (or 16) threads concurrently, if the threads are not memory constrained (based on the assumption that any additional threads would not make an individual thread do its work faster, if that particular thread is not memory constrained).
Now, if you run an instance per block, I would presume that a GPU can run 16 × 15 instances concurrently (the number of blocks allowed per SM, multiplied by the number of SMs).
The CPU/GPU clock frequencies differ, and the comparison is likely more complex than that, but still…
So perhaps you can have the GPU beat the CPU merely by running more ‘threads’.
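One straightforward way to multi-thread the existing solver, sketched here on the CPU, is to partition the choices for the first cell across threads and run the same sequential backtracking below each fixed prefix; the identical partitioning maps onto GPU blocks. `explore` is a stand-in for the real subtree search, not code from the thread:

```cpp
#include <thread>
#include <vector>
#include <numeric>

// Each thread takes a disjoint, strided set of starting tiles for cell 0
// and counts the solutions in those subtrees; partial counts are summed
// at the end (each slot is written by exactly one thread, so no locking).
long long parallel_count(int num_tiles, int num_threads,
                         long long (*explore)(int)) {
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < num_threads; ++t)
        pool.emplace_back([&, t] {
            for (int s = t; s < num_tiles; s += num_threads)
                partial[t] += explore(s);   // subtree rooted at starting tile s
        });
    for (auto& th : pool) th.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

The main caveat is load imbalance: subtrees prune at very different depths, so a work-stealing queue (or many more prefixes than threads) usually beats a static split.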

I can give you the best code, it’s the code that won.

But I don’t think it will help.