Why is all the code in a .h file?

The general procedure for a brute force problem on a GPU is to have each thread in a block explore some section of the problem space. The best answer found, along with the arrangement responsible for it, must then be written to global memory. This can be done via atomic operations or via a block-wise scan/reduction.
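To make that pattern concrete, here is a minimal host-side C++ sketch, not actual kernel code. It assumes each "thread" walks the candidate indices with a stride equal to the thread count and keeps a local best, and that the per-thread results are then combined by a pairwise tree reduction, which is roughly the shape a block-wise reduction takes. The names `Best`, `better`, `bruteForce` and the `evaluate` callback are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical result record: best score seen and the candidate index that produced it.
struct Best {
    double score;
    std::uint64_t index;
};

// Keep the higher score; break ties on the lower index so the result is deterministic.
inline Best better(const Best& a, const Best& b) {
    if (a.score != b.score) return a.score > b.score ? a : b;
    return a.index < b.index ? a : b;
}

// Serial stand-in for a launch of numThreads threads over `total` candidates.
// evaluate(i) is a placeholder for whatever scores candidate i.
template <typename Eval>
Best bruteForce(std::uint64_t total, unsigned numThreads, Eval evaluate) {
    assert(numThreads >= 1);
    const Best none{-std::numeric_limits<double>::infinity(),
                    std::numeric_limits<std::uint64_t>::max()};
    std::vector<Best> local(numThreads, none);

    // Phase 1: "thread" t visits candidates t, t + numThreads, t + 2*numThreads, ...
    for (unsigned t = 0; t < numThreads; ++t)
        for (std::uint64_t i = t; i < total; i += numThreads)
            local[t] = better(local[t], Best{evaluate(i), i});

    // Phase 2: pairwise tree reduction of the per-thread results into local[0].
    for (unsigned stride = 1; stride < numThreads; stride *= 2)
        for (unsigned t = 0; t + stride < numThreads; t += 2 * stride)
            local[t] = better(local[t], local[t + stride]);

    return local[0];
}
```

On a real GPU the two loops over `t` run in parallel, and the final per-block results still have to be merged in global memory, for instance with an atomic operation or a second reduction pass.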

For example, I have some permutation code which generates all distinct permutations of array indices, evaluates each one, and caches the best answer(s).
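One common way to let each thread build its own permutation without talking to other threads is to decode a linear index into a permutation using the factorial number system. The sketch below is plain C++ and is only an assumption about how such a generator could look, not the permutation code mentioned above. `nthPermutation` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Decode rank (0 <= rank < n!) into the rank-th permutation of {0, ..., n-1}
// in lexicographic order, using the factorial number system.
std::vector<int> nthPermutation(std::uint64_t rank, int n) {
    assert(n >= 0 && n <= 20);  // 20! is the largest factorial that fits in 64 bits
    std::vector<std::uint64_t> fact(static_cast<std::size_t>(n) + 1, 1);
    for (int i = 1; i <= n; ++i) fact[i] = fact[i - 1] * static_cast<std::uint64_t>(i);
    assert(rank < fact[n]);

    // Indices that have not been placed yet, kept in ascending order.
    std::vector<int> pool(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) pool[i] = i;

    std::vector<int> perm;
    perm.reserve(static_cast<std::size_t>(n));
    for (int i = n - 1; i >= 0; --i) {
        std::uint64_t digit = rank / fact[i];  // which remaining index comes next
        rank %= fact[i];
        perm.push_back(pool[digit]);
        pool.erase(pool.begin() + static_cast<long>(digit));
    }
    return perm;
}
```

For n = 3, rank 3 decodes to {1, 2, 0}, the fourth permutation in lexicographic order. Because every rank maps to exactly one permutation, thread `t` can simply handle ranks `t`, `t + stride`, and so on.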

Using a GTX 780ti it takes about 8 seconds total to generate all 13! permutations of an array, and it takes another 2 seconds to evaluate all permutations and cache/scan/reduce the optimal answer and permutation responsible for that answer (total 10 seconds).

For 14! it takes 126 seconds for just the generation, and 163 seconds total once the cache/scan/reduce portion is added. Once you get to about 16! you are looking at about 10 hours, so anything beyond that will need multiple GPUs.
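The jump from seconds to hours follows from how fast n! grows. As a back-of-the-envelope check, the C++ sketch below scales a measured time by the ratio of the factorials. It assumes run time is proportional to the number of permutations, which is only roughly true: scaling the 8 second figure for 13! gives 112 seconds for 14!, somewhat below the 126 seconds quoted above. `extrapolateSeconds` is a hypothetical helper.

```cpp
#include <cassert>

// Scale a time measured for measuredN! permutations up to targetN! permutations,
// assuming run time grows in proportion to the permutation count.
double extrapolateSeconds(double measuredSeconds, int measuredN, int targetN) {
    assert(measuredN >= 1 && targetN >= measuredN);
    double seconds = measuredSeconds;
    for (int k = measuredN + 1; k <= targetN; ++k)
        seconds *= k;  // (k)! / (k-1)! = k
    return seconds;
}
```

Starting from 163 seconds for 14!, this gives 163 * 15 * 16 = 39,120 seconds for 16!, a little under 11 hours, and 17! would be 17 times that again.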

It is important to understand the difference between permutations, combinations (subsets) and combinations with repetition.

If you are ‘brute forcing’ a password, for example, that is NOT a permutation problem, because values can repeat in each position. That type actually takes less time to generate on a GPU than permutations, because deriving the n-length candidate is easier when you do not need to enforce the ‘no repetition’ constraint. For example, one can generate, evaluate, scan and reduce 7^16 = 33,232,930,569,601 possibilities for a 16-character password (with only 7 possible values for each position) in about 25 minutes on a Tesla K40, and probably under 20 minutes on a GTX 780ti (not tested yet).
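The reason this case is simpler is that a candidate is just the linear index written in base k, where k is the number of possible values per position. A minimal C++ sketch of that decoding follows; the function names and the alphabet argument are illustrative assumptions, not taken from any particular implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Number of length-`length` strings over an alphabet of `base` symbols.
// The caller must make sure the result fits in 64 bits.
std::uint64_t candidateCount(std::uint64_t base, int length) {
    std::uint64_t total = 1;
    for (int i = 0; i < length; ++i) total *= base;
    return total;
}

// Decode a linear index into the index-th candidate: the index written in
// base alphabet.size(), most significant digit first. Repeats are allowed,
// so no bookkeeping of already used symbols is needed.
std::string nthCandidate(std::uint64_t index, const std::string& alphabet, int length) {
    assert(!alphabet.empty() && length >= 0);
    std::string out(static_cast<std::size_t>(length), alphabet[0]);
    const std::uint64_t base = alphabet.size();
    for (int pos = length - 1; pos >= 0; --pos) {
        out[pos] = alphabet[index % base];
        index /= base;
    }
    return out;
}
```

With the alphabet "abc" and length 2, index 5 decodes to "bc". Compared with decoding a permutation, there is no shrinking pool of unused values to maintain, only a modulo and a divide per position.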

For 3.72^41 you are going to need at least 8 GPUs to exhaustively generate/evaluate that many possibilities in a reasonable time. On a multicore CPU it may take years.

So ultimately you are going to need some heuristic to reduce the problem space, and I am sure that is how most people approached the problem.