My first run took only 47 minutes, which was less than I expected given the huge number of calculations.

The problem I solved is this: you have a 4x4 grid in which each cell holds a number from 0 to 6 inclusive. The goal is to determine which board arrangement has the most rows, columns, and diagonals that sum to exactly ten.
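To make the scoring rule concrete, here is a minimal host-side C++ sketch of how one board could be scored. It assumes "diagonals" means only the two main diagonals (so 10 lines in total) and that the board is stored row-major in 16 bytes; the names `Board` and `score_board` are hypothetical and not taken from the pastebin code.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumption: a board is 16 cells in row-major order, each holding 0..6.
using Board = std::array<std::uint8_t, 16>;

// Count how many of the 10 lines (4 rows, 4 columns, 2 main diagonals)
// sum to exactly `target`. The choice of lines is an assumption.
int score_board(const Board& b, int target = 10) {
    int score = 0;
    for (int i = 0; i < 4; ++i) {
        int row = 0;
        int col = 0;
        for (int j = 0; j < 4; ++j) {
            assert(b[4 * i + j] <= 6);   // cell values must stay in 0..6
            row += b[4 * i + j];         // row i, column j
            col += b[4 * j + i];         // row j, column i
        }
        score += (row == target) + (col == target);
    }
    const int diag = b[0] + b[5] + b[10] + b[15];  // top-left to bottom-right
    const int anti = b[3] + b[6] + b[9] + b[12];   // top-right to bottom-left
    score += (diag == target) + (anti == target);
    return score;
}
```

Under these assumptions the result board in this post scores 10, i.e. every row, column, and main diagonal sums to ten. The same function body should also work as a CUDA device function if it is given the usual qualifiers.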

So there are 7^16 (about 3.3 x 10^13) possible arrangements, and for each one I have to generate the board in local memory, determine its score, cache the best result per warp, cache the best result per block, and then do the reduce/scan step, etc.
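One common way to enumerate the arrangements is to treat each linear index as a 16-digit base-7 number and decode it into a board. The sketch below is plain C++ under that assumption; `decode_board` and `kNumBoards` are hypothetical names, and the pastebin code may map indices to boards differently.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumption: a board is 16 cells, each holding 0..6.
using Board = std::array<std::uint8_t, 16>;

// 7^16 = 33,232,930,569,601. That needs 45 bits, so the index
// must be a 64-bit integer; a 32-bit index would overflow.
constexpr std::uint64_t kNumBoards = 33232930569601ULL;

// Decode a linear index into its 16 base-7 digits,
// least significant digit in cell 0.
Board decode_board(std::uint64_t index) {
    assert(index < kNumBoards);
    Board b{};
    for (int cell = 0; cell < 16; ++cell) {
        b[cell] = static_cast<std::uint8_t>(index % 7);
        index /= 7;
    }
    return b;
}
```

In a kernel, the index would typically be derived from the block and thread IDs plus a per-launch offset, since the full range is far larger than a single grid launch can cover with one board per thread.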

This was my result board:

0 0 5 5

1 6 0 3

4 0 4 2

5 4 1 0

And here is a pastebin of my first-try, beta-version code:

http://pastebin.com/vETn9cPY

Note: it will only be posted for 24 hours.

I used techniques discussed by many members of this board, and I think the code is correct, but who knows?

My next version will do exactly what njuffa recommends, which is to calculate the powers of 7 in the thread rather than reading them from cache.
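One possible reading of that suggestion, sketched in plain C++: keep a running power of 7 in a local variable and multiply it up as each digit is extracted, instead of loading 7^k from a lookup table. The function name `digits_running_power` is hypothetical, and the actual kernel may use the powers differently.

```cpp
#include <cassert>
#include <cstdint>

// Extract the 16 base-7 digits of `index` using a running power of 7
// held in a local variable (a register on the GPU) rather than a table.
void digits_running_power(std::uint64_t index, std::uint8_t out[16]) {
    assert(index < 33232930569601ULL);  // index must be below 7^16
    std::uint64_t pow7 = 1;             // starts at 7^0
    for (int k = 0; k < 16; ++k) {
        out[k] = static_cast<std::uint8_t>((index / pow7) % 7);
        pow7 *= 7;                      // ends at 7^16, which fits in 64 bits
    }
}
```

Whether this beats the cached table depends on the hardware and on how the compiler handles the 64-bit divisions, so it is worth timing both variants.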

Any recommendations for further speedup would be appreciated, as this is my first attempt (I just got the problem yesterday).