As a ‘proof of concept’ for a Magic Square brute force solver for a square matrix 5x5, I implemented a test version which limits the values per location to the set[0,1,2] with the objective of summing the 5 element rows, columns and two diagonals to an input target sum determined by user.
For the case where the target sum is 6 this is the output:
Using single GPU GeForce GTX TITAN X Target number for magic square sum= 6 Total number of distinct 5x5 board arrangements for range 0:2 inclusive(3 possibilites per location)= 847288609443 Lower bound of iterations (not counting constant factors)= 21182215236075 GPU timing= 105607 Optimal score = 12 board= 2 2 0 1 1 2 0 2 0 2 0 1 2 2 1 0 1 1 2 2 2 2 1 1 0 number = 156699320426
NOTE: the running time is in milliseconds so that example took just over 105 seconds, or around 1.76 minutes.
The code is posted here:
The problem is that the application is relatively slow when compared to my earlier version for the 4x4 Magic Square which took about 198 seconds to solve a much larger problem (7^16 or over 33 trillion board arrangements vs. only 3^25 or 847 billion board arrangements for this 5x5 limited version)
I even went to great lengths to avoid 64 bit unsigned integer division with such lovely code as this:
Arr_zero=S_Arr[threadIdx.x]=S_Arr[threadIdx.x]=t2.x= (pos>=282429536481ULL) ? (pos>=564859072962ULL ? 2:1):0;//24 pos= t2.x ? (t2.x==2 ? (pos-564859072962ULL):(pos-282429536481ULL)):pos; S_Arr[threadIdx.x]=t2.x= (pos>=94143178827ULL) ? (pos>=188286357654ULL ? 2:1):0;//23 Arr_zero+=t2.x; pos= t2.x ? (t2.x==2 ? (pos-188286357654ULL):(pos-94143178827ULL)):pos;
Which brought down the running time (when compared to actually performing those 64 bit divisions) by about 65%.
This version uses shared memory as storage which seemed to be slightly faster than using local registers (each thread would exclusively use a pre-determined chunk of shared memory so there would be no contention between threads for shared memory).
The plan was to start with a smaller problem space for the 5x5 then move up, but as it stands now it already is taking much more time than expected. Obviously this is a compute bound problem, but most of the compute is 32 bit integer.
This is a test of idiotically trying every possible configuration, as it is clear there are much faster solutions. Just want to figure out the bottleneck in this example case.
As an aside I found it interesting that for a target value of 5, it did not return the obvious choice of a grid with all ones, rather it returned a different correct configuration:
Target number for magic square sum= 5 Total number of distinct 5x5 board arrangements for range 0:2 inclusive(3 possibilites per location)= 847288609443 Lower bound of iterations (not counting constant factors)= 21182215236075 GPU timing= 107770 Optimal score = 12 board= 2 0 0 1 2 1 1 2 0 1 0 1 0 2 2 0 1 2 2 0 2 2 1 0 0 number = 60364459793
Which illustrates that there are many such configurations and the parallel nature of the implementation may not return the first occurrence of such a correct solution unless the author adds a heuristic to the code which mandates the return of the ‘first’ such correct configuration.