As a ‘proof of concept’ for a Magic Square brute-force solver for a 5x5 square matrix, I implemented a test version which limits the values per location to the set [0, 1, 2], with the objective of making each of the five-element rows, columns, and two diagonals sum to a user-supplied target.

For the case where the target sum is 6 this is the output:

```
Using single GPU GeForce GTX TITAN X
Target number for magic square sum= 6
Total number of distinct 5x5 board arrangements for range 0:2 inclusive(3 possibilites per location)= 847288609443
Lower bound of iterations (not counting constant factors)= 21182215236075
GPU timing= 105607
Optimal score = 12
board=
2 2 0 1 1
2 0 2 0 2
0 1 2 2 1
0 1 1 2 2
2 2 1 1 0
number = 156699320426
```

NOTE: the running time is in milliseconds, so that example took just over 105 seconds, or around 1.76 minutes.

The code is posted here:

https://sites.google.com/site/cudamagicsquare/

The problem is that this application is relatively slow compared to my earlier version for the 4x4 Magic Square, which took about 198 seconds to solve a much larger problem (7^16, over 33 trillion board arrangements, vs. only 3^25, about 847 billion board arrangements, for this limited 5x5 version).

I even went to great lengths to avoid 64-bit unsigned integer division, with such lovely code as this:

```
// Peel off the two most significant base-3 digits of pos using comparisons
// and subtractions instead of 64-bit divides.
// Constants: 282429536481 = 3^24, 564859072962 = 2*3^24,
//            94143178827  = 3^23, 188286357654 = 2*3^23.
Arr_zero = S_Arr[threadIdx.x][0] = S_Arr[threadIdx.x][1] = t2.x =
    (pos >= 282429536481ULL) ? (pos >= 564859072962ULL ? 2 : 1) : 0; // digit 24
pos = t2.x ? (t2.x == 2 ? (pos - 564859072962ULL) : (pos - 282429536481ULL)) : pos;
S_Arr[threadIdx.x][2] = t2.x =
    (pos >= 94143178827ULL) ? (pos >= 188286357654ULL ? 2 : 1) : 0; // digit 23
Arr_zero += t2.x;
pos = t2.x ? (t2.x == 2 ? (pos - 188286357654ULL) : (pos - 94143178827ULL)) : pos;
```

That brought the running time down by about 65% compared with actually performing those 64-bit divisions.

This version uses **shared** memory as storage, which seemed slightly faster than using local registers (each thread exclusively uses a pre-determined chunk of **shared** memory, so there is no contention between threads for **shared** memory).

The plan was to start with a smaller problem space for the 5x5 and then move up, but as it stands it is already taking much more time than expected. Obviously this is a compute-bound problem, but most of the compute is 32-bit integer.

This is a test of idiotically trying every possible configuration; clearly there are much faster approaches. I just want to figure out the bottleneck in this example case.

Any ideas?

As an aside, I found it interesting that for a target value of 5 it did not return the obvious choice of an all-ones grid, but rather a different correct configuration:

```
Target number for magic square sum= 5
Total number of distinct 5x5 board arrangements for range 0:2 inclusive(3 possibilites per location)= 847288609443
Lower bound of iterations (not counting constant factors)= 21182215236075
GPU timing= 107770
Optimal score = 12
board=
2 0 0 1 2
1 1 2 0 1
0 1 0 2 2
0 1 2 2 0
2 2 1 0 0
number = 60364459793
```

This illustrates that there are many such configurations, and that the parallel nature of the implementation may not return the first occurrence of a correct solution unless the code explicitly mandates returning the ‘first’ such configuration.