Simple application not scaling well, trying to figure out reason(s)

As a ‘proof of concept’ for a Magic Square brute-force solver for a 5x5 matrix, I implemented a test version which limits the values per location to the set [0,1,2], with the objective of summing the five-element rows, columns, and two diagonals to a target sum chosen by the user.
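
For reference, the ‘score’ of a board is just the count of the 12 lines (5 rows, 5 columns, 2 diagonals) that hit the target, so a perfect magic square scores 12. A minimal host-side sketch of that scoring (made-up names, not the actual kernel code):

int score_board(const int b[5][5], int target)
{
    // Count how many of the 12 lines (5 rows, 5 columns, 2 diagonals)
    // sum to the target value; a perfect magic square scores 12.
    int score = 0, d0 = 0, d1 = 0;
    for (int i = 0; i < 5; i++) {
        int row = 0, col = 0;
        for (int j = 0; j < 5; j++) {
            row += b[i][j];      // row i
            col += b[j][i];      // column i
        }
        score += (row == target) + (col == target);
        d0 += b[i][i];           // main diagonal
        d1 += b[i][4 - i];       // anti-diagonal
    }
    return score + (d0 == target) + (d1 == target);
}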

For the case where the target sum is 6 this is the output:

Using single GPU GeForce GTX TITAN X

Target number for magic square sum= 6

Total number of distinct 5x5 board arrangements for range 0:2 inclusive(3 possibilites per location)= 847288609443

Lower bound of iterations (not counting constant factors)= 21182215236075


GPU timing= 105607

Optimal score = 12
board=
2 2 0 1 1
2 0 2 0 2
0 1 2 2 1
0 1 1 2 2
2 2 1 1 0

number = 156699320426

NOTE: the running time is in milliseconds, so that example took just over 105 seconds, or around 1.76 minutes.

The code is posted here:

https://sites.google.com/site/cudamagicsquare/

The problem is that the application is relatively slow compared to my earlier version for the 4x4 Magic Square, which took about 198 seconds to solve a much larger problem (7^16, or over 33 trillion, board arrangements vs. only 3^25, or 847 billion, board arrangements for this limited 5x5 version).
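
Using the same ‘boards x cells’ accounting as the “lower bound of iterations” line in the output above (a rough comparison that ignores constant factors):

4x4 version: 7^16 boards x 16 cells ≈ 5.3e14 cell evaluations in ~198 s, roughly 2.7e12 per second
5x5 version: 3^25 boards x 25 cells ≈ 2.1e13 cell evaluations in ~105 s, roughly 2.0e11 per second

So the 5x5 test is getting on the order of 13x less throughput per cell evaluation, and that is the gap I am trying to explain.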

I even went to great lengths to avoid 64-bit unsigned integer division, with such lovely code as this:

// digit 24: compare against 3^24 = 282429536481 and 2*3^24 = 564859072962
Arr_zero = S_Arr[threadIdx.x][0] = S_Arr[threadIdx.x][1] = t2.x =
    (pos >= 282429536481ULL) ? (pos >= 564859072962ULL ? 2 : 1) : 0;
pos = t2.x ? (t2.x == 2 ? (pos - 564859072962ULL) : (pos - 282429536481ULL)) : pos;

// digit 23: compare against 3^23 = 94143178827 and 2*3^23 = 188286357654
S_Arr[threadIdx.x][2] = t2.x = (pos >= 94143178827ULL) ? (pos >= 188286357654ULL ? 2 : 1) : 0;
Arr_zero += t2.x;
pos = t2.x ? (t2.x == 2 ? (pos - 188286357654ULL) : (pos - 94143178827ULL)) : pos;

That brought the running time down by about 65% compared to actually performing those 64-bit divisions.
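
The pattern, generalized (a sketch with a hypothetical pow3 table; the actual kernel fully unrolls this and uses conditional subtracts rather than a multiply):

__device__ void decode_base3(unsigned long long pos, int digits[25],
                             const unsigned long long pow3[25])  // pow3[k] = 3^k
{
    // Peel off the base-3 digits of 'pos' from the most significant cell (24)
    // down to cell 0 using only compares and subtracts, avoiding 64-bit
    // division/modulo entirely.
    for (int k = 24; k >= 0; k--) {
        unsigned long long p = pow3[k];
        int d = (pos >= p) ? ((pos >= 2ULL * p) ? 2 : 1) : 0;  // digit is 0, 1, or 2
        pos -= (unsigned long long)d * p;
        digits[k] = d;
    }
}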

This version uses shared memory as storage, which seemed to be slightly faster than using local registers (each thread exclusively uses a pre-determined chunk of shared memory, so there is no contention between threads for shared memory).
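
The layout is basically one private row of shared memory per thread, along these lines (illustrative sizes and names, not the actual declaration):

#define THREADS_PER_BLOCK 256                    // illustrative block size

__global__ void layout_sketch(void)
{
    // Each thread owns row S_Arr[threadIdx.x]; no two threads ever touch
    // the same element, so no synchronization is needed for this scratch.
    __shared__ int S_Arr[THREADS_PER_BLOCK][25];

    for (int cell = 0; cell < 25; cell++)
        S_Arr[threadIdx.x][cell] = 0;
}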

The plan was to start with a smaller problem space for the 5x5 and then move up, but as it stands it is already taking much more time than expected. Obviously this is a compute-bound problem, but most of the compute is 32-bit integer.

This is a test of idiotically trying every possible configuration; clearly there are much faster approaches. I just want to figure out the bottleneck in this example case.

Any ideas?

As an aside, I found it interesting that for a target value of 5 it did not return the obvious choice of a grid of all ones; rather, it returned a different correct configuration:

Target number for magic square sum= 5

Total number of distinct 5x5 board arrangements for range 0:2 inclusive(3 possibilites per location)= 847288609443

Lower bound of iterations (not counting constant factors)= 21182215236075


GPU timing= 107770

Optimal score = 12
board=
2 0 0 1 2
1 1 2 0 1
0 1 0 2 2
0 1 2 2 0
2 2 1 0 0

number = 60364459793

This illustrates that there are many such configurations, and that the parallel nature of the implementation may not return the first occurrence of a correct solution unless the code explicitly mandates returning the ‘first’ such configuration.

Have you profiled the code?
How about shared memory bank conflicts?

I’d go after:

gld_efficiency,gst_efficiency,shared_replay_overhead

as the first three nvprof metrics to check.
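
Something along these lines, adjusting the executable name:

nvprof --metrics gld_efficiency,gst_efficiency,shared_replay_overhead ./your_app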

Looking at the code, nothing obvious jumps out at me (the code is hard to read due to lack of indentation and syntax highlighting). Have you compared the generated SASS for the 4x4 and 5x5 versions to see if they look different in unexpected ways? The way the divisions are taken care of looks similar to what you used in the 4x4 case, if memory serves.
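
For the SASS comparison, something like this would do (hypothetical file names):

cuobjdump -sass magic_4x4.exe > sass_4x4.txt
cuobjdump -sass magic_5x5.exe > sass_5x5.txt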

I second txbob’s recommendation to compare the profiler stats for the 4x4 and 5x5 case. Since global memory does not seem to come into play I would look at shared memory statistics, as well as metrics related to instruction execution such as dependency stalls and branch divergence.

txbob and njuffa, as usual I appreciate your assistance.

Sorry about the weird code formatting on the Google Sites page; Google Sites mangled the final page, which looked fine when I previewed it, so users now see that monstrous mass of code.

Here are some of the events and metrics from nvprof:

==5452== Event result:
Invocations                                Event Name         Min         Max         Avg
Device "GeForce GTX TITAN X (0)"
        Kernel: void Five_five_1<unsigned int=131072>(unsigned int*, __int64*, unsigned int
          1                            warps_launched    51714392    51714392    51714392
          1                                    branch  2.6639e+10  2.6639e+10  2.6639e+10

==5452== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max     Avg
Kernel: void Five_five_1<unsigned int=131072>(unsigned int*, __int64*, unsigned int)
 1                              ipc                              Executed IPC           3.665441    3.665441    3.665441
 1                             inst_per_warp                     Instructions per warp  2.0385e+05  2.0385e+05  2.0385e+05
 1                            gld_efficiency             Global Memory Load Efficiency       0.00%       0.00%       0.00%
 1                            gst_efficiency            Global Memory Store Efficiency      18.75%      18.75%      18.75%
 1      shared_load_transactions_per_request  Shared Memory Load Transactions Per Requ    1.666667    1.666667    1.666667
 1                  shared_load_transactions                  Shared Load Transactions    32321495    32321495    32321495
 1                    shared_load_throughput             Shared Memory Load Throughput  37.517MB/s  37.517MB/s  37.517MB/s
 1                         shared_efficiency                  Shared Memory Efficiency      67.13%      67.13%      67.13%

I also ran nvvp, which had some additional data:

http://i.imgur.com/TkiapOa.png?1

Occupancy is shown as 100%, but I am not sure that helps much. The global memory operations are trivial, but there is a lot of shared memory and local register activity.

The approach which uses registers instead of shared memory (other than a small amount used for the reduction) is a bit slower than the posted shared-memory approach.

Will profile further and post those results.

100% occupancy and an IPC of 3.67 look good. The shared memory loads aren’t fully efficient, suggesting some bank conflicts? Try some padding of the large shared memory array, e.g. [257][7]. The code does not seem to be limited by shared memory bandwidth, unless I am misinterpreting the numbers.
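
For illustration, the standard form of the padding trick on a generic tile (not your exact array):

__global__ void pad_sketch(void)
{
    // When the 32 threads of a warp access one column, tile[lane][c], the
    // addresses are one row stride apart.  With a stride of 32 words they
    // all hit the same bank (there are 32 banks); padding the row to 33
    // words makes consecutive rows land in different banks.
    __shared__ unsigned int tile[32][33];        // 33 instead of 32 = +1 pad

    unsigned int lane = threadIdx.x & 31;
    tile[lane][0] = lane;                        // conflict-free column write
}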

Since you state that the scaling from 4x4 to 5x5 is poor, I would compare the profiler metrics for the 4x4 case as well, to see whether there are any significant differences from the 5x5 case.

BTW, what is the time complexity of this algorithm in O() notation? Is it possible that previous estimates are off and that increasing the dimension from 4 to 5 increases the amount of work much more than you thought initially? Also, based on your puzzlement about some of the solutions found (why not all 1s as the first solution?), is it possible there is a bug somewhere causing this to run much longer than anticipated?

I haven’t studied the code in detail. But the reason I asked for shared_replay_overhead is that it answers the question “are there bank conflicts?”

We can try to guess at bank conflicts based on shared_load_transactions_per_request, but this can also be skewed by loading 64-bit quantities across a warp, which will also generate two transactions per request. You have a 64-bit shared animal called mask_val, but the warp-wide loads/stores of this seem quite limited. So I’m a little suspicious of shared bank conflicts. Even if there are bank conflicts, the 1.667 transactions per request is not horrible, and sometimes bank conflicts are “unavoidable”. But if your code has a large component of shared loads/stores, and you can drive this number close to the ideal 1.000, then that could provide a significant speed up. 32M shared load transactions does not seem like a lot for an application that runs for ~100 seconds, so this may be a complete waste of time.

The better approach is to methodically assess the limiters. Recent versions of nvvp should immediately spit out whether your code is compute-bound or memory-bound. The nvvp output you show reports low shared memory utilization and low global memory utilization, so I’m guessing the app is compute-bound, probably integer-bound. The IPC looks not too bad for a Maxwell GPU.

A quick general assessment of whether stalls/latency are a concern can be made with the sm_efficiency metric.

The tested implementation does not attempt to find the first correct solution (at least I did not deliberately write code which attempts to). There are multiple correct solutions, and I will change the code to find the ‘first’, which should be the same as the smallest candidate 64-bit value that achieves that maximum score.
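
The tie-break I have in mind for the reduction would look something like this (a sketch with made-up names, not the current kernel code):

struct Candidate {
    int                score;   // number of rows/columns/diagonals hitting the target
    unsigned long long index;   // 64-bit linear board index (base-3 encoding)
};

__device__ Candidate better(Candidate a, Candidate b)
{
    // Prefer the higher score; on a tie prefer the smaller index, so the
    // reduction deterministically reports the 'first' optimal board.
    if (a.score != b.score)
        return (a.score > b.score) ? a : b;
    return (a.index < b.index) ? a : b;
}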

While a bug is possible, I did test against a CPU implementation, and the results match up so far.

I think you are right in that I may have made erroneous assumptions concerning the performance impact of the increase from 16 board locations to 25.

Thanks for the advice as I quite often learn important lessons from such ‘hobby’ projects.