if thats for your thesis you should have a look at triangular numbers!
you can output the sums linear and use a 1D array and so one thread can calculate the element at its index
algorithm:
you have n numbers in your input array which lead to n*(n-1)*0.5 output combinations (see triangular formular with 0-based indices)
example n=4:
nOut = 6
direct:
tIdx: 012345
val1: 000112
val2: 123233
or flipped which i calculated in my thesis:
val1: 122333 (row-index)
val2: 001012 (column-index)
so all you have to do is:
calculate the row and column indices of the linearized output triangular matrix in your kernel
(if you have one indices you easily have the other one)
and look with those indices into your array and output the value at the unique thread index
with that unique thread index this algorithm is 100% scalable across all gpu’s and achieves 100% occupancy, so you can calculate it for endless n
I used this approach for a different pairwise triangular matrix problem in my thesis. cluster-algorithm, see page 6
How to calculate row indices
sequence of row indices is:
1223334444 …
formula
row = floor(0.5f + sqrt(0.25f - 2 * (1 - linIdx))); (for 1 based indices)
because of computer precisions in floating point numbers this formula cannot be applied directly to calculate the row, the calculated row needs to be checked up and downwards to be sure
/*
Example: a 0-based index like 496 can result in wrong results
row = floor(0.5f + sqrt(0.25f + 2 * 496));
= floor(0.5f + sqrt(0.25f + 992));
= floor(0.5f + sqrt(992.25f));
= floor(0.5f + 31.5f); or floor(0.5f + 31.49f); <-- if its 31.49 due to precision its rounded down!
= floor(32.0f); or floor(31.99f) <-- can be 31 instead which is wrong! so we need to check +-1
= 32; or 31;
*/
(calculating this is equally fast as a lookup into global memory on a G80, fermi not tested)
With the row indices the corresponding column indices can easily be calculated.
I needed my row indices on cpu side so i calculated them with multiple cpu-threads so i only needed to check up and down once per thread and the rest of the numbers one cpu-thread is responsible of were calculated in a loop.
Easy Alternative:
use a 2D-Block layout so you dont need to calculate the row and column indices, but half of your threads wont be active since your matrix is a triangular matrix or they calculate with repetition. I assume that is what you already did from your thread title.
If you need more help to calculate the row indices i can post my code which would be almost equal to your complete necessary kernel^^