I have N elements, each element could have M values which is sorted, we need to parallel find longest sequence with same value
For example: we have 3 element sequence : A B C,
A’s values can be 1, 2, 3,
B’s value can be 2, 3, 4,
C’s value can be 4, 5 ,6 ,
then when A, B, C value all is 4, the longest common value sequence length is 3, it is 4,4,4
If C’s value can only be 5, 6, 7, then the longest sequence length is 2, it could be 3,3 or 4,4

How can we parallel do this efficiently? I think it is limited by the max value of M since shared memory is limited. so we can not load too many values to compare.

This requires a 2D table which keeps track of which row (N) has which number present (M).

For example:

int IsPresent[N][M];

In case input rows are of different lengths then row[N][I]:

Each thread N could walk through row N with I and fill table IsPresent for I.

IsPresent[N][ Row[N][I] ] = 1;

All that remains to do is a “run length” on each column M for all N: IsPresent[ALL][RUN LENGTH]

Perhaps it’s better to swap indexes on this algorithm so the run length which could be the most compute intensive part can be done horizontally/sequentially.

However filling IsPresent also seems somewhat memory bound because of it’s 2D dimensional/seeking nature, swapping those indexes would probably also make it more horizontal/sequentially, so this algorithm can be quite fast if turned 90%

Only problem is with IsPresent, for your problem size it would be 256 GB or so, at least for IsPresent.

You have that kind of memory ?

Run length can probably also be done partially, remember the last partial runlength, so that the problem can be divided as to what fits in memory, and continue from last partial sum if any.

N and M are both less than 16 can be done totally within shared

N and M are both larger than 2^20 (10^6) is a problem

N or M large and one of them modest size is a third case

========

I have four possible approaches so far

(1) each thread looks for a unique value e.g. thread 0 of block 0 only looks for value 0

this means that there is one thread per possible value

assuming 128 threads per block

block 0 is only looking for values between 0 and 127, block 1 the next 128 values etc.

As each element in N is sorted each block only has to read part of that N (and can use a binary search to speed that up)

Do all reads from global memory using warp 0 and saving the 32 elements in shared

Each thread then only has to keep track of a single sequence, and only needs to record its length and which N it started in

Can extend this by designing for ILP (Instruction Level Parrallelism) and have each thread search for say 4 values so thread 0 would do values 0,1,2,3. thread 1 the 4,5,6,7, etc. uses more registers but 1/4 the number of blocks

(2) Have a kernel that summarises each N into a smaller table P with each N assigned to a column

then 2nd kernel set up along approach (1) above but using the smaller table to tell it which N’s contain data relevant to that block

so if N[65] will be relevant to block 123 then P[123,65] will be set and P[123,65] through to P[123,95] will be contiguous (or even more if go to bit level )

data in P can be used to imply the possibility of a sequence (or a sequence being broken)

(3) Relational algebra more complex

(4) invert your data and resort

e.g.

A contains 1,2,3,4 B contains 2,3,4,5 C contains 4,5,6,8

becomes

1A,2A,3A,4A 2B,3B,4B,5B, 4C,5C,6C,8C then when sorted becomes

1A,2A,2B,3A…4A,4B,4C…

each block can now scan a big chunk of this and can apply some pattern matching tricks to reduce computation.

Ideal number of blocks probably about 6 per SM in your GPU