Guidance on Parallelizing an Algorithm

I’m looking for guidance on a structural approach to a specific algorithm that I’ve implemented as a serial algorithm.

I have a matrix of integers (range from 1 through 20 or so) , roughly 11,000 rows by 20 columns (each row corresponds to a set of information on persons while each column represents a different piece of information; in other words, I have a set of data on 11,000 persons).

The goal of the algorithm is to find out, for each person, which combinations (if any) of the 20 pieces of information are unique to that person AND where any subcombination of a unique combination is not unique.

Example:
Person A has a unique set of values for columns 1, 2, and 3 BUT Person A also has a unique value for column 1; therefore the combination of the values from columns 1, 2, and 3 is not a combination that I would need to keep track of.

The algorithm works like this:

  1. Examine each of the 20 choose 1 combinations of the 20 columns and identify which values of each combination is unique. Record that information.
  2. Examine each of the 20 choose 2 combinations of the 20 columns. Identify pairs of values for each combination to identify unique pairs and then exclude pairs where any of the 2 choose 1 values are unique. (e.g. the data pair 2,3 may be unique but if 2 is unique (with respect to the other records) then exclude the data pair 2,3)
  3. Examine each of the 20 choose 3 combinations of the 20 columns. Identify triplets that are unique and exclude those triplets if any of the values by themselves are unique or if any pair of the three values is unique.
    … (Continue examining the 20 choose 4 combinations up to the 20 choose 20 combination)
  4. Finally, examine the single 20 chose 20 combination for unique patterns (e.g. all 20 values are unique) but exclude those unique patterns where any of the single values, pairs of values, triplets of values up through sets of 19 values are unique.

As you can imagine, there are lots of computations going on and this algorithm isn’t efficient from a reduction in the number of computation steps. I had originally used trees to create frequency distributions of each column, then frequency distributions of each pair of columns, up through frequency distributions of each set of 19 columns, but the amount of memory was excessive. The approach I settled on has redundant computations but the memory footprint is small.

It takes about 9 hours to do this on a Intel I7 940.

Now, on to hopefully using GPUs.

The data that needs to be examined doesn’t take up much memory so I suppose I should copy it over to the GPU memory.

What I was wondering was what structural approach would work well if I wanted to:

  1. Operate on each of the 2^20 possible combinations in a parallel fashion
  2. I was thinking that each of the combinations would be read from a single copy of the data in GPU global memory and that the GPU assigned to process a given combination would have to apply the algorithm listed above (well the correct step of the algorithm listed above). e.g. if one GPU was given 5 columns from the 20, then the GPU would first identify the records with unique sets of values for the 5 columns and then determine if any of the subcombinations of those 5 values are unique (if so, ignore the 5 way unique value).
  3. As each GPU finished its computation, it would return an 11,000 by 1 array of integers (1 or 0 to indicate if a particular set of values was unique for a given record). The array would be sent to host memory.

The host would need to know which combination of the 20 columns is associated with each array returned from a given GPU.

Any suggestions on how I can keep track of that?

Any other general thoughts about whether or not this algorithm is even worth it to try adjust for use on GPUs?

Regards

p.s. Perhaps it would be possible on the CPU side to instantiate a thread for each of the 2^20 combinations and do it in such a way as to be able to associate thread ID with a particular set of the 20 columns (I use C++ bitsets to do this now but I was concerned about using c++ code in conjunction with nvcc).