Word stemming, good match for the GPU?

I'm considering adapting the classical Porter stemming algorithm (the C version can be found here: http://tartarus.org/martin/PorterStemmer/c.txt) to the GPU. It seems there are a few avenues that could be taken, but none of them screams out as a perfect match.

Any thoughts on this one?

How are you approaching the parallelism? The simplest approach would be to load a large set of words into memory and let each thread handle one word. The variable-sized inputs and outputs will be a bit of a challenge. Sorting the words by length, so that the input words within a block are likely to be the same size, would help. That doesn't solve the output problem, though, since same-sized input words produce stems of differing lengths. Perhaps staging the output in shared memory would solve that. (Or just pad the output words with nulls so the output is uniform in size.)

Sorting is out of the question due to the data format and delivery constraints. I've given it some thought and don't think the varying sizes of the inputs and outputs will really matter. I'm going to attack the problem from a different angle. I'll let you know how it goes, thanks for the feedback.