similar string search possible with CUDA?


I want to do a similar string search.

->Is this possible using CUDA? (Can you run such a thing on a GPU?)

It would work in the following way:

-the user enters words, names, sentences, whatever.
-these words are split up into 3-grams (substrings of length 3).
-when something new is entered by the user, the new input is
split up into 3-grams, too, and the previous input with the most
matching 3-grams is searched for.


PREVIOUS INPUT 1: hello dave
3-grams: hel,ell,llo,lo ,o d, da,dav,ave*

PREVIOUS INPUT 2: bye dave
3-grams: bye*,ye *,e d, da,dav,ave*

CURRENT INPUT: bye grave

3-grams of the current input:
bye,ye ,e g, gr,gra,rav,ave

The second previous input would be chosen as the most similar one,
as it has 3 matching 3-grams (marked with *); the first input has only 1.

Can this be implemented (ingeniously) with CUDA?
I could test many previous inputs against the current input at once,
using one core per previous input (I think?!?).

Is there a better way to do the similarity search
(using trees or something similar; if yes, what would such a tree look like)?

Thanks a lot,


Has really nobody an idea?
Come on, surely you’re all CUDA experts?? ;)
I’m just a beginner.

As long as you have multiple instances of the same exact problem, things should go smoothly. But you would need a lot (thousands+) of previous strings for it to be worth the effort (I'm guessing).

I would use one thread per previous input.

I could see some problems as to where to store the strings, since you can't malloc inside a kernel, unless you have an upper bound on the size of the input strings.
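Assuming such an upper bound exists, one way around the allocation problem is to pack the strings as fixed-size, zero-padded records in one flat device array. A rough sketch of what such a kernel could look like (illustrative only, with an assumed 32-char limit; the naive scoring loop is not a tuned implementation):

```cuda
// Sketch: one thread scores one stored string against the query by counting
// shared 3-grams. Fixed-size, zero-padded records avoid device-side malloc.
#define MAX_LEN 32

__global__ void scoreStrings(const char* strings,  // numStrings * MAX_LEN, zero-padded
                             int numStrings,
                             const char* query,    // MAX_LEN, zero-padded
                             int queryLen,
                             int* scores)          // one score per stored string
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numStrings) return;

    const char* s = strings + idx * MAX_LEN;
    int len = 0;
    while (len < MAX_LEN && s[len] != '\0') ++len;

    // Naive O(len * queryLen) 3-gram comparison; acceptable for 32-char strings.
    int matches = 0;
    for (int i = 0; i + 3 <= len; ++i)
        for (int j = 0; j + 3 <= queryLen; ++j)
            if (s[i] == query[j] && s[i + 1] == query[j + 1] && s[i + 2] == query[j + 2]) {
                ++matches;
                break;  // count each stored 3-gram at most once
            }
    scores[idx] = matches;
}
```

Since all threads read the same query, it could also live in constant memory; the flat per-string layout keeps the records aligned, though interleaving characters column-wise would coalesce better.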

Consider going down the string-alignment path; you may have better luck. Something similar has been done with the Smith-Waterman algorithm in CUDA, which is used for DNA alignment. Be careful about implementing trees in CUDA: they can grow quickly, producing divergent branching among threads, and their memory access patterns may not be consistent, so you will not be able to take advantage of memory coalescing.

Aaah, now the answers come ;)
Thanks folks!

@Ailleur: ok, no problem, the string length is limited to 32 chars. I have a few million of them (5 M or so).
One thread per string - two hundred threads on a GTX260 - this is gonna rock!!!

@yummig: could you please give me some details on what a ‘string alignment path’ is? Maybe I know it and just don’t know
the (English) name. Sorry, I’m a newbie :unsure:

Sure, I recently gave a talk/lecture to some undergraduates on dynamic programming in this area; see the attached PDF file. If you still get stuck, just give me a shout. I hope this gets you closer to what you want to achieve.
Essential_Algorithms.pdf (2.65 MB)

Uh, certainly not. Two hundred threads is the bare minimum to get something usable, but I doubt it will be worth the effort.

Not sure about GTX260, but on GTX280 a really good value would be

30 processors * 8 blocks per processor * 64 threads per block = 15360 threads

If you want it to scale well with future hardware you are two orders of magnitude off.

Uh, ok! ;)

I thought of 200 threads because this GTX260 is said to have around 200 stream processors.

On my Intel Core 2 Duo you can set up 2 threads, one per core (actually 4, I think, because of hyperthreading). Correct me if I’m wrong.

I’ll better read the CUDA documentation before I continue talking…

Thanks, yummig, I’ll read your document…

You will want (hundreds of) thousands of threads for high CUDA performance. In general, writing CUDA programs with the number of multiprocessors in mind is a flawed approach.

CPUs and GPUs are very different architectures. As an extremely crude approximation, you can think of each GPU “processor” as being massively hyperthreaded. Specifically, on the GTX 280, each MP (made up of 8 ALUs) can keep 1024 threads in flight. This interleaved execution hides the latency of memory accesses. It is all explained very nicely in the first few chapters of the programming guide.


Ok I checked the document you sent me.
Maybe I have not understood it, but how can I use string alignment for a similar-string search?
The aim of string alignment is to find out what’s missing and what’s surplus in an input compared to a stored dataset.
But what if this dataset consists of a few million items?
I suppose creating the dynamic programming matrix for 5 million items is MUCH slower than comparing a few million 3-grams?
Did I misunderstand your document, or am I right?