Looking for an example of a simple word search in a line of text

Hello, I am looking for a CUDA example of the following:

I have 1 million strings, such as a web-server access log, and I want to test CUDA's effectiveness at returning log lines by doing substring matches on those lines.
For example, suppose I have the following text:

file.txt:

The quick brown
fox jumps over
the lazy dog

I want to run ./program file.txt fox and get

fox jumps over

on stdout.
Note that file.txt can have millions of entries. Is something like that easy to do with CUDA? I am curious about parallel string and substring matching; basically an implementation of Linux/Unix grep via CUDA.
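For reference, the plain CPU version of what I am after is essentially the following sketch (grep_buffer is just an illustrative helper name; real code would read the file from disk rather than take an in-memory buffer):

```c
#include <stdio.h>
#include <string.h>

/* Print every line of `text` that contains `needle`; return the match count.
   Minimal CPU baseline sketch, operating on an in-memory buffer. */
static int grep_buffer(const char *text, const char *needle)
{
    int matches = 0;
    const char *line = text;
    while (*line) {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);

        /* Copy the line so strstr sees a NUL-terminated string */
        char buf[1024];
        if (len >= sizeof(buf)) len = sizeof(buf) - 1;
        memcpy(buf, line, len);
        buf[len] = '\0';

        if (strstr(buf, needle)) {
            puts(buf);
            matches++;
        }
        line = end ? end + 1 : line + len;
    }
    return matches;
}
```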

Thanks.

-jack

Is this a practical question, or a "How would I do this in CUDA just as an exercise" question?

Practically, it's not suited to CUDA simply because it's dominated by the time it takes to read the data from disk anyway.
So you may as well use a classic Boyer-Moore algorithm on the CPU. In fact, disks are so slow that even a dumb brute-force character-by-character search on the CPU will still be disk-speed limited.

If, just as an exercise, you're interested in if/how you'd do it in CUDA, the broad outline of one approach would be to assign each block a chunk of the text, perhaps 15 KB per block. Overlap the chunks a bit (by the length of the search string). Each block loads its 15 KB of text into shared memory. Each thread then runs Boyer-Moore (or even a dumb character-by-character search) sequentially on a subset: thread 0 might do bytes 0 to 32+N, thread 1 bytes 32 to 64+N, and so on, where N is the length of the search string. If any thread finds a match, it reports it, perhaps by using an atomic add to push "found a match at this offset" onto a result queue.

There are lots of other ways you could organize it… on Fermi you might even skip the loading into shared memory entirely and just depend on the L1 cache.

Again, such a tool wouldn't be useful in practice, but it'd be an interesting programming exercise.
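For reference, here is a minimal sketch of the CPU-side search mentioned above, using the Horspool simplification of Boyer-Moore (bad-character shifts only, no good-suffix rule). Per-thread in the CUDA version, each thread would run this over its overlapped byte range:

```c
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool: return the offset of the first occurrence of
   needle in haystack, or -1 if absent. Sketch, bad-character rule only. */
static long bmh_search(const char *haystack, size_t hlen,
                       const char *needle, size_t nlen)
{
    if (nlen == 0 || nlen > hlen)
        return nlen == 0 ? 0 : -1;

    /* Bad-character shift table: default shift is the full needle length */
    size_t shift[256];
    for (size_t c = 0; c < 256; c++)
        shift[c] = nlen;
    for (size_t i = 0; i + 1 < nlen; i++)
        shift[(unsigned char)needle[i]] = nlen - 1 - i;

    size_t pos = 0;
    while (pos + nlen <= hlen) {
        if (memcmp(haystack + pos, needle, nlen) == 0)
            return (long)pos;
        /* Shift by the last character of the current window */
        pos += shift[(unsigned char)haystack[pos + nlen - 1]];
    }
    return -1;
}
```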

Here is my attempt at writing a program that runs on multiple cores and simply matches a character in a struct string.

It runs and shows the character position of 'b' in "small brown fox jumped over lazy dog". I read back the array after the kernel runs and get position 6, e.g.:

0 6
1 6
2 6
3 6
4 6
5 6
6 6
7 6
8 6
9 6

And so on. I expected these results and was very happy to see them. But when I changed the word in the sentence to "xrown" and still searched for 'b', I got puzzling results. Here they are:

21 0
22 0
23 0
24 0
25 6
26 6
27 6
28 6
29 6

In the middle of the array output, at element 25, I start getting 6's again, as if the text in the sentence either was not matched or was somehow left over from a previous run. This is my first CUDA program, so I am a bit clueless about what that might be. Could someone please take a look and educate me?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda.h>

struct myline { // Simple structure to store a string with a length
    char data[512];
    short length;
};

// Kernel that executes on the CUDA device
__global__ void square_array(char *a, char *b, int *W, struct myline *mylines, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N)
        return;

    int length = mylines[idx].length;
    for (short foundC = 0; foundC < length; foundC++) {
        if (mylines[idx].data[foundC] == 'b')
            W[idx] = foundC;
    }
    // if (a[idx] == b[idx])
    //     W[idx] = 1;
}

// main routine that executes on the host
int main(void)
{
    dim3 threads(512, 1, 1);
    dim3 grid(128, 1, 1);

    const int N = 100; // Number of elements in arrays
    struct myline *mylines = (struct myline *)malloc(N * sizeof(struct myline));
    char *a_h1, *a_h2, *a_d1, *a_d2;
    int *w_h, *a_w1;
    size_t size = N * sizeof(char);

    a_h1 = (char *)malloc(size);                 // Allocate arrays on host
    a_h2 = (char *)malloc(size);
    w_h  = (int *)malloc(N * sizeof(int));

    cudaMalloc((void **)&a_d1, size);            // Allocate arrays on device
    cudaMalloc((void **)&a_d2, size);
    cudaMalloc((void **)&a_w1, N * sizeof(int));

    struct myline *d_lines;
    cudaMalloc((void **)&d_lines, N * sizeof(struct myline));

    // Initialize host arrays and copy them to the CUDA device
    for (int i = 0; i < N; i++) {
        strcpy(mylines[i].data, "small brown fox jumped over lazy dog"); // strcpy also copies the '\0'
        mylines[i].length = 36;
    }
    for (int i = 0; i < N; i++)
        printf("%s\n", mylines[i].data);

    for (int i = 0; i < N; i++) a_h1[i] = 'a';
    for (int i = 0; i < N; i++) a_h2[i] = 'a';
    for (int i = 0; i < N; i++) w_h[i] = 0;

    cudaMemcpy(a_d1, a_h1, size, cudaMemcpyHostToDevice);           // first array
    cudaMemcpy(a_d2, a_h2, size, cudaMemcpyHostToDevice);           // second array
    // Flag array: must copy N * sizeof(int) bytes. Copying only "size"
    // (N chars) initializes just the first N/4 ints on the device, which
    // leaves stale values in the rest -- the source of the "6"s above.
    cudaMemcpy(a_w1, w_h, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_lines, mylines, N * sizeof(struct myline), cudaMemcpyHostToDevice);

    // Do calculation on device
    square_array<<<grid, threads>>>(a_d1, a_d2, a_w1, d_lines, N);

    // Retrieve results from device and store them in host arrays
    cudaMemcpy(a_h1, a_d1, sizeof(char) * N, cudaMemcpyDeviceToHost);
    cudaMemcpy(a_h2, a_d2, sizeof(char) * N, cudaMemcpyDeviceToHost);
    cudaMemcpy(w_h, a_w1, sizeof(int) * N, cudaMemcpyDeviceToHost);

    // Print results
    for (int i = 0; i < N; i++) printf("%d %d\n", i, w_h[i]);

    free(mylines); free(a_h1); free(a_h2); free(w_h);
    cudaFree(a_d1); cudaFree(a_d2); cudaFree(a_w1); cudaFree(d_lines);
    return 0;
}

Fixed; it was missing a break there (and the flag array needs a default so stale values never survive):

__global__ void square_array(char *a, char *b, int *W, struct myline *mylines, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N)
        return;

    int length = mylines[idx].length;
    W[idx] = 0; // default: no match (note that 0 is also a valid match position)
    for (short foundC = 0; foundC < length; foundC++) {
        if (mylines[idx].data[foundC] == 'a') {
            W[idx] = foundC;
            break; // stop at the first match
        }
    }
    // if (a[idx] == b[idx])
    //     W[idx] = 1;
}

Hopefully this example is useful to someone :) Now onwards to implementing strstr() in a kernel.

-Jack
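Since the device runtime has no strstr(), that follow-up needs a hand-rolled version. A plain-C sketch of the per-thread search loop (my_strstr is a hypothetical helper name; it calls no library functions, so the same body would work as a __device__ function):

```c
/* Naive substring search: return the offset of the first occurrence of
   needle in haystack, or -1 if absent. Sketch of a device-friendly helper. */
static int my_strstr(const char *haystack, int hlen, const char *needle, int nlen)
{
    for (int i = 0; i + nlen <= hlen; i++) {
        int j = 0;
        while (j < nlen && haystack[i + j] == needle[j])
            j++;
        if (j == nlen)
            return i; // full needle matched at offset i
    }
    return -1;
}
```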

For those who are interested, here is an implementation of the Knuth-Morris-Pratt (extended brute-force) algorithm. The idea is that if you have lots of log lines in a file, you can load a structured array into GPU memory and do a massively parallel search on your GPU. Below is an example of the kernel, which takes needle and haystack variables; here 'abc' is the needle, and the haystack is any line of text you can imagine. For my purposes I used a 50,000-line text file (2 MB on disk), which I first loaded as an array into main RAM and then moved into GPU global RAM; from there I simply launched my kernel with N = 50,000 possible threads.

Potentially, if you load a 100 MB file into GPU memory, I envision a benefit for simple entity-extraction tasks such as scanning for IP addresses or browser user agents. This is especially adaptable in Hadoop/map-reduce environments, where each map task can launch a stream task and offload data to your GPU for processing while keeping your CPUs busy with other tasks.

Here you can see that the pattern does match.

input file:

49983 abc small brown fox jumped over lazy dog
49984 abc small brown fox jumped over lazy dog
49985 abc small brown fox jumped over lazy dog
49986 small brown fox jumped over lazy dog
49987 small brown fox jumped over lazy dog
49988 small brown fox jumped over lazy dog
49989 small brown fox jumped over lazy dog
49990 small brown fox jumped over lazy dog
49991 abc small brown fox jumped over lazy dog
49992 small brown fox jumped over lazy dog
49993 small brown fox jumped over lazy dog
49994 small brown fox jumped over lazy dog
49995 small brown fox jumped over lazy dog
49996 abc small brown fox jumped over lazy dog
49997 abc small brown fox jumped over lazy dog
49998 abc small brown fox jumped over lazy dog

output:

49993 999
49994 999
49995 999
49996 0
49997 0
49998 0
49999 0
50000 0
50001 999
50002 0
50003 0
50004 0
50005 0
50006 999
50007 999
50008 999
50009 999
50010 999

__global__ void search(char *needle, int *W, int needlelen, struct myline *mylines, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N)
        return;

    int i, j, Next[1000]; // KMP failure table; only needlelen + 1 entries are used

    // Hardcoded needle for testing; this overrides the needle/needlelen
    // parameters (every thread writes the same bytes, so the race is benign).
    needlelen = 3;
    memcpy(needle, "abc", 3);

    int linelen = mylines[idx].length;
    char *haystack = mylines[idx].data;

    /* Preprocessing: build the failure table */
    i = 0;
    j = Next[0] = -1;
    while (i < needlelen) {
        while (j > -1 && needle[i] != needle[j])
            j = Next[j];
        Next[++i] = ++j;
    }

    /* Searching */
    W[idx] = 0; // default: no match
    i = j = 0;
    while (j < linelen) {
        while (i > -1 && needle[i] != haystack[j])
            i = Next[i];
        i++;
        j++;
        if (i >= needlelen) {
            W[idx] = 999; // flag a match and stop
            break;
        }
    }
}
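The same failure-table construction and scan, pulled out as a plain-C host function so it can be checked on the CPU (kmp_match is an illustrative name; the loop structure mirrors the kernel, returning 1 on match and 0 otherwise):

```c
/* KMP search mirroring the kernel's logic: build the failure table Next[],
   then scan the haystack. CPU sketch; assumes needlelen < 256. */
static int kmp_match(const char *needle, int needlelen,
                     const char *haystack, int linelen)
{
    int i, j, Next[256];

    /* Preprocessing: Next[i] is the widest border of needle[0..i) */
    i = 0;
    j = Next[0] = -1;
    while (i < needlelen) {
        while (j > -1 && needle[i] != needle[j])
            j = Next[j];
        Next[++i] = ++j;
    }

    /* Searching: on a mismatch, fall back via the failure table */
    i = j = 0;
    while (j < linelen) {
        while (i > -1 && needle[i] != haystack[j])
            i = Next[i];
        i++;
        j++;
        if (i >= needlelen)
            return 1;
    }
    return 0;
}
```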
