Hello, I am looking for a CUDA example of the following:
I have 1 million strings, such as webserver access log lines, and I want to test CUDA's effectiveness at returning log lines by doing substring matches on them.
For example, suppose I have the following text:
file.txt:
The quick brown
fox jumps over
the lazy dog
I want to run ./program file.txt fox and get
fox jumps over
on stdout.
Note that file.txt can have millions of entries. Is something like that easy to do with CUDA? I am curious about parallel string and substring matching. Basically an implementation of linux/unix grep via CUDA.
Is this a practical question or a “How would I do this in CUDA just as an exercise” question?
Practically, it’s not well suited to CUDA simply because it’s dominated by the time it takes to read the data from disk anyway.
So you may as well use a classic Boyer-Moore algorithm on the CPU. In fact, disks are so slow that even a dumb brute-force character-by-character search on the CPU will still be disk-speed limited.
If, just as an exercise, you’re interested in if/how you’d do it in CUDA, the broad outline of one approach would be to assign each block a chunk of the text, perhaps 15K per block. Overlap the chunks a bit (by the length of the search string). Each block loads its 15K of text into shared memory. Each thread then runs Boyer-Moore (or even a dumb character-by-character scan) sequentially on a subset (i.e., thread 0 might look at bytes 0-(32+N), thread 1 at bytes 32-(64+N), etc., where N is the length of the search string). If any thread finds a match, it reports it, perhaps by using an atomic add to append to a result queue saying “found a match at this offset.”
There are lots of other ways you could organize it… on Fermi you might even skip loading into shared memory entirely and just depend on the L1 cache.
Again, such a tool wouldn’t be useful in practice, but it’d be an interesting programming exercise.
Here is my attempt at writing a program that runs on multiple cores and simply matches a character in a string stored in a struct:
It runs and shows the character position of “b” in “small brown fox jumped over lazy dog”. I read back the array after the kernel runs and get position 6, e.g.
0 6
1 6
2 6
3 6
4 6
5 6
6 6
7 6
8 6
9 6
And so on. I expected these results and was very happy to see them. Now, when I changed the word in the sentence to xrown and still searched for b, I got puzzling results; here they are:
21 0
22 0
23 0
24 0
25 6
26 6
27 6
28 6
29 6
In the middle of the output, at element 25, I start getting 6s again, as if the text in the sentence either was not matched or was somehow left over from a previous run. This is my first program, so I am a bit clueless about what that might be. Could someone please take a look and educate me?
For those who are interested, here is an implementation of the Knuth-Morris-Pratt algorithm as a CUDA kernel. The idea is that if you have lots of log lines in a file, you can load them as a structured array into GPU memory and do a massively parallel search on your GPU. Below is an example of a kernel that takes needle and haystack variables; here “abc” is the needle, and the haystack is any line of text you can imagine. For my purposes I used a 50,000-line text file (2 MB on disk), which I first loaded as an array into main RAM and then moved into GPU global RAM; from there I simply launched my kernel with N = 50000 threads.
Potentially, if you load a 100 MB file into GPU memory, I envision a benefit for simple entity-extraction tasks such as scanning for IP addresses or browser user agents. This is especially adaptable to hadoop/map-reduce environments, where each MAP can launch a streaming task and offload data to your GPU for processing while keeping your CPUs busy with other tasks.
Here, you can see that the pattern does match:
input file:
49983 abc small brown fox jumped over lazy dog
49984 abc small brown fox jumped over lazy dog
49985 abc small brown fox jumped over lazy dog
49986 small brown fox jumped over lazy dog
49987 small brown fox jumped over lazy dog
49988 small brown fox jumped over lazy dog
49989 small brown fox jumped over lazy dog
49990 small brown fox jumped over lazy dog
49991 abc small brown fox jumped over lazy dog
49992 small brown fox jumped over lazy dog
49993 small brown fox jumped over lazy dog
49994 small brown fox jumped over lazy dog
49995 small brown fox jumped over lazy dog
49996 abc small brown fox jumped over lazy dog
49997 abc small brown fox jumped over lazy dog
49998 abc small brown fox jumped over lazy dog
output:
49993 999
49994 999
49995 999
49996 0
49997 0
49998 0
49999 0
50000 0
50001 999
50002 0
50003 0
50004 0
50005 0
50006 999
50007 999
50008 999
50009 999
50010 999
__global__ void search(char *needle, int *W, int needlelen, struct myline *mylines, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N)   /* was (N < idx), which let idx == N through and read past the array */
        return;

    int i, j, Next[1000];   /* KMP failure table; needlelen must stay below 1000 */

    /* Preprocessing: build the failure table for the needle.
     * (Every thread rebuilds it; it could be precomputed once on the host.)
     * Note: the original hard-coded needlelen = 3 and memcpy'd "abc" into
     * needle here, overwriting the kernel parameters from every thread;
     * the passed-in arguments are used instead. */
    i = 0;
    j = Next[0] = -1;
    while (i < needlelen)
    {
        while (j > -1 && needle[i] != needle[j])
            j = Next[j];
        Next[++i] = ++j;
    }

    int linelen = mylines[idx].length;
    char *haystack = mylines[idx].data;

    /* Searching */
    W[idx] = 0;   /* 0 = no match; was W[idx] = NULL inside the loop, assigning a pointer constant to an int */
    i = j = 0;
    while (j < linelen)
    {
        while (i > -1 && needle[i] != haystack[j])
            i = Next[i];
        i++;
        j++;
        if (i >= needlelen)
        {
            W[idx] = 999;   /* 999 = match found in this line */
            break;
        }
    }
}
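The same preprocessing and search logic can be checked on the CPU before porting it into a kernel. Here is a plain-C version of the per-line KMP search (one call per line instead of one thread per line) that returns 999 on a match and 0 otherwise, matching the kernel's convention; it is a host-side sketch for verifying the logic, not part of the original program.

```c
#include <string.h>

/* Plain-C version of the kernel's per-line KMP search:
 * returns 999 if needle occurs in haystack, 0 otherwise. */
static int kmp_line(const char *needle, int needlelen,
                    const char *haystack, int linelen)
{
    int i, j, Next[1000];   /* needlelen must stay below 1000 */

    /* Build the failure table. */
    i = 0;
    j = Next[0] = -1;
    while (i < needlelen) {
        while (j > -1 && needle[i] != needle[j])
            j = Next[j];
        Next[++i] = ++j;
    }

    /* Scan the line. */
    i = j = 0;
    while (j < linelen) {
        while (i > -1 && needle[i] != haystack[j])
            i = Next[i];
        i++;
        j++;
        if (i >= needlelen)
            return 999;   /* match found */
    }
    return 0;             /* no match */
}
```

Running this over the sample lines gives 999 for the “abc …” lines and 0 for the rest, which is what the kernel should report when its guard and result initialization are correct.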