Parallel brute-force search. Why doesn't this code work?

Hi All!

I’m trying to implement a parallel brute-force search.

Main idea: every thread assumes that the word we are searching for exists at position

offset = (blockIdx.x * blockDim.x) + threadIdx.x;

If the word really is there, the thread writes the offset, but only if it is less than the previously found one (so that we find the first occurrence of the word in the text).

But the following code doesn’t work:

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <time.h>
#include <windows.h>

// includes, project
#include <cutil.h>
#include <cuda.h>

#define MEM_SIZE     66559999 // == (((260000-1) * 256) + 255)
#define NUM_BLOCKS   260000
#define NUM_THREADS  256

__device__ int PRESENT(const char *word, const char *text_at_offset, int word_len)
{
	while(word_len--) // compare bytes word_len-1 down to 0
	{
		if(word[word_len] != text_at_offset[word_len]) return 0; // not present
	}
	return 1; // present
}

__global__ void BFWrap(const char *d_word, const int *d_word_len, const char *d_text, int *d_word_index)
{
	const int offset = (blockIdx.x * blockDim.x) + threadIdx.x; // 0 to 66559999

	if(offset > (MEM_SIZE - (*d_word_len))) return; // word can't fit in remainder

	int present = PRESENT(d_word, (d_text + offset), (*d_word_len));

	if(present)
	{
		if(offset < (*d_word_index)) *d_word_index = offset;
	}
}

int search_word(const char *h_word, const char *d_text)
{
	int  h_word_index = MEM_SIZE;
	int  h_word_len   = strlen(h_word);

	char *d_word       = NULL;
	int  *d_word_len   = NULL;
	int  *d_word_index = NULL;

	CUDA_SAFE_CALL(cudaMalloc((void**)&d_word, h_word_len));
	CUDA_SAFE_CALL(cudaMalloc((void**)&d_word_len, sizeof(int)));
	CUDA_SAFE_CALL(cudaMalloc((void**)&d_word_index, sizeof(int)));

	CUDA_SAFE_CALL(cudaMemcpy(d_word, h_word, h_word_len, cudaMemcpyHostToDevice));
	CUDA_SAFE_CALL(cudaMemcpy(d_word_len, &h_word_len, sizeof(int), cudaMemcpyHostToDevice));
	CUDA_SAFE_CALL(cudaMemcpy(d_word_index, &h_word_index, sizeof(int), cudaMemcpyHostToDevice));

	// run parallel search!
	BFWrap<<<NUM_BLOCKS, NUM_THREADS>>>(d_word, d_word_len, d_text, d_word_index);
	CUDA_SAFE_CALL(cudaThreadSynchronize());

	// retrieve word index
	CUDA_SAFE_CALL(cudaMemcpy(&h_word_index, d_word_index, sizeof(int), cudaMemcpyDeviceToHost));

	// free memory
	CUDA_SAFE_CALL(cudaFree(d_word));
	CUDA_SAFE_CALL(cudaFree(d_word_len));
	CUDA_SAFE_CALL(cudaFree(d_word_index));

	return h_word_index;
}

int main(int argc, char** argv)
{
	CUT_DEVICE_INIT(argc, argv);

	FILE *F;

	// load text into device
	char *h_text = NULL;
	char *d_text = NULL;

	CUDA_SAFE_CALL(cudaMallocHost((void**)&h_text, MEM_SIZE)); // pinned memory
	CUDA_SAFE_CALL(cudaMalloc((void**)&d_text, MEM_SIZE));
	CUDA_SAFE_CALL(cudaMemset(d_text, 0, MEM_SIZE));

	F = fopen("in.txt", "rb");
	int textSize = fread(h_text, 1, MEM_SIZE, F);
	fclose(F);

	CUDA_SAFE_CALL(cudaMemcpy(d_text, h_text, textSize, cudaMemcpyHostToDevice));

	// run word(s) search
	unsigned int timer = 0;
	float elapsedTimeInMs = 0.0f;
	CUT_SAFE_CALL(cutCreateTimer(&timer));
	CUT_SAFE_CALL(cutStartTimer(timer));

	char h_word[] = "zzz";
	int  h_word_index = search_word(h_word, d_text);

	CUT_SAFE_CALL(cutStopTimer(timer));
	elapsedTimeInMs = cutGetTimerValue(timer);

	// report results
	printf("\n Word '%s' found at %i", h_word, h_word_index);
	printf("\n Elapsed time: %f seconds\n", elapsedTimeInMs / (float)1000);

	// free memory
	CUDA_SAFE_CALL(cudaFreeHost(h_text));
	CUDA_SAFE_CALL(cudaFree(d_text));

	CUT_EXIT(argc, argv);
}

Any ideas what I’m doing wrong?

Could you please advise how to implement fast search using CUDA?

As far as I can understand from your code: yours is a classic example of a race condition.

The reason is that all threads that find a match try to write to the same memory location. For example: two threads in the same warp both satisfy the condition and are ready to write to the same location at the same time! The output can then differ depending on how the scheduler orders the writes.

Thank you. I know about this race condition :)
But for now I receive MEM_SIZE as the result in all cases.

After decreasing the number of blocks to 60000, I got a positive result! :)
Thus 60000 blocks with 512/256/128/64/32 threads works!
I don’t know why this limit of ~60000 blocks exists :(
Do you know?

P.S. I am searching for one unique long word for now. If I search for a word like ‘aa’, I see the result of the race condition… Maybe I should use shared memory for the result?

Maybe all of the if statements in your code (for doing the comparisons) are causing your warps to diverge and slowing the code down…

Anyway, here’s a (possibly) faster idea for the search…instead of having to check constantly for the presence of the word, why not:

  • Determine the length of the search target word
  • Sum the values of the characters (i.e. “DOG” = 68 + 79 + 71 = 218)
  • Load a block of your string to search (or the whole thing, if it is relatively short) into shared memory
  • Have each thread compute the sum of N bytes, where N is the length of the search word; thread 0 computes the sum of bytes 0 + 1 + 2 in the string, thread 1 computes 1 + 2 + 3, and so on.
  • Each thread saves its result to a memory cell.
  • Search the memory cells (in serial, on the host) for instances of your original sum value – this tells you about the likely locations for your target search string. Once you compile a list of likely targets, you can go back and check those string positions on the host to find where the string actually exists.

Note: I think this method would only be faster if you had a fairly large string to search (and just put chunks in each block of shared memory), and obviously it only works if your search string is 2 characters or longer.
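A minimal CUDA sketch of the prefilter step described above (the kernel and parameter names are hypothetical, and this version uses a plain per-thread loop over global memory rather than the shared-memory staging suggested):

```cuda
// Hypothetical prefilter kernel: each thread sums word_len consecutive
// bytes starting at its own offset and stores the result. The host then
// scans d_sums for entries equal to the target sum and verifies only
// those candidate positions with a real byte-by-byte compare.
__global__ void SumFilter(const unsigned char *d_text, int text_len,
                          int word_len, int *d_sums)
{
    int offset = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (offset > text_len - word_len) return;

    int sum = 0;
    for (int i = 0; i < word_len; i++)
        sum += d_text[offset + i];

    d_sums[offset] = sum;
}
```

On the host you would compute the target sum once (e.g. 'D' + 'O' + 'G' = 218), copy d_sums back, and run memcmp() only at the offsets whose sum matches.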

All CUDA-compatible devices have certain limitations. These can be queried with:

  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  fprintf(stdout, "Max. size of each dim. of a blk  : %d,%d,%d\n",
          prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
  fprintf(stdout, "Max. size of each dim. of a grid : %d,%d,%d\n",
          prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

That’s why you have that limitation of roughly 60,000+ blocks (the real limit is 65,535 per grid dimension).

Hope I’ve answered your question :)

And sorry about the ‘MEM_SIZE’ result, I can’t figure that out! :(

THANKS A LOT!!!

Max. size of each dim. of a blk : 512,512,64

Max. size of each dim. of a grid : 65535,65535,1

So, what are the effective limits?

For example, I heard that I should use 256 threads rather than 512 on my 8800GTX. Why?

  1. Can you please tell me where you found these ‘effective limits’?

  2. Are you asking about the threads per block?

Yes

The effective limits are often set by the size of the memory. All devices can run a grid of up to 65535 × 65535 blocks.

What you could do is make a 2D grid and split your 260000 blocks across two dimensions, e.g. dim3(20000, 13, 1) (and update your index calculation accordingly).
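The 2D-grid idea above might look like this (a sketch, reusing the names from the code posted earlier):

```cuda
// Launch: 20000 * 13 = 260000 blocks, each grid dimension below 65535.
dim3 grid(20000, 13, 1);
BFWrap<<<grid, NUM_THREADS>>>(d_word, d_word_len, d_text, d_word_index);
```

Inside the kernel, the offset calculation then becomes:

```cuda
// Flatten the 2D block index back into a linear block number first,
// then compute the byte offset exactly as before.
const int block  = (blockIdx.y * gridDim.x) + blockIdx.x;
const int offset = (block * blockDim.x) + threadIdx.x;
```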

256 threads per block is beneficial because the 8800GTX can hold at most 768 threads per multiprocessor. If your block size is 512, you will have 512 out of a possible 768 threads per multiprocessor. If your block size is 256, you might have 3 blocks running at the same time per multiprocessor (768 out of a possible 768).

@Reid:
Thanks for the information! :)

How do I avoid this race condition?

I see no sync mechanism in CUDA :(

One possibility is the use of ‘atomic’ operations (sorry, but I haven’t grown experienced enough to use these :( )
For your case you can use:
‘atomicMin’ (it is documented in the “CUDA Programming Guide 2.0”, Appendix C.1.4)
I think this will solve the problem of race conditions…
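Assuming a device with compute capability 1.1 or higher (needed for atomics on global memory), the racy update in BFWrap could be replaced with atomicMin, roughly like this:

```cuda
// Sketch: same kernel as posted above, but the unsynchronized
// "if (offset < *d_word_index) *d_word_index = offset;" is replaced
// by an atomic minimum, so concurrent matches cannot lose updates.
__global__ void BFWrap(const char *d_word, const int *d_word_len,
                       const char *d_text, int *d_word_index)
{
    const int offset = (blockIdx.x * blockDim.x) + threadIdx.x;

    if (offset > (MEM_SIZE - (*d_word_len))) return; // word can't fit in remainder

    if (PRESENT(d_word, d_text + offset, *d_word_len))
        atomicMin(d_word_index, offset); // hardware-serialized minimum
}
```

Compile with -arch=sm_11 (or higher) so nvcc accepts the global-memory atomic.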

__syncthreads()