String matching optimization

Hi all,

Some time ago I developed a binary string matching program in CUDA. It works fine, but I would like to know whether the code can be optimized further.

class TCmpLookUpCuda4G
{
public:
    typedef unsigned char TByte;

    TCmpLookUpCuda4G();
    ~TCmpLookUpCuda4G();

    void Init(TGCmpLookUp<unsigned int, bit_array<__int64> > &cl);

    // Unary functor that tests a DNF minterm
    class DNFFunctor4 : public thrust::unary_function<int, int>
    {
    public:
        // constructor callable only on the host
        DNFFunctor4()
        {
            Table = NULL;

            // upload a device-side table of single-bit masks: m8[k] == 1 << k
            unsigned char m8[8];
            cudaMalloc((void**)&mask8, 8);
            GLBNEW(mask8, "mask8");
            for (int k = 0; k < 8; ++k)
                m8[k] = (unsigned char)(1 << k);
            cudaMemcpy(mask8, m8, 8, cudaMemcpyHostToDevice);
        }

        ~DNFFunctor4()
        {
            cudaFree(mask8);
            GLBDELETE(mask8);
        }

        void Reset()
        {
            cPos.Reset();
        }

        unsigned int Mul;

        // global data accessible from the device
        unsigned char *Table;
        unsigned char *mask8;
        unsigned char *BasePtr;
        TVPosCuda      cPos;

        //----------------------------- MAIN ROUTINE -----------------------------
        __device__
        inline int operator()(unsigned int i)
        {
            unsigned char b;
            unsigned int  ind;

            i *= Mul;
            // assemble a 32-bit word from four consecutive bytes (little endian);
            // the cast keeps the <<24 shift in unsigned arithmetic
            ind = BasePtr[i] + (BasePtr[i+1] << 8) + (BasePtr[i+2] << 16)
                + ((unsigned int)BasePtr[i+3] << 24);
            // test the bit of the lookup table that corresponds to this word
            b = Table[ind >> 3];
            if (b & mask8[ind & 7])
            {
                cPos.AddPos(i);
                return 1;
            }
            else
                return 0;
        }
    };

    // return the first matching step
    bool Check(unsigned char *BasePtr, int begin, int end);

    void GetLastPos(unsigned int *vPos, int MaxSize, unsigned int *Size, bool *PossibleExceed)
    {
        Fnc.cPos.GetLastPos(vPos, MaxSize, Size, PossibleExceed);
    }

    int GetWordStep();

private:
    int         WordStep;
    DNFFunctor4 Fnc;
};

The class TCmpLookUpCuda4G has an operator() that CUDA calls in a simple for_each loop.

The operator searches the binary string “BasePtr” for a matching word, using a lookup table “Table”.

“Table” is a boolean vector implemented as a byte vector in which each bit is the boolean value for one 32-bit word index, so Table holds 4 Gbits = 512 MBytes.

//1
i *= Mul;

//2
ind = BasePtr[i] + (BasePtr[i+1]<<8) + (BasePtr[i+2]<<16) + (BasePtr[i+3]<<24);

//3
b = Table[ind>>3];

//4
if (b & mask8[ind&7])

The algorithm has a complexity of O(M/N): it checks the “BasePtr” string for a matching word only every “Mul” steps.

At point //2 the word is assembled from BasePtr, and a CUDA expert can probably suggest how to optimize this line.

At point //3 I access a byte of the boolean vector, and with the bit mask “mask8” at //4 I test the correct bit. Here I could probably gain a little by not keeping mask8 in global memory, since it only needs read-only access.

Are there other optimization possibilities?

I am using global memory for all the vectors, but only BasePtr is a read/write vector; “Table” and “mask8” are read-only, so it should be possible to speed things up by using another type of memory for them.

Thank you in advance for your suggestions.