Hi all,
time ago I developed a binary string matching program on cuda it work fine but I want to know if it is possible to optimize code.
class TCmpLookUpCuda4G
{
public:
typedef unsigned char TByte;
TCmpLookUpCuda4G();
~TCmpLookUpCuda4G();
void Init(TGCmpLookUp<unsigned int,bit_array<__int64> > &cl);
//Unary functor to manage a DNF minterm
class DNFFunctor4 : public thrust::unary_function<int,int>
{
public:
//constructor callable only in Host
DNFFunctor4()
{
Table=NULL;
unsigned char m8[8];
cudaMalloc((void**)&mask8,8);
GLBNEW(mask8,"mask8");
m8[0]=1;
m8[1]=2;
m8[2]=4;
m8[3]=8;
m8[4]=16;
m8[5]=32;
m8[6]=64;
m8[7]=128;
cudaMemcpy(mask8, m8, 8,cudaMemcpyHostToDevice);
}
~DNFFunctor4()
{
cudaFree(mask8);
GLBDELETE(mask8);
}
void Reset()
{
cPos.Reset();
}
unsigned int Mul;
//global data accessible into the device
unsigned char *Table;
unsigned char *mask8;
unsigned char *BasePtr;
TVPosCuda cPos;
//-----------------------------MAIN ROUTINE--------------------------------------
__device__
inline int operator()(unsigned int i)
{
unsigned char b;
unsigned int ind;
i*=Mul;
ind=BasePtr[i]+(BasePtr[i+1]<<8)+(BasePtr[i+2]<<16)+(BasePtr[i+3]<<24);
b=Table[ind>>3];
if (b&mask8[ind&7])
{
cPos.AddPos(i);
return 1;
}
else
return 0;
}
};
//return the first matching step
bool Check(unsigned char *BasePtr,int begin,int end);
void GetLastPos(unsigned int *vPos,int MaxSize,unsigned int *Size,bool *PossibleExceed)
{
Fnc.cPos.GetLastPos(vPos,MaxSize,Size,PossibleExceed);
}
int GetWordStep();
private:
int WordStep;
DNFFunctor4 Fnc;
};
The class TCmpLookUpCuda4G has an operator() called in a simple for_each loop by Cuda .
The operator function search in the binary string “BasePtr” if there is a matching word using a lookup table “Table”.
The “Table” is a boolean vector implemented using bytes vector where each bit is the boolean value for a word index of 32 bit size so Table is a vector of 4Gbits ~= 500MBytes.
//1
i*=Mul;
//2
ind=BasePtr[i]+(BasePtr[i+1]<<8)+(BasePtr[i+2]<<16)+(BasePtr[i+3]<<24);
//3
b=Table[ind>>3];
//4
if (b&mask8[ind&7])
The algorithm use a complexity of O(M/N) so it search only every “Mul” steps in the “BasePtr” string if there is a matching word.
In the point //2 construct the word from the BasePtr and probably a Cuda expert can suggest how to optimize this row .
In the point //3 I access the byte of the boolean vector and using a bit mask “mask8” in //4 I can watch the correct bit. Here probably I can optimize a little bit using mask8 not in a global memory because I need a readonly memory.
Are there other optimization possibilities?
I am using global memory for all the vectors but only the BasePtr is read/write vector , the “Table” and the “mask8” are readonly memory so here is possible to speedup using other type of memory .
Thank you in advance for your suggestions