__global__ void PcreMatch(int* pcreTbl, int* pcreTblSize, char* packet, PacketInfo* pkInfo, bool* pcreRes)
{
    unsigned char ch = 0;
    bool matched = false;
    int st = 0, stBase = 0, size = 0;
    char *base = 0;

    /* Three consecutive threads share one packet */
    base = packet + pkInfo[threadIdx.x / 3].base;
    size = pkInfo[threadIdx.x / 3].size;

    /* Base case */
    stBase = pcreTblSize[blockIdx.x + threadIdx.x % 3];
    st = stBase;

    for (int i = 0; i < size; i++) {
        ch = (unsigned char) base[i];
        /* next state */
        st = stBase + pcreTbl[st * CSIZE + ch];
        if (st < 0)
        {
            /* when st == -1 : dead state
               when st == -2 : accept state */
            if (st == -2)
                matched = true;
            break;
        }
    }
    pcreRes[threadIdx.x + blockIdx.x * blockDim.x] = matched;
}
Above is my code. I am writing a regular expression matcher. I build a DFA and upload it to the GPU as pcreTbl.
I optimized my code, but I don't understand why the profiler gives me such bad numbers.
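To make the table layout concrete, here is a minimal host-side sketch of the same walk in plain C++, which could be used to check the kernel's results on the CPU. It is only an illustration: `CSIZE = 256`, the helper names `dfaMatchHost` and `makeAbTable`, and the tiny two-state table are assumptions made up for this example, not the real pcreTbl.

```cpp
#include <cstddef>
#include <vector>

// Assumption: CSIZE is the alphabet size (one column per byte value).
constexpr int CSIZE = 256;

// Host-side mirror of the per-thread loop in PcreMatch.
// tbl holds one row of CSIZE entries per state; an entry is the next state
// relative to stBase, or a negative sentinel (-1 dead, -2 accept).
bool dfaMatchHost(const std::vector<int>& tbl, int stBase,
                  const char* data, int size)
{
    int st = stBase;
    for (int i = 0; i < size; i++) {
        unsigned char ch = static_cast<unsigned char>(data[i]);
        st = stBase + tbl[static_cast<std::size_t>(st) * CSIZE + ch];
        if (st < 0)
            return st == -2;   // -2: accept state, -1: dead state
    }
    return false;              // input exhausted without reaching accept
}

// Hypothetical two-state table that accepts any input containing "ab".
std::vector<int> makeAbTable()
{
    std::vector<int> tbl(2 * CSIZE, 0); // default: fall back to state 0
    tbl[0 * CSIZE + 'a'] = 1;           // state 0 --a--> state 1
    tbl[1 * CSIZE + 'a'] = 1;           // state 1 --a--> state 1
    tbl[1 * CSIZE + 'b'] = -2;          // state 1 --b--> accept
    return tbl;
}
```

With this toy table and `stBase = 0`, calling `dfaMatchHost(makeAbTable(), 0, "xxab", 4)` walks states 0, 0, 1 and then hits the accept sentinel, which is the same sequence of dependent table lookups each GPU thread performs.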
Achieved Instruction Per Byte Ratio: 132.19 ( Balanced Instruction Per Byte Ratio: 4.16 )
Achieved Occupancy: 0.80 ( Theoretical Occupancy: 1.00 )
IPC: 1.78 ( Maximum IPC: 2 )
Achieved global memory throughput: 0.99 ( Peak global memory throughput(GB/s): 192.38 )
I don't know why the instruction-per-byte ratio is so high and the global memory throughput is so low.
Is the computation part the bottleneck?
Could you help me? :(