Instruction issued counter

__global__ void PcreMatch(int* pcreTbl, int* pcreTblSize, char* packet, PacketInfo* pkInfo, bool* pcreRes)
{
    unsigned char ch = 0;
    bool matched = false;
    int st = 0, stBase = 0, size = 0;
    char *base = 0;

    /* Three consecutive threads share one packet */
    base = packet + pkInfo[threadIdx.x / 3].base;
    size = pkInfo[threadIdx.x / 3].size;

    /* Base state of the DFA this thread walks */
    stBase = pcreTblSize[blockIdx.x + threadIdx.x % 3];
    st = stBase;

    for (int i = 0; i < size; i++) {
        ch = (unsigned char) base[i];
        /* Follow the transition for this byte */
        st = stBase + pcreTbl[st * CSIZE + ch];

        if (st < 0) {
            /* when st == -1 : dead state
               when st == -2 : accept state */
            if (st == -2)
                matched = true;
            break;
        }
    }

    /* One result per thread */
    pcreRes[threadIdx.x + blockIdx.x * blockDim.x] = matched;
}

Above is my code. I am writing a regular expression matcher: I build a DFA and upload it to the GPU as pcreTbl.
I have optimized the code, but I don't understand why the profiler reports such poor numbers.
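
For comparison with the kernel's output, the same table walk can be run on the CPU. The following is only a sketch under assumptions: CSIZE is taken to be 256 (the post does not show its value), and matchOnHost and buildAbTable are made-up names for illustration. It is plain host-side C++ that also compiles inside a .cu file.

```cpp
#include <cstddef>
#include <vector>

// Assumption: one column per byte value. The real CSIZE is not shown in the post.
constexpr int CSIZE = 256;
constexpr int DEAD_STATE = -1;    // value taken from the kernel comment
constexpr int ACCEPT_STATE = -2;  // value taken from the kernel comment

// Hypothetical helper: walks one DFA over one buffer with the same steps as
// the kernel loop (add stBase to the table entry, then test for a negative state).
bool matchOnHost(const std::vector<int>& table, int stBase,
                 const unsigned char* data, std::size_t size) {
    int st = stBase;
    for (std::size_t i = 0; i < size; ++i) {
        st = stBase + table[static_cast<std::size_t>(st) * CSIZE + data[i]];
        if (st < 0) {
            return st == ACCEPT_STATE;  // -2 accepts, -1 is the dead state
        }
    }
    return false;  // input ended before an accept state was reached
}

// Hypothetical two-state table that accepts input starting with "ab".
std::vector<int> buildAbTable() {
    std::vector<int> table(2 * CSIZE, DEAD_STATE);
    table[0 * CSIZE + 'a'] = 1;             // state 0 --a--> state 1
    table[1 * CSIZE + 'b'] = ACCEPT_STATE;  // state 1 --b--> accept
    return table;
}
```

Running the kernel and this reference on the same packets and comparing pcreRes against the host results is one way to confirm that an optimization has not changed the matching behaviour.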

Achieved Instruction Per Byte Ratio: 132.19 ( Balanced Instruction Per Byte Ratio: 4.16 )
Achieved Occupancy: 0.80 ( Theoretical Occupancy: 1.00 )
IPC: 1.78 ( Maximum IPC: 2 )
Achieved global memory throughput: 0.99 ( Peak global memory throughput(GB/s): 192.38 )

I don’t understand why the instruction-per-byte ratio is so high and the global memory throughput is so low.
Is computation the bottleneck?
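
To put the numbers side by side, here is the plain arithmetic on the profiler figures quoted above. It assumes the achieved throughput is reported in GB/s like the peak value, which the output does not state explicitly.

```latex
\begin{align*}
\frac{\text{achieved instructions per byte}}{\text{balanced instructions per byte}}
  &= \frac{132.19}{4.16} \approx 31.8 \\
\frac{\text{achieved throughput}}{\text{peak throughput}}
  &= \frac{0.99}{192.38} \approx 0.5\% \\
\frac{\text{IPC}}{\text{maximum IPC}}
  &= \frac{1.78}{2} = 0.89
\end{align*}
```

Taken at face value, the kernel issues roughly 32 times more instructions per byte of global memory traffic than the balanced ratio, uses about 0.5% of peak memory throughput, and runs at 89% of the maximum IPC.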

Could you help me? :(

Could you give me the configuration of your grid and blocks?
The global memory throughput is quite low.
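
The configuration matters because the kernel's index expressions decide which packet and which table slot each thread touches. As a rough host-side sketch (plain C++ that also compiles in a .cu file; ThreadWork and workFor are made-up names for illustration), the following reproduces those expressions so a given grid and block size can be checked by hand:

```cpp
// Hypothetical record of the three indices one thread uses in PcreMatch.
struct ThreadWork {
    int packetIdx;  // index into pkInfo
    int baseSlot;   // index into pcreTblSize
    int resultIdx;  // index into pcreRes
};

// Mirrors the index expressions in the kernel for a 1-D grid of 1-D blocks.
ThreadWork workFor(int blockIdxX, int threadIdxX, int blockDimX) {
    ThreadWork w;
    w.packetIdx = threadIdxX / 3;                       // pkInfo[threadIdx.x / 3]
    w.baseSlot  = blockIdxX + threadIdxX % 3;           // pcreTblSize[blockIdx.x + threadIdx.x % 3]
    w.resultIdx = threadIdxX + blockIdxX * blockDimX;   // pcreRes[threadIdx.x + blockIdx.x * blockDim.x]
    return w;
}
```

With these expressions pkInfo is indexed only by threadIdx.x / 3, so every block reads the same blockDim.x / 3 packets, and blockIdx.x only shifts the pcreTblSize slot and the output position.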