Any advice on improving performance?

Hi, guys.

In the following code, I want to implement a search function that looks for a keyword in a given string.

Assume the string T has length 518 and the keyword has length 6. I launch the kernel with 512 threads and only one block (just an experiment), so each thread searches from a fixed location T[tid]. I thought this would speed up the search, but it did not. Can someone give me some advice on this? Thanks in advance.

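__global__ void
testKernel(char* d_T, char* d_P, int* d_Dist, int* d_flag)
{
    __shared__ char T[518];   // text to search
    __shared__ int  flag;     // set to 1 if any thread finds a match
    __shared__ char P[6];     // keyword
    int m = 6;                // keyword length
    const unsigned int bid = blockIdx.x;   // unused: only one block is launched
    const unsigned int tid = threadIdx.x;

    // Transfer data from global memory to shared memory.
    if (tid == 0)
    {
        flag = 0;
        // Thread 0 also loads the tail that the 512 threads do not cover.
        for (int i = 512; i <= 516; i++)
            T[i] = d_T[i];
    }
    if (tid < m)
        P[tid] = d_P[tid];
    T[tid] = d_T[tid];

    __syncthreads();

    // Each thread compares the keyword against the text starting at T[tid].
    int i = 0;
    for (; i < m && P[i] == T[tid + i]; i++);
    if (i == m)
        flag = 1;

    __syncthreads();

    // Thread 0 writes the result back to global memory.
    if (tid == 0)
        if (flag != 0)
            *d_flag = flag;
}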
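
For anyone who wants to reproduce the timing, here is a minimal host-side sketch of how a kernel like this could be launched and measured with CUDA events. It is not the original host code: searchKernel is a simplified stand-in for testKernel, searchOnGpu and the TEXT_LEN, PAT_LEN and THREADS macros are hypothetical names, and the strided loops (which also cover T[517] and start position 512) are an assumption about the intended behavior. Error checking is kept to a single cudaGetLastError call for brevity.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

#define TEXT_LEN 518   // length of T, as in the post
#define PAT_LEN  6     // length of the keyword, as in the post
#define THREADS  512   // one block of 512 threads, as in the post

// Simplified stand-in for testKernel (assumption, not the original code):
// each thread checks the start positions tid, tid + blockDim.x, ...
__global__ void searchKernel(const char* d_T, const char* d_P, int* d_flag)
{
    __shared__ char T[TEXT_LEN];
    __shared__ char P[PAT_LEN];
    const unsigned int tid = threadIdx.x;

    // Cooperative strided copy so that every byte of the text is loaded.
    for (int i = tid; i < TEXT_LEN; i += blockDim.x)
        T[i] = d_T[i];
    if (tid < PAT_LEN)
        P[tid] = d_P[tid];
    __syncthreads();

    // Strided loop over start positions, so position 512 is covered too.
    for (int start = tid; start + PAT_LEN <= TEXT_LEN; start += blockDim.x) {
        int i = 0;
        while (i < PAT_LEN && P[i] == T[start + i])
            i++;
        if (i == PAT_LEN)
            *d_flag = 1;   // every writer stores the same value
    }
}

// Hypothetical host helper. `text` must hold at least TEXT_LEN bytes and
// `pat` at least PAT_LEN bytes. Returns 1 if the keyword was found, 0 if
// not, and -1 on a CUDA error. The kernel time in ms is stored in *elapsedMs.
int searchOnGpu(const char* text, const char* pat, float* elapsedMs)
{
    char* d_T = NULL;
    char* d_P = NULL;
    int*  d_flag = NULL;
    int   h_flag = 0;
    cudaEvent_t start, stop;

    cudaMalloc((void**)&d_T, TEXT_LEN);
    cudaMalloc((void**)&d_P, PAT_LEN);
    cudaMalloc((void**)&d_flag, sizeof(int));
    cudaMemcpy(d_T, text, TEXT_LEN, cudaMemcpyHostToDevice);
    cudaMemcpy(d_P, pat, PAT_LEN, cudaMemcpyHostToDevice);
    cudaMemset(d_flag, 0, sizeof(int));

    // Time only the kernel, not the transfers.
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    searchKernel<<<1, THREADS>>>(d_T, d_P, d_flag);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(elapsedMs, start, stop);

    cudaError_t err = cudaGetLastError();
    cudaMemcpy(&h_flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_T);
    cudaFree(d_P);
    cudaFree(d_flag);

    return (err == cudaSuccess) ? h_flag : -1;
}
```

Calling searchOnGpu(text, "needle", &ms) on a 518-byte buffer gives the match flag and the kernel time in milliseconds. With an input this small, the measured time is probably dominated by launch overhead rather than by the comparison loop, so a much larger text may be needed to see any difference between variants.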

Edit:

1: Is it necessary to load the data into shared memory?

2: Are the global memory accesses coalesced?