Any advice on improving performance?


In the following code, I want to implement a search function that looks for a keyword in a given string.

Assume the string T has length 518 and the keyword has length 6. I launch the kernel with 512 threads in a single block (just an experiment), so each thread checks one fixed position, T[tid]. I expected this to speed up the search, but it did not. Can someone give me some advice on this? Thanks in advance.

__global__ void testKernel(char* d_T, char* d_P, int* d_Dist, int* d_flag)

    __shared__ char T[518];
    __shared__ int flag;
    __shared__ char P[6];

    int m = 6;
    const unsigned int bid = blockIdx.x;
    const unsigned int tid = threadIdx.x;

    for (int i = 512; i <= 516; i++)

    T[tid] = d_T[tid]; // the above code transfers data from gmem to shmem

    int i = 0;
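
For reference, here is a minimal, self-contained sketch of what the kernel above seems to be aiming for, under the sizes stated in the post (text length 518, keyword length 6, one block of 512 threads). It does not reconstruct the missing parts of the original code. The names searchKernel, findFirstMatch and d_pos are hypothetical, and reporting the smallest matching position through atomicMin is an assumption, since the post does not show how d_Dist and d_flag are used. Note that 518 - 6 + 1 = 513 start positions exist, one more than the 512 threads, so both the copy and the search use a strided loop.

```cuda
#include <climits>
#include <cuda_runtime.h>

#define TEXT_LEN    518  // length of T, as stated in the post
#define PATTERN_LEN 6    // length of the keyword, as stated in the post
#define NUM_THREADS 512  // one block of 512 threads, as stated in the post

// Hypothetical kernel: lowers *d_pos to the smallest index at which the
// keyword occurs; *d_pos keeps its initial value if there is no match.
__global__ void searchKernel(const char* d_T, const char* d_P, int* d_pos)
{
    __shared__ char T[TEXT_LEN];
    __shared__ char P[PATTERN_LEN];

    const unsigned int tid = threadIdx.x;

    // Strided copy so that all 518 characters are staged, not only the
    // first 512. Consecutive threads read consecutive bytes of d_T.
    for (int i = tid; i < TEXT_LEN; i += blockDim.x)
        T[i] = d_T[i];
    if (tid < PATTERN_LEN)
        P[tid] = d_P[tid];

    // Every thread must see the complete shared arrays before comparing.
    __syncthreads();

    // Each thread tests the start positions tid, tid + 512, ...
    for (int start = tid; start <= TEXT_LEN - PATTERN_LEN; start += blockDim.x) {
        bool match = true;
        for (int k = 0; k < PATTERN_LEN; ++k) {
            if (T[start + k] != P[k]) {
                match = false;
                break;
            }
        }
        if (match)
            atomicMin(d_pos, start);  // keep the first occurrence
    }
}

// Hypothetical host helper: text must hold at least TEXT_LEN bytes and
// pattern at least PATTERN_LEN bytes. Returns the first match index or -1.
// Error checking of the CUDA calls is omitted to keep the sketch short.
int findFirstMatch(const char* text, const char* pattern)
{
    char* d_T = nullptr;
    char* d_P = nullptr;
    int*  d_pos = nullptr;
    int   pos = INT_MAX;  // sentinel meaning "not found"

    cudaMalloc((void**)&d_T, TEXT_LEN);
    cudaMalloc((void**)&d_P, PATTERN_LEN);
    cudaMalloc((void**)&d_pos, sizeof(int));

    cudaMemcpy(d_T, text, TEXT_LEN, cudaMemcpyHostToDevice);
    cudaMemcpy(d_P, pattern, PATTERN_LEN, cudaMemcpyHostToDevice);
    cudaMemcpy(d_pos, &pos, sizeof(int), cudaMemcpyHostToDevice);

    searchKernel<<<1, NUM_THREADS>>>(d_T, d_P, d_pos);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&pos, d_pos, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_T);
    cudaFree(d_P);
    cudaFree(d_pos);

    return pos == INT_MAX ? -1 : pos;
}
```

Calling findFirstMatch(text, "needle") on a 518-byte buffer returns the index of the first occurrence, or -1 if the keyword is absent. With an input this small, the launch and copy overhead is likely to dominate the run time, so this sketch is meant to show the structure and not to promise a speedup.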

1: Is it necessary to load the data into shared memory?

2: Are the global memory accesses coalesced?