Help with coalescing: my program runs too slowly

Hi guys :D
I am trying to use CUDA to make my computation as fast as possible,
but my CUDA program turns out to be too slow.
Someone on the forum discussed coalescing, but so far I don't know how to use it. :'(
Please share your ideas.

Kernel launch:
dim3 dimGrid(38, 1);
dim3 dimBlock(352, 1, 1);
Subtraction<<<dimGrid, dimBlock>>>(templateCol, templateRow, sourceCol, sourceGpu, templateGpu, positionGpu, targetGpu);

And this is the kernel function:

__global__ void Subtraction(int templateX, int templateY, long sourceX, unsigned char *sourceF, unsigned char *templateF, int *positionData, unsigned char *targetF)
{
    int bx = blockIdx.x;
    int tx = threadIdx.x;
    if (bx > 7 && bx != 11 && bx != 15 && bx != 21 && bx != 31 && bx != 35 && tx <= 348)
    {
        int x = positionData[bx * 2];
        int y = positionData[bx * 2 + 1];
        int start = x + (y + 1) * sourceX;
        for (int i = 0; i < templateY; i++)
        {
            targetF[i * sourceX + start + tx] = 255 - abs(templateF[tx + i * templateX] - sourceF[i * sourceX + start + tx]);
            __syncthreads();
        }
    }
}

When I run it through the CUDA Visual Profiler, I get:
gld_coherent =0;
gld_incoherent =2884136;
gst_coherent=0;
gst_incoherent=2881344;

The reason all of your reads are uncoalesced is that you are reading from unsigned char arrays. Coalescing is only possible when reading from arrays where the data element size is 32, 64 or 128 bits.

Thank you for your reply, SEIBER.

So that means coalescing only occurs for 32-, 64-, or 128-bit types.
If I use unsigned char, how can I get coalesced access with this method?
Should I make a 32-bit struct, even though I would use only 8 of its 32 bits?

I really don't know how to solve this problem, so please help me.
Thank you very much.

You can read four (or eight) chars per thread at a time by having each thread read a struct like

struct __align__(4) uint8_4 {
    uint8_t a, b, c, d;
};

into shared memory. After this you can do a __syncthreads() and process one char per thread in shared memory, then another sync and write back similarly.
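To make the steps above concrete, here is a rough sketch of one way to read that suggestion, not a drop-in fix for the Subtraction kernel. It assumes flat arrays whose length is a multiple of 4 and pointers that come straight from cudaMalloc (so they are suitably aligned). The names absDiffStaged, countMismatches and BLOCK are made up for this example. The first quarter of the threads move the 4-byte structs, and every thread then handles one char in shared memory.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// The 4-byte struct from the post; its alignment lets one load move four chars.
struct __align__(4) uint8_4 {
    uint8_t a, b, c, d;
};

#define BLOCK 256  // hypothetical block size: one thread per char in the tile

// out[i] = 255 - |a[i] - b[i]|, staged through shared memory.
__global__ void absDiffStaged(const uint8_4 *a, const uint8_4 *b,
                              uint8_4 *out, int nWords)
{
    __shared__ uint8_4 sa[BLOCK / 4];
    __shared__ uint8_4 sb[BLOCK / 4];

    int tx = threadIdx.x;
    int word = blockIdx.x * (BLOCK / 4) + tx;  // 4-byte word this thread may move
    bool mover = (tx < BLOCK / 4) && (word < nWords);

    // Step 1: the first BLOCK/4 threads each load one 4-byte struct.
    if (mover) {
        sa[tx] = a[word];
        sb[tx] = b[word];
    }
    __syncthreads();  // outside the branch, so every thread reaches it

    // Step 2: every thread handles one char of the tile in shared memory.
    // In a partial last tile the unused chars are computed but never stored.
    uint8_t *ca = reinterpret_cast<uint8_t *>(sa);
    const uint8_t *cb = reinterpret_cast<const uint8_t *>(sb);
    ca[tx] = (uint8_t)(255 - abs((int)ca[tx] - (int)cb[tx]));
    __syncthreads();

    // Step 3: the same threads write the results back, four chars at a time.
    if (mover)
        out[word] = sa[tx];
}

// Hypothetical host helper: runs the kernel on n chars (n a multiple of 4)
// and returns how many outputs differ from a CPU reference.
int countMismatches(int n)
{
    std::vector<uint8_t> ha(n), hb(n), hout(n);
    for (int i = 0; i < n; ++i) {
        ha[i] = (uint8_t)(i * 7);
        hb[i] = (uint8_t)(i * 13 + 5);
    }

    uint8_t *da, *db, *dout;
    cudaMalloc((void **)&da, n);
    cudaMalloc((void **)&db, n);
    cudaMalloc((void **)&dout, n);
    cudaMemcpy(da, ha.data(), n, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), n, cudaMemcpyHostToDevice);

    int blocks = (n + BLOCK - 1) / BLOCK;
    absDiffStaged<<<blocks, BLOCK>>>((const uint8_4 *)da, (const uint8_4 *)db,
                                     (uint8_4 *)dout, n / 4);
    cudaMemcpy(hout.data(), dout, n, cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dout);

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (hout[i] != 255 - abs((int)ha[i] - (int)hb[i])) ++bad;
    return bad;
}

int main()
{
    printf("mismatches: %d\n", countMismatches(1000));
    return 0;
}
```

The built-in uchar4 type could probably stand in for the hand-written struct. Applying this to the original kernel would also need each row's start offset to be a multiple of 4 bytes, which the posted code does not guarantee.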

First, unsigned char is only 1 byte, which is not a suitable size to read; it will be rather slow. Short int is not suitable either.

As for so-called "coalescing", it means thread 0 reads element x, thread 1 reads element (x + 1), and so on: adjacent threads read adjacent elements, like the teeth of a comb.

Code like this is commonly used:

for (int i = threadIdx.x; i < ??; i += blockDim.x)

It uses coalescing, in my opinion.
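As a small illustration of that loop, here is a sketch in which each block owns one row of a float matrix and its threads stride across the columns. It uses float so that each element is 32 bits wide. The kernel and helper names (addOnePerRow, countWrong) and the sizes are invented for the example, and the row length is assumed to keep every row start aligned.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each block owns one row; thread t touches columns t, t + blockDim.x, ...
__global__ void addOnePerRow(float *data, int cols)
{
    float *row = data + blockIdx.x * cols;
    for (int i = threadIdx.x; i < cols; i += blockDim.x)
        row[i] += 1.0f;  // neighbouring threads touch neighbouring floats
}

// Hypothetical host helper: fills a rows x cols matrix with 0, 1, 2, ...,
// runs the kernel, and returns how many elements were not incremented by one.
int countWrong(int rows, int cols)
{
    std::vector<float> h((size_t)rows * cols);
    for (size_t k = 0; k < h.size(); ++k)
        h[k] = (float)k;

    float *d;
    size_t bytes = h.size() * sizeof(float);
    cudaMalloc((void **)&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    addOnePerRow<<<rows, 128>>>(d, cols);  // 128 threads per block, one block per row

    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);

    int bad = 0;
    for (size_t k = 0; k < h.size(); ++k)
        if (h[k] != (float)k + 1.0f) ++bad;
    return bad;
}

int main()
{
    printf("wrong elements: %d\n", countWrong(8, 1024));
    return 0;
}
```

On each pass of the loop, thread 0 handles element i, thread 1 handles element i + 1, and so on, which is the "comb" pattern described above.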

When you have time, read more of the examples provided by NVIDIA.

Thank you, WUMPUS and WAKENSKY :D

Your ideas will be helpful for me.
I am trying them now. :)