Hello, I have a performance problem with the kernel below.
I’m working on a Tesla C870 (compute capability 1.0), with MAX_XBLOCK = MAX_YBLOCK = 8.
For example, numVectors=2000, numNeurons=32 and dim=3205.
__global__
void updateUKE(float *d_examples, float *d_somItems, float *d_tmpD, float *d_tmpD1, float rr1,
               int numVectors, int numNeurons, int dim)
{
    int x = blockIdx.x*MAX_XBLOCK + threadIdx.x;
    int y = blockIdx.y*MAX_YBLOCK + threadIdx.y;
    if ((x < numVectors) && (y < numNeurons)) {
        int j;
        float auxDist=0.0, temp=0.0;
        for (j=0; j<dim; j++) {
            auxDist += powf(d_examples[x*dim+j] - d_somItems[y*dim+j], 2);
        }
        d_tmpD[x*numNeurons+y] = auxDist;
        d_tmpD1[x*numNeurons+y] = -auxDist/rr1;
    } // if (x<numVectors)&&(y<numNeurons)
    __syncthreads();
}
The problem is that this kernel is very slow: it takes 0.4 s. But if I replace the lines
d_tmpD[x*numNeurons+y]=auxDist;
d_tmpD1[x*numNeurons+y]=-auxDist/rr1;
with
d_tmpD[x*numNeurons+y]=temp;
d_tmpD1[x*numNeurons+y]=temp;
the time drops to 0.01 s. What is the difference between the registers auxDist and temp? auxDist is a register that each thread computes before the assignment d_tmpD[…]=auxDist. I don’t understand this. Can somebody help me?
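To make the comparison reproducible, here is a minimal sketch of how the two variants could be timed side by side with CUDA events. It is not the original host code: the template parameter STORE_DIST, the helper timeKernelMs, the value rr1 = 1.0f and the zero-filled inputs are all assumptions made for illustration, and the trailing __syncthreads() is left out because nothing follows it. One thing worth checking with it is whether the compiler drops the whole loop when auxDist is never stored (dead-code elimination), in which case the second variant would not be doing the distance computation at all.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_XBLOCK 8
#define MAX_YBLOCK 8

// Same arithmetic as the posted kernel. STORE_DIST (hypothetical) selects
// at compile time whether auxDist or the untouched temp is written out.
template <bool STORE_DIST>
__global__ void updateUKE(const float *d_examples, const float *d_somItems,
                          float *d_tmpD, float *d_tmpD1, float rr1,
                          int numVectors, int numNeurons, int dim)
{
    int x = blockIdx.x * MAX_XBLOCK + threadIdx.x;
    int y = blockIdx.y * MAX_YBLOCK + threadIdx.y;
    if (x < numVectors && y < numNeurons) {
        float auxDist = 0.0f, temp = 0.0f;
        for (int j = 0; j < dim; j++) {
            auxDist += powf(d_examples[x * dim + j] - d_somItems[y * dim + j], 2.0f);
        }
        d_tmpD[x * numNeurons + y]  = STORE_DIST ? auxDist : temp;
        d_tmpD1[x * numNeurons + y] = STORE_DIST ? -auxDist / rr1 : temp;
    }
}

// Hypothetical helper: one untimed warm-up launch, then one launch
// bracketed by CUDA events. Returns the elapsed time in milliseconds.
template <bool STORE_DIST>
float timeKernelMs(dim3 grid, dim3 block, const float *ex, const float *som,
                   float *tmpD, float *tmpD1, float rr1,
                   int numVectors, int numNeurons, int dim)
{
    updateUKE<STORE_DIST><<<grid, block>>>(ex, som, tmpD, tmpD1, rr1,
                                           numVectors, numNeurons, dim);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    updateUKE<STORE_DIST><<<grid, block>>>(ex, som, tmpD, tmpD1, rr1,
                                           numVectors, numNeurons, dim);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int numVectors = 2000, numNeurons = 32, dim = 3205;
    const float rr1 = 1.0f;  // assumed value; rr1 is not given in the post

    size_t exBytes  = (size_t)numVectors * dim * sizeof(float);
    size_t somBytes = (size_t)numNeurons * dim * sizeof(float);
    size_t outBytes = (size_t)numVectors * numNeurons * sizeof(float);

    float *d_examples, *d_somItems, *d_tmpD, *d_tmpD1;
    cudaMalloc((void **)&d_examples, exBytes);
    cudaMalloc((void **)&d_somItems, somBytes);
    cudaMalloc((void **)&d_tmpD, outBytes);
    cudaMalloc((void **)&d_tmpD1, outBytes);
    cudaMemset(d_examples, 0, exBytes);   // zero-filled placeholder inputs
    cudaMemset(d_somItems, 0, somBytes);

    // x covers the vectors and y covers the neurons, as in the kernel.
    dim3 block(MAX_XBLOCK, MAX_YBLOCK);
    dim3 grid((numVectors + MAX_XBLOCK - 1) / MAX_XBLOCK,
              (numNeurons + MAX_YBLOCK - 1) / MAX_YBLOCK);

    float msDist = timeKernelMs<true>(grid, block, d_examples, d_somItems,
                                      d_tmpD, d_tmpD1, rr1,
                                      numVectors, numNeurons, dim);
    float msTemp = timeKernelMs<false>(grid, block, d_examples, d_somItems,
                                       d_tmpD, d_tmpD1, rr1,
                                       numVectors, numNeurons, dim);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    }
    printf("storing auxDist: %.3f ms\n", msDist);
    printf("storing temp:    %.3f ms\n", msTemp);

    cudaFree(d_examples);
    cudaFree(d_somItems);
    cudaFree(d_tmpD);
    cudaFree(d_tmpD1);
    return 0;
}
```

If the two timings differ as much as described above, compiling with nvcc -ptx and comparing the generated code for the two instantiations should show whether the loop is still present in the temp variant.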
Thanks, Francisco