Performance with memory assignment

fvazquez · March 7, 2009, 6:44am

Hello, I have a problem with this kernel:

I’m working on a Tesla C870 (compute capability 1.0):

(MAX_XBLOCK=MAX_YBLOCK=8)

For example, numVectors=2000, numNeurons=32 and dim=3205.

__global__ void updateUKE(float *d_examples, float *d_somItems, float *d_tmpD, float *d_tmpD1, float rr1,
                          int numVectors, int numNeurons, int dim)
{
    int x = blockIdx.x*MAX_XBLOCK + threadIdx.x;
    int y = blockIdx.y*MAX_YBLOCK + threadIdx.y;

    if ((x < numVectors) && (y < numNeurons)) {
        int j;
        float auxDist=0.0, temp=0.0;

        for (j=0; j<dim; j++) {
            auxDist += powf(d_examples[x*dim+j]-d_somItems[y*dim+j], 2);
        }

        d_tmpD[x*numNeurons+y] = auxDist;
        d_tmpD1[x*numNeurons+y] = -auxDist/rr1;
    } // if (x < numVectors) && (y < numNeurons)

    __syncthreads();
}

The problem is that this kernel is very slow: it takes 0.4 s. But if I replace the lines:

d_tmpD[x*numNeurons+y]=auxDist;
d_tmpD1[x*numNeurons+y]=-auxDist/rr1;

with

d_tmpD[x*numNeurons+y]=temp;
d_tmpD1[x*numNeurons+y]=temp;

the time drops to 0.01 s. What is the difference between the registers auxDist and temp? auxDist is a register that each thread computes before the assignment d_tmpD[…]=auxDist. I don’t understand this. Can somebody help me?

Thanks, Francisco

SPWorley · March 7, 2009, 7:56am

It’s doing exactly what it should.

The optimizer is good.

If you replace the lines with temp, the compiler sees that the computation of auxDist isn’t used and therefore isn’t even needed.
So it removes the auxDist computation entirely, and the kernel runs much faster.
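The effect described above can be reproduced with a small experiment. The following is a minimal sketch, not the original kernel: `sumSquares`, `storeResult`, `storeTemp`, `timeKernel` and `compareStores` are hypothetical names, and the data layout is simplified to one row per thread. One kernel stores the loop result, the other runs the same loop but stores an unrelated `temp`, and both are timed with CUDA events.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the distance loop in updateUKE.
__device__ float sumSquares(const float *row, const float *ref, int dim)
{
    float acc = 0.0f;
    for (int j = 0; j < dim; ++j) {
        float d = row[j] - ref[j];
        acc += d * d;
    }
    return acc;
}

// Stores the loop result, so the loop has to be executed.
__global__ void storeResult(const float *rows, const float *ref, float *out,
                            int n, int dim)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = sumSquares(rows + i * dim, ref, dim);
}

// Runs the same loop but stores an unrelated value. Nothing observable
// depends on the loop, so the compiler is allowed to delete it.
__global__ void storeTemp(const float *rows, const float *ref, float *out,
                          int n, int dim)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float auxDist = sumSquares(rows + i * dim, ref, dim);
        (void)auxDist;          // result is never stored
        float temp = 0.0f;
        out[i] = temp;
    }
}

typedef void (*Kernel)(const float *, const float *, float *, int, int);

// Times a single launch with CUDA events and returns milliseconds.
float timeKernel(Kernel k, const float *rows, const float *ref, float *out,
                 int n, int dim)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int block = 128;
    int grid = (n + block - 1) / block;
    cudaEventRecord(start);
    k<<<grid, block>>>(rows, ref, out, n, dim);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Allocates zero-filled inputs, times both kernels and reports the timings.
void compareStores(int n, int dim, float *msResult, float *msTemp)
{
    float *rows, *ref, *out;
    cudaMalloc(&rows, sizeof(float) * n * dim);
    cudaMalloc(&ref, sizeof(float) * dim);
    cudaMalloc(&out, sizeof(float) * n);
    cudaMemset(rows, 0, sizeof(float) * n * dim);
    cudaMemset(ref, 0, sizeof(float) * dim);

    timeKernel(storeResult, rows, ref, out, n, dim);   // warm-up launch
    *msResult = timeKernel(storeResult, rows, ref, out, n, dim);
    *msTemp = timeKernel(storeTemp, rows, ref, out, n, dim);
    printf("store result: %f ms, store temp: %f ms\n", *msResult, *msTemp);

    cudaFree(rows);
    cudaFree(ref);
    cudaFree(out);
}
```

Calling `compareStores(2000, 3205, &a, &b)` from `main` prints both timings. Another way to check whether the loop survived optimization is to inspect the generated PTX with `nvcc -ptx`.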

diddum · March 7, 2009, 8:38am

Well, in the line above you are using the function powf to compute a square.

That’s very bad, because powf is probably 10 times slower than a multiplication.

giovanni
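As a sketch of the suggestion above, the kernel from the first post could compute the square with a plain multiplication instead of `powf`. The kernel name `updateUKESquare` and the helper `squaredDistanceCheck` are hypothetical, and the `MAX_XBLOCK`/`MAX_YBLOCK` values are taken from the first post. Whether this changes the run time depends on where the bottleneck actually is.

```cuda
#include <cuda_runtime.h>

#define MAX_XBLOCK 8
#define MAX_YBLOCK 8

// Same computation as updateUKE, with powf(v, 2) replaced by v * v.
__global__ void updateUKESquare(const float *d_examples, const float *d_somItems,
                                float *d_tmpD, float *d_tmpD1, float rr1,
                                int numVectors, int numNeurons, int dim)
{
    int x = blockIdx.x * MAX_XBLOCK + threadIdx.x;
    int y = blockIdx.y * MAX_YBLOCK + threadIdx.y;

    if (x < numVectors && y < numNeurons) {
        float auxDist = 0.0f;
        for (int j = 0; j < dim; ++j) {
            float diff = d_examples[x * dim + j] - d_somItems[y * dim + j];
            auxDist += diff * diff;   // one multiply instead of a powf call
        }
        d_tmpD[x * numNeurons + y] = auxDist;
        d_tmpD1[x * numNeurons + y] = -auxDist / rr1;
    }
}

// Tiny check on one vector and one neuron: (1,2,3) against (0,0,0).
// Returns the squared distance read back from d_tmpD.
float squaredDistanceCheck()
{
    const int numVectors = 1, numNeurons = 1, dim = 3;
    const float h_examples[dim] = {1.0f, 2.0f, 3.0f};
    const float h_somItems[dim] = {0.0f, 0.0f, 0.0f};

    float *d_examples, *d_somItems, *d_tmpD, *d_tmpD1;
    cudaMalloc(&d_examples, sizeof(h_examples));
    cudaMalloc(&d_somItems, sizeof(h_somItems));
    cudaMalloc(&d_tmpD, sizeof(float));
    cudaMalloc(&d_tmpD1, sizeof(float));
    cudaMemcpy(d_examples, h_examples, sizeof(h_examples), cudaMemcpyHostToDevice);
    cudaMemcpy(d_somItems, h_somItems, sizeof(h_somItems), cudaMemcpyHostToDevice);

    dim3 block(MAX_XBLOCK, MAX_YBLOCK);
    dim3 grid((numVectors + MAX_XBLOCK - 1) / MAX_XBLOCK,
              (numNeurons + MAX_YBLOCK - 1) / MAX_YBLOCK);
    updateUKESquare<<<grid, block>>>(d_examples, d_somItems, d_tmpD, d_tmpD1,
                                     2.0f, numVectors, numNeurons, dim);

    float h_tmpD = 0.0f;
    cudaMemcpy(&h_tmpD, d_tmpD, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_examples);
    cudaFree(d_somItems);
    cudaFree(d_tmpD);
    cudaFree(d_tmpD1);
    return h_tmpD;
}
```

For the tiny input in `squaredDistanceCheck`, the squared distance is 1 + 4 + 9 = 14.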

fvazquez · March 7, 2009, 8:49am

OK, but I tried the same code on the CPU with a sequential loop, that is:

for(i=0;i<numVectors…

 for(j=0;j<numNeurons....

      for(k=0;k<dim......

and the compute time is the same as on the GPU. Why? Can’t this algorithm be parallelized with CUDA?

Thanks,
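For comparison, a sequential version of the same computation might look like the sketch below. The loop bodies were not posted, so this is an assumption based on the kernel in the first post, and the function name `updateUKEHost` is hypothetical. It is plain host code that also compiles inside a `.cu` file, and it can serve as a reference for checking the GPU output.

```cuda
// Sequential reference: one squared distance per (vector, neuron) pair.
// Assumed to mirror the updateUKE kernel; not the poster's actual CPU code.
void updateUKEHost(const float *examples, const float *somItems,
                   float *tmpD, float *tmpD1, float rr1,
                   int numVectors, int numNeurons, int dim)
{
    for (int i = 0; i < numVectors; ++i) {
        for (int j = 0; j < numNeurons; ++j) {
            float auxDist = 0.0f;
            for (int k = 0; k < dim; ++k) {
                float diff = examples[i * dim + k] - somItems[j * dim + k];
                auxDist += diff * diff;
            }
            tmpD[i * numNeurons + j] = auxDist;
            tmpD1[i * numNeurons + j] = -auxDist / rr1;
        }
    }
}
```

With one vector (1, 2, 3), one neuron (0, 0, 0) and rr1 = 2, this writes 14 to `tmpD[0]` and -7 to `tmpD1[0]`.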

fvazquez · March 7, 2009, 9:08am

No, with a multiplication the time is the same.

Best regards, Francisco
