Asymmetric kernels: two apparently symmetric kernels with different performance

Hi to all,

I’m experiencing a strange behavior that I can’t understand.

I have two kernels which operate on the same objects but in symmetric ways.

The first one multiplies each row of a matrix V element-wise by a vector P and puts the result in a complex matrix C:

So

[b][u]
C(i,:).x = V(i,:)*P(:)
C(i,:).y = 0.             for every row i
[/u][/b]

The second one works in the symmetric way:

[b]
V(i,:) = C(i,:).x*P(:)    for every row i
[/b]

I expect the same execution time for both kernels, but the first kernel is twice as fast as the second one.

Evidently I’m missing something.

Any ideas or help?

thanks

Marco

Here are the 2 kernels:


[codebox]
__global__ void
second(cucmplx* cvn, cureal* vn, cureal* pot, int height, int nfftrec, int pthsize)
{
    int x = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
    int y = IMUL(blockDim.y, blockIdx.y);

    if (x < nfftrec && y < height) {
        int ind = IMUL(y, pthsize) + x;
        vn[ind] = cvn[ind].x * pot[x];   // reads only the .x half of each complex element
    }
}
[/codebox]

[codebox]
__global__ void
first(cucmplx* cvn, cureal* vn, cureal* pot, int height, int nfftrec, int pthsize)
{
    int x = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
    int y = IMUL(blockDim.y, blockIdx.y);

    if (x < nfftrec && y < height) {
        int ind = IMUL(y, pthsize) + x;
        cvn[ind].x = vn[ind] * pot[x];
        cvn[ind].y = 0.0f;               // writes both halves of each complex element
    }
}
[/codebox]
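For reference, the comparison can be reproduced with a straightforward CUDA-event timing like the sketch below; the device pointers (d_cvn, d_vn, d_pot) and the launch configuration here are placeholders, not my actual setup:

[codebox]
// Rough timing sketch with CUDA events. The device pointers and the
// launch configuration below are placeholders / assumptions.
dim3 block(256, 1);
dim3 grid((nfftrec + block.x - 1) / block.x, height);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
first<<<grid, block>>>(d_cvn, d_vn, d_pot, height, nfftrec, pthsize);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float msFirst = 0.0f;
cudaEventElapsedTime(&msFirst, start, stop);

cudaEventRecord(start, 0);
second<<<grid, block>>>(d_cvn, d_vn, d_pot, height, nfftrec, pthsize);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float msSecond = 0.0f;
cudaEventElapsedTime(&msSecond, start, stop);

printf("first: %f ms   second: %f ms\n", msFirst, msSecond);

cudaEventDestroy(start);
cudaEventDestroy(stop);
[/codebox]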

Simple: the kernels don’t do the same amount of work. The ‘first’ kernel (which is posted second…) writes to cvn[ind].y, and this has no mirror in the ‘second’ kernel (which is posted first…). It should certainly be slower, though a factor of two is a bigger difference than I would expect. Remember: unless your kernel does a lot of arithmetic, you’re almost certain to be memory bound in CUDA.

Tigga, thanks for your reply, and sorry for the order of the posted kernels.

Your argument sounds good, but the problem is that the “first” kernel is faster than the “second”.

I found the solution to my problem.

The problem was that in the second kernel I access global memory to read only cvn[ind].x and not cvn[ind].y, and the compiler breaks coalescing.

As suggested here on opennvidia, I added a volatile variable so that both halves of the complex element are actually read.

Now the second kernel is:

[codebox]
__global__ void
second(cucmplx* cvn, cureal* vn, cureal* pot, int height, int nfftrec, int pthsize)
{
    int x = IMUL(blockDim.x, blockIdx.x) + threadIdx.x;
    int y = IMUL(blockDim.y, blockIdx.y);
    volatile cucmplx locc;

    if (x < nfftrec && y < height) {
        int ind = IMUL(y, pthsize) + x;
        locc.x = cvn[ind].x;             // read both halves so the 8-byte access stays coalesced
        locc.y = cvn[ind].y;
        vn[ind] = locc.x * pot[x];
    }
}
[/codebox]

Now the running times of the two kernels are comparable.
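For what it’s worth, an alternative sketch that should have the same effect, assuming cucmplx has the same memory layout as CUDA’s built-in float2 and cureal is a plain float, is to fetch the whole complex element with a single 8-byte vector load instead of using the volatile trick:

[codebox]
// Alternative sketch: assumes cucmplx is laid out like float2 (two packed
// 32-bit floats) and cureal is a plain float. The whole complex element is
// fetched with one 8-byte vector load, so the access stays coalesced.
__global__ void
second_vec(const float2* cvn, float* vn, const float* pot,
           int height, int nfftrec, int pthsize)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.y * blockIdx.y;

    if (x < nfftrec && y < height) {
        int ind = y * pthsize + x;
        float2 locc = cvn[ind];      // single coalesced 8-byte load
        vn[ind] = locc.x * pot[x];
    }
}
[/codebox]

Either way the point is the same: the full 8-byte element gets read, so the warp’s accesses stay coalesced.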

bye

marco