Problem with a shared variable

Hi, guys

I've got a strange problem. First, look at this piece of code:

__global__ void k_calc_cost(PTMatrix matrix, int ybase, int xbase)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int bx = blockIdx.x;
    int by = blockIdx.y;

    int pbase = ybase + COST_BLOCK_SIZE * by;
    int qbase = xbase + COST_BLOCK_SIZE * bx;
    int pid = pbase + ty;
    int qid = qbase + tx;

    if (pid == 0 && qid == 0)
    {
        *matrix.d_max_candidate_reached_count = 0;
    }

    __shared__ float W[COST_BLOCK_SIZE][COST_BLOCK_SIZE];

    if (pid < matrix.p.number && qid < matrix.q.number)
    {
        Particle& pxyz = matrix.p.d_array[pid];
        Particle& qxyz = matrix.q.d_array[qid];

        float cost =
            (pxyz.x - qxyz.x) * (pxyz.x - qxyz.x) +
            (pxyz.y - qxyz.y) * (pxyz.y - qxyz.y) +
            (pxyz.z - qxyz.z) * (pxyz.z - qxyz.z);

        W[ty][tx] = cost;

        //printf("cost (%d, %d) = %f\n", pid, qid, cost);
    }

.........

It's a simple one. The problem is on this line:

W[ty][tx] = cost;

When I commented it out, the program ran fast, but when I uncommented it, it became very slow. Does anybody have an idea what happened? Thanks.

What do you mean by ‘fast’ and ‘very slow’?

I think that when you remove the assignment, nvcc removes the whole if { } block, because nothing computed inside it is used any more. That block contains two global memory reads (uncoalesced, so very slow), which is why the difference is so noticeable.
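To make the dead-code point concrete, here is a minimal sketch (not the code from the question). The kernels k_no_store and k_store and the helper run_square are made-up names for illustration. The first kernel computes a value and throws it away, so the compiler is free to drop the load and the multiply; the second writes the result out, so the work has to stay. The helper times either kernel with CUDA events.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: the product is computed but never stored anywhere,
// so the compiler may remove the global read and the multiply entirely.
__global__ void k_no_store(const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float v = in[i] * in[i];
        (void)v;  // no observable effect
    }
}

// Same arithmetic, but the result is written to global memory,
// so the read and the multiply must be kept.
__global__ void k_store(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        out[i] = in[i] * in[i];
    }
}

// Hypothetical host helper: launches one of the two kernels and returns the
// elapsed time in milliseconds. When `store` is true and `host_out` is not
// null, the squared values are copied back into *host_out.
float run_square(bool store, const std::vector<float>& host_in,
                 std::vector<float>* host_out)
{
    int n = static_cast<int>(host_in.size());
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);
    if (store)
        k_store<<<blocks, threads>>>(d_in, d_out, n);
    else
        k_no_store<<<blocks, threads>>>(d_in, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    if (store && host_out != nullptr)
    {
        host_out->resize(n);
        cudaMemcpy(host_out->data(), d_out, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return ms;
}
```

Calling run_square(false, ...) and run_square(true, ...) on a large input and comparing the two times should show the same kind of gap as in the question, although the exact numbers depend on the GPU and the compiler version.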

Really? If I comment out only one line:

// W[ty][tx] = cost;

then nvcc removes the whole if { }?

I found that my CUDA program isn't faster than plain C++ code that iterates over the matrix, so I had to debug by commenting out some lines, and found that this line affects the overall speed.

You can check whether that's true by adding "-keep" to your nvcc command line and examining the code in the .ptx file.

To improve performance you need to make your memory accesses coalesced. Check the Programming Guide for more details.
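As a rough illustration of what coalesced access could look like here: in the kernel from the question, threads with consecutive tx read matrix.q.d_array[qid], so neighbouring threads touch addresses one whole Particle apart. One common way around that is a structure-of-arrays layout, where each coordinate lives in its own array. The sketch below assumes that Particle holds float x, y, z. ParticlesSoA, HostParticles, k_cost_soa and cost_matrix are invented names, and the result goes straight to global memory instead of the shared W tile, so treat it as a sketch of the layout only.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical structure-of-arrays layout: one device array per coordinate.
struct ParticlesSoA
{
    const float* x;
    const float* y;
    const float* z;
    int number;
};

// Hypothetical host-side mirror of the same layout.
struct HostParticles
{
    std::vector<float> x, y, z;
};

// Squared distance between every p and every q, stored row-major in `cost`.
// Threads with consecutive threadIdx.x read consecutive floats of q.x, q.y
// and q.z, and write consecutive floats of `cost`.
__global__ void k_cost_soa(ParticlesSoA p, ParticlesSoA q, float* cost)
{
    int qid = blockIdx.x * blockDim.x + threadIdx.x;
    int pid = blockIdx.y * blockDim.y + threadIdx.y;
    if (pid < p.number && qid < q.number)
    {
        float dx = p.x[pid] - q.x[qid];
        float dy = p.y[pid] - q.y[qid];
        float dz = p.z[pid] - q.z[qid];
        cost[pid * q.number + qid] = dx * dx + dy * dy + dz * dz;
    }
}

// Copies a host vector into a freshly allocated device array.
static float* upload(const std::vector<float>& v)
{
    float* d = nullptr;
    cudaMalloc(&d, v.size() * sizeof(float));
    cudaMemcpy(d, v.data(), v.size() * sizeof(float), cudaMemcpyHostToDevice);
    return d;
}

// Host helper: returns the np x nq cost matrix in row-major order.
std::vector<float> cost_matrix(const HostParticles& hp, const HostParticles& hq)
{
    int np = static_cast<int>(hp.x.size());
    int nq = static_cast<int>(hq.x.size());

    float* px = upload(hp.x);
    float* py = upload(hp.y);
    float* pz = upload(hp.z);
    float* qx = upload(hq.x);
    float* qy = upload(hq.y);
    float* qz = upload(hq.z);
    ParticlesSoA p{px, py, pz, np};
    ParticlesSoA q{qx, qy, qz, nq};

    float* d_cost = nullptr;
    cudaMalloc(&d_cost, np * nq * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((nq + block.x - 1) / block.x, (np + block.y - 1) / block.y);
    k_cost_soa<<<grid, block>>>(p, q, d_cost);

    // The blocking copy also waits for the kernel to finish.
    std::vector<float> cost(np * nq);
    cudaMemcpy(cost.data(), d_cost, np * nq * sizeof(float),
               cudaMemcpyDeviceToHost);

    cudaFree(px); cudaFree(py); cudaFree(pz);
    cudaFree(qx); cudaFree(qy); cudaFree(qz);
    cudaFree(d_cost);
    return cost;
}
```

Whether this pays off in the real program depends on how Particle is actually defined and on the GPU generation, so it is worth measuring rather than assuming.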

You should also examine possible thread divergence due to if { } blocks: all threads in a warp execute the same instruction at a time. If, for example, only one of the 32 threads in a warp takes a branch, then all the others have to wait for it, slowing down overall execution.
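A small sketch of the difference between a branch that splits a warp and one that does not. The kernel k_branch and the helper run_branch are made-up names, and the two branch bodies are deliberately trivial. With bodies this small the compiler may well use predication instead of a real branch, so this shows the pattern rather than a benchmark. In the kernel from the question, the bounds check should only diverge in blocks at the edge of the matrix.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel with two code paths.
// split_within_warp == true:  even and odd lanes disagree, so every warp has
//                             to execute both paths one after the other.
// split_within_warp == false: all 32 lanes of a warp agree, so every warp
//                             executes only one of the two paths.
__global__ void k_branch(const int* in, int* out, int n, bool split_within_warp)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    bool take_a = split_within_warp
                      ? (threadIdx.x % 2 == 0)
                      : ((threadIdx.x / warpSize) % 2 == 0);

    if (take_a)
        out[i] = in[i] * 2;    // path A
    else
        out[i] = in[i] + 100;  // path B
}

// Hypothetical host helper: runs k_branch with 64 threads per block
// (two warps) and returns the output array.
std::vector<int> run_branch(bool split_within_warp, const std::vector<int>& host_in)
{
    int n = static_cast<int>(host_in.size());
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int threads = 64;
    int blocks = (n + threads - 1) / threads;
    k_branch<<<blocks, threads>>>(d_in, d_out, n, split_within_warp);

    // The blocking copy also waits for the kernel to finish.
    std::vector<int> host_out(n);
    cudaMemcpy(host_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

With real, expensive branch bodies the first variant costs roughly the sum of both paths per warp, while the second costs only the path that warp actually takes.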

Thanks very much, I'm re-reading the programming guide :-)