Just give me some advice on writing to global memory.

My task is to rescale a matrix into a larger matrix.

For example, A → B, where A is 1024×1024 and B is 1111×1024; every element of B is computed from the corresponding row of A.

My method is to use 1024 blocks; each block has (1111+7)/8 threads, and each thread computes 8 elements.
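
For reference, a minimal host-side launch matching that description might look like this (the concrete values of sw, sh, r and the shared-memory size are assumptions drawn from the numbers above; pDest and pSrc are assumed to be device pointers that are already allocated):

[codebox]
// Hypothetical launch: 1024 blocks (one per row), (1111+7)/8 = 139
// threads per block, dynamic shared memory holding one source row.
z_int32 sw = 1024, sh = 1024;
z_float32 r = 1111.0f / 1024.0f;           // destination width / source width
z_int32 dw = z_int32( sw * r );            // 1111
dim3 grid( sh );                           // one block per row of A
dim3 block( (dw + 7) / 8 );                // 139 threads, 8 elements each
size_t shmem = sw * sizeof(z_float32);     // shared copy of one source row
ZoomH<<< grid, block, shmem >>>( pDest, pSrc, sw, sh, r );
[/codebox]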

I think I have several problems:

1> The thread count is not a multiple of 32 (see the stride sketch after this list).

2> When writing to global memory (pLine2 is the start address of a row), the X dimension of B is not aligned, so pLine2[i] may be written several times? Do I need a __syncthreads() before writing? I tried one and got no benefit. (See also the cudaMallocPitch sketch after the kernel.)

3> Could you give me any other advice?
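
One sketch that touches questions 1 and 2: note that with (1111+7)/8 = 139 threads but a stride of dw>>3 = 138, two threads (e.g. tid 0 on its second iteration and tid 138 on its first) can land on the same i, which may be the repeated write in question 2. Striding by blockDim.x instead makes any multiple-of-32 block size work with no duplicates. ComputeOneElement below is a hypothetical helper standing in for the two inner loops:

[codebox]
// Sketch: stride by blockDim.x instead of dw>>3. Any block size that
// is a multiple of 32 (e.g. 128 or 256) then works, no two threads
// ever compute the same i, so pLine2[i] is never written twice and
// no __syncthreads() is needed around the store.
for( z_int32 i = tid; i < dw; i += blockDim.x )
{
    pLine2[i] = ComputeOneElement( horz, sw, i );  // hypothetical helper
}
[/codebox]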

My kernel function follows (sw is the source width, dw is the destination width):

[codebox]
extern __shared__ char array[];

__global__
void ZoomH( z_float32* pDest, z_float32* pSrc, z_int32 sw, z_int32 sh, z_float32 r )
{
	z_float32  sum;
	z_float32* horz = (z_float32*)array;           // shared copy of one source row

	z_int32 tid = threadIdx.x;
	z_int32 dw  = z_int32( sw * r );               // destination width
	z_float32* pLine1 = pSrc  + blockIdx.x * sw;   // this block's source row
	z_float32* pLine2 = pDest + blockIdx.x * dw;   // this block's destination row
	z_float32 pdr = z_float32( z_pi / r );

	// Each thread loads 8 source elements into shared memory.
	z_int32 step = sw >> 3;
	if( tid < step )
	{
		for( z_int32 i = tid; i < sw; i += step ) horz[i] = pLine1[i];
	}
	__syncthreads();

	step = dw >> 3;
	//#pragma unroll 5
	for( z_int32 i = tid; i < dw; i += step )
	{
		sum = 0.f;
		z_float32 alpha  = pdr * i;
		z_float32 alpha2 = alpha - 3.1415926535897932384f * tid;
		z_float32 sinAlpha = __sinf( alpha );
		// Accumulate horz[j] / (alpha - pi*j), split at j == tid.
		for( z_int32 j = tid; j < sw; j++, alpha2 -= 3.1415926535897932384f )
		{
			sum += horz[j] / alpha2;
		}
		for( z_int32 k = 0; k < tid; k++, alpha -= 3.1415926535897932384f )
		{
			sum += horz[k] / alpha;
		}
		sum *= sinAlpha;
		pLine2[i] = sum;
	}
}
[/codebox]
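
For the alignment half of question 2, one option is to allocate B with cudaMallocPitch, which pads every row out to an aligned stride; a minimal sketch (pitchB is a name introduced here, and the row-pointer arithmetic in the kernel would change accordingly):

[codebox]
// Allocate B so each of its rows starts at an aligned address.
z_float32* pDest;
size_t pitchB;   // row stride in bytes, >= dw * sizeof(z_float32)
cudaMallocPitch( (void**)&pDest, &pitchB, dw * sizeof(z_float32), sh );

// In the kernel, the row pointer is then computed from the pitch:
//   z_float32* pLine2 = (z_float32*)( (char*)pDest + blockIdx.x * pitchB );
[/codebox]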

For question 1, I tried splitting the work across 64, 128, or 256 threads per block, so the number of elements per thread is no longer uniform, but I can't see any difference in performance.
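
To make such a comparison concrete, each configuration can be timed with CUDA events (standard API; 128 threads shown, swap in 64 or 256):

[codebox]
// Time one launch with CUDA events; repeat with 64/128/256 threads.
cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );
cudaEventRecord( start, 0 );
ZoomH<<< sh, 128, sw * sizeof(z_float32) >>>( pDest, pSrc, sw, sh, r );
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
float ms = 0.f;
cudaEventElapsedTime( &ms, start, stop );   // milliseconds for this launch
cudaEventDestroy( start );
cudaEventDestroy( stop );
[/codebox]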

It's not a problem on a compute capability 1.3 device, is it?
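
(For reference, the capability of the device actually in use can be confirmed at runtime:)

[codebox]
#include <cstdio>

// Query device 0 and print its compute capability.
cudaDeviceProp prop;
cudaGetDeviceProperties( &prop, 0 );
printf( "compute capability %d.%d\n", prop.major, prop.minor );
[/codebox]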