I have a task: rescale a matrix into a larger matrix.
For example, A >> B, where A is 1024*1024 and B is 1111*1024; every element of B is computed from the corresponding row of A.
My method uses 1024 blocks (one per row); each block has (1111+7)/8 threads, and each thread computes 8 elements.
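The decomposition above works out to 139 threads per block. A minimal host-side sketch of that arithmetic (the helper name is mine, not from any API):

```cpp
// Illustrative helper: block size for one row of width dw when each
// thread produces `perThread` destination elements.
int threadsForRow(int dw, int perThread) {
    return (dw + perThread - 1) / perThread;  // ceil(dw / perThread)
}
```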
I think I have several problems:
1> The thread count is not a multiple of 32.
2> When writing to global memory (pLine2 is the base address of a row), the X dimension of B is not aligned, so the writes to pLine2[i] may take several transactions. Do I need a __syncthreads() before writing? I tried it and saw no benefit.
3> Could you give me some other advice?
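One common way to handle problem 1 (a sketch under my own naming, not a library call): round the block size up to the next multiple of the warp size, 32, and let the surplus threads fall through the kernel's bounds checks.

```cpp
// Round a thread count up to a whole number of warps so no warp is
// only partially populated by the launch configuration.
int roundUpToWarp(int threads, int warpSize = 32) {
    return ((threads + warpSize - 1) / warpSize) * warpSize;
}
```

For 139 threads this gives 160, i.e. five full warps per block.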
Here is my kernel function (sw is the source width, dw is the destination width):
[codebox]extern __shared__ char array[];

__global__
void ZoomH( z_float32* pDest, z_float32* pSrc, z_int32 sw, z_int32 sh, z_float32 r )
{
    z_float32 sum;
    z_float32* horz = (z_float32*)array;           // per-row cache in shared memory
    z_int32 tid = threadIdx.x;
    z_int32 dw = z_int32( sw * r );
    z_float32* pLine1 = pSrc  + blockIdx.x * sw;   // source row
    z_float32* pLine2 = pDest + blockIdx.x * dw;   // destination row
    z_float32 pdr = z_float32( z_pi / r );

    // load the source row into shared memory, 8 elements per thread
    z_int32 step = sw >> 3;
    if( tid < step )
    {
        for( z_int32 i = tid; i < sw; i += step ) horz[i] = pLine1[i];
    }
    __syncthreads();

    // stride by the block size; dw >> 3 would make the last thread
    // recompute elements already covered by thread 0
    step = blockDim.x;
    //#pragma unroll 5
    for( z_int32 i = tid; i < dw; i += step )
    {
        if( i == 0 ) { pLine2[0] = horz[0]; continue; }  // alpha == 0 would divide by zero below
        sum = 0.f;
        z_float32 alpha  = pdr * i;
        z_float32 alpha2 = alpha - 3.1415926535897932384f * tid;
        z_float32 sinAlpha = __sinf( alpha );
        // the two loops together accumulate horz[j] / (alpha - pi*j) over the
        // whole row, split at tid so each thread starts at a different offset
        for( z_int32 j = tid; j < sw; j++, alpha2 -= 3.1415926535897932384f )
        {
            sum += horz[j] / alpha2;
        }
        for( z_int32 k = 0; k < tid; k++, alpha -= 3.1415926535897932384f )
        {
            sum += horz[k] / alpha;
        }
        sum *= sinAlpha;
        pLine2[i] = sum;
    }
}[/codebox]
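For checking the kernel's output, a CPU reference of what I believe the sum computes — sinc (Whittaker-Shannon) resampling of one row, with a guard for the case where a destination sample lands exactly on a source sample (function name and tolerance are mine):

```cpp
#include <cmath>
#include <vector>

// Reference for one row: dst[i] = sin(alpha) * sum_j src[j] / (alpha - pi*j),
// where alpha = pi * i / r, matching the kernel's accumulation.
std::vector<float> zoomRowRef(const std::vector<float>& src, float r) {
    const float pi = 3.14159265358979323846f;
    int sw = (int)src.size();
    int dw = (int)(sw * r);
    std::vector<float> dst(dw);
    for (int i = 0; i < dw; ++i) {
        float alpha = pi * i / r;          // pi * position in source coordinates
        double sum = 0.0;
        bool exact = false;
        for (int j = 0; j < sw; ++j) {
            float d = alpha - pi * j;      // denominator of the sinc term
            if (std::fabs(d) < 1e-6f) {    // destination sample hits src[j]
                dst[i] = src[j];
                exact = true;
                break;
            }
            sum += src[j] / d;
        }
        if (!exact) dst[i] = (float)(std::sin(alpha) * sum);
    }
    return dst;
}
```

With r = 1 this reduces to the identity, which makes a quick sanity check before comparing against the GPU result.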