Need help in GPU programming

"and which commands should I add for the reduction scan of smem?"

I pointed you to the post and samples so that you could distill from them how to program a reduction / sum scan.
Seemingly, adding the reduction / sum scan is what is really holding you back at this point.

Here is a very elementary (inclusive) sum scan. It is probably not the most efficient, but it would serve your purpose, it is easy to grasp (and so helps in understanding scanning), and it also permits summing across multiple warps (block size > 32).

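{
    int cnt1, cnt2;   // cnt1: current stride; cnt2: flag, set when this thread has a value to write
    double dbl1;      // partial sum kept in a register between the read and the write

    cnt1 = 1;
    cnt2 = 0;

    do
    {
        // read phase: threads at index >= cnt1 add the element cnt1 places to their left
        if (threadIdx.x >= cnt1)
        {
            dbl1 = smem[threadIdx.x] + smem[threadIdx.x - cnt1];
            cnt2 = 1;
        }

        // every read of this pass must finish before any thread overwrites smem
        __syncthreads();

        // write phase
        if (cnt2 > 0)
        {
            smem[threadIdx.x] = dbl1;
            cnt2 = 0;
        }

        // every write must be visible before the next pass reads smem again
        __syncthreads();

        cnt1 = cnt1 * 2;   // double the stride
    }
    while (cnt1 < blockDim.x);
}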

In the above, I have assumed the underlying type to be double, but it may just as well be any other type.
The sum is stored in the last array element of smem, that is smem[blockDim.x - 1].
The above assumes that the array to sum has as many elements as there are threads in the block, an assumption that holds in your particular case.
I am sure there are countless other reduction / scan examples.
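
To see the scan in context, here is a minimal sketch of a complete kernel plus a host-side launcher. The names block_scan_kernel, scan_one_block and check are made up for illustration and are not from any library. The template parameter stands in for the "any other type" remark, and the sketch assumes a single block of at most 1024 threads with exactly one array element per thread.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

// Hypothetical helper: turn a CUDA error code into a C++ exception.
static void check(cudaError_t err)
{
    if (err != cudaSuccess)
        throw std::runtime_error(cudaGetErrorString(err));
}

// Inclusive sum scan of one block's worth of data, using the same
// doubling-stride scheme as the snippet above. T is the element type.
template <typename T>
__global__ void block_scan_kernel(const T* in, T* out, T* block_sum)
{
    // Dynamic shared memory, declared as raw bytes so the kernel can be
    // instantiated for several types without clashing declarations.
    extern __shared__ __align__(16) unsigned char smem_raw[];
    T* smem = reinterpret_cast<T*>(smem_raw);

    const unsigned int tid = threadIdx.x;
    smem[tid] = in[tid];
    __syncthreads();

    for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
    {
        T partial = smem[tid];
        if (tid >= stride)
            partial += smem[tid - stride];
        __syncthreads();      // all reads of this pass are done
        smem[tid] = partial;
        __syncthreads();      // all writes of this pass are visible
    }

    out[tid] = smem[tid];
    if (tid == blockDim.x - 1)
        *block_sum = smem[tid];   // inclusive scan: last element is the total
}

// Hypothetical host wrapper: scans host_in on the GPU with a single block,
// returns the scanned array and stores the overall sum in *total.
// Error paths do not free device memory; this is a sketch, not production code.
template <typename T>
std::vector<T> scan_one_block(const std::vector<T>& host_in, T* total)
{
    const size_t n = host_in.size();
    if (n == 0 || n > 1024)
        throw std::invalid_argument("size must be between 1 and 1024 (one block)");
    const size_t bytes = n * sizeof(T);

    T* d_in = nullptr;
    T* d_out = nullptr;
    T* d_sum = nullptr;
    check(cudaMalloc(&d_in, bytes));
    check(cudaMalloc(&d_out, bytes));
    check(cudaMalloc(&d_sum, sizeof(T)));
    check(cudaMemcpy(d_in, host_in.data(), bytes, cudaMemcpyHostToDevice));

    // One block, one thread per element, one shared-memory slot per thread.
    block_scan_kernel<T><<<1, static_cast<unsigned int>(n), bytes>>>(d_in, d_out, d_sum);
    check(cudaGetLastError());

    std::vector<T> host_out(n);
    check(cudaMemcpy(host_out.data(), d_out, bytes, cudaMemcpyDeviceToHost));
    check(cudaMemcpy(total, d_sum, sizeof(T), cudaMemcpyDeviceToHost));

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_sum);
    return host_out;
}
```

Calling scan_one_block on a std::vector<double> holding 1, 2, ..., 8 should give back the running sums 1, 3, 6, 10, ... with the total in the last element. The block size does not have to be a power of two for this scheme, but arrays longer than one block would need an extra pass over the per-block sums, which this sketch leaves out.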

Sir, do you have any experience with Thrust in CUDA?