Scanning an Image with a Sliding 20×20 Block (Average Filter) in CUDA/C++

I wrote a CUDA/C++ program that scans an image with a 20×20 block. At the moment the block jumps 20 pixels at a time along both rows and columns, but I want it to advance by only 1 pixel each time. For example, the first 20×20 block starts at (0,0) and the next one starts at (20,0), and the same happens in the other direction. What I want is that, after reading the 20×20 block at (0,0), the next block starts at (1,0).
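To make the desired behavior concrete, here is a minimal CPU reference in plain C++, assuming a tightly packed 8-bit grayscale image and an average filter over each window. The names `windowMean` and `slidingMean` are hypothetical and not part of my program. It slides the window one pixel at a time; in a CUDA kernel the two outer loops would presumably be replaced by the per-thread `row` and `col` indices.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: mean of the win x win window whose top-left corner is
// (row, col). The caller must ensure the window lies fully inside the image.
inline std::uint8_t windowMean(const std::vector<std::uint8_t>& img,
                               int w, int row, int col, int win)
{
    unsigned sum = 0;
    for (int dy = 0; dy < win; ++dy)
        for (int dx = 0; dx < win; ++dx)
            sum += img[static_cast<std::size_t>(row + dy) * w + (col + dx)];
    return static_cast<std::uint8_t>(sum / static_cast<unsigned>(win * win));
}

// Slides the window with a stride of 1 pixel in both directions.
// The result has (h - win + 1) rows and (w - win + 1) columns, row-major.
std::vector<std::uint8_t> slidingMean(const std::vector<std::uint8_t>& img,
                                      int h, int w, int win)
{
    const int outH = h - win + 1;
    const int outW = w - win + 1;
    std::vector<std::uint8_t> out;
    if (outH <= 0 || outW <= 0) return out;  // window larger than the image

    out.resize(static_cast<std::size_t>(outH) * outW);
    for (int row = 0; row < outH; ++row)      // window origin moves by 1 row
        for (int col = 0; col < outW; ++col)  // and by 1 column
            out[static_cast<std::size_t>(row) * outW + col] =
                windowMean(img, w, row, col, win);
    return out;
}
```

Calling `slidingMean(img, h, w, 20)` would give one averaged value per 20×20 window position, which is the output I am after.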

GPU function:

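__global__ void _TEST_GPU(uchar* mt, uchar* motion, size_t step, int h, int w)
{
    // Global pixel coordinates handled by this thread.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= h || col >= w) return;  // guard threads outside the image
    // step is the row pitch in bytes.
    int index = col + row * (step / sizeof(uchar));
    mt[index] = 255;
}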

Calling the GPU function:

dim3 block(20, 20);
dim3 grid(image.cols / block.x, image.rows / block.y);
_TEST_GPU<<<grid, block>>>((uchar *), (uchar *), GMat.step, dst.rows, dst.cols);
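If each thread were to handle one window origin instead of one pixel, the grid would have to cover every valid origin, not just whole 20×20 tiles. A rough host-side sketch of that sizing follows, assuming a window of `win` pixels and rounding up so the last partial block is still launched. `divUp`, `LaunchDims` and `windowGrid` are hypothetical names used only for illustration.

```cpp
// Hypothetical helper: number of blocks of size d needed to cover n items.
constexpr int divUp(int n, int d) { return (n + d - 1) / d; }

// Plain stand-in for the x/y members of a dim3 grid.
struct LaunchDims { int gridX, gridY; };

// One thread per window origin: there are (cols - win + 1) x (rows - win + 1)
// origins for which the win x win window fits inside the image.
constexpr LaunchDims windowGrid(int cols, int rows, int win,
                                int blockX, int blockY)
{
    return { divUp(cols - win + 1, blockX), divUp(rows - win + 1, blockY) };
}
```

The two values could then be passed to `dim3 grid(...)`. Because the grid is rounded up, the kernel would also need a bounds check so that threads past the last valid origin do nothing.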