Getting zero values on 2D large kernels

Hi,

The index should be calculated like this:
(blockDim is the number of threads per block in each dimension, and blockIdx is the 2D block index)

int i = threadIdx.x + blockIdx.x * blockDim.x;
int j = threadIdx.y + blockIdx.y * blockDim.y;

After updating the index calculation this way, the assignments come out correct.
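For reference, here is a minimal sketch of how that index calculation might be used end to end. The kernel name fill_indices, the helper count_mismatches, the 16x16 block size and the row-major layout are my own assumptions for illustration, not taken from the original code. The grid is rounded up to cover the whole array, so the kernel also needs a bounds check.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical kernel: each thread writes its flat index into one cell
// of a width x height array stored in row-major order.
__global__ void fill_indices(int *out, int width, int height)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;  // column
    int j = threadIdx.y + blockIdx.y * blockDim.y;  // row

    // The grid is rounded up, so some threads fall outside the array.
    if (i < width && j < height) {
        out[j * width + i] = j * width + i;
    }
}

// Hypothetical helper: launches the kernel and returns the number of cells
// holding an unexpected value, or -1 if a CUDA call fails.
int count_mismatches(int width, int height)
{
    size_t n = static_cast<size_t>(width) * height;
    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return -1;

    // Fill with 0xFF bytes (every int becomes -1) so untouched cells stand out.
    cudaMemset(d_out, 0xFF, n * sizeof(int));

    dim3 block(16, 16);  // assumed block size
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);  // round up in both dimensions
    fill_indices<<<grid, block>>>(d_out, width, height);
    cudaError_t launch_err = cudaGetLastError();

    std::vector<int> host(n);
    cudaError_t copy_err = cudaMemcpy(host.data(), d_out, n * sizeof(int),
                                      cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (launch_err != cudaSuccess || copy_err != cudaSuccess) return -1;

    int bad = 0;
    for (size_t k = 0; k < n; ++k) {
        if (host[k] != static_cast<int>(k)) ++bad;
    }
    return bad;
}
```

Calling count_mismatches(1000, 700) from main gives a quick check that no cell was skipped. Sizes that are not multiples of the block size are the interesting case, since they exercise the bounds check.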

Thanks.