2D Jacobi CUDA problem

I found this code and topic on the internet, but when I read it I ran into some problems, at least with my understanding of the code so far. I hope someone can give me some hints or help with this. Thanks!

Code below:

#include <cuda_runtime.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>   // for gettimeofday()

#define BLOCK_SIZE_X 16
#define BLOCK_SIZE_Y 16

__global__ void JacobiRelaxationGPU(float *u_d, float *f_d, int ArraySizeX, int ArraySizeY, float h)
{
    int tx = threadIdx.x;                    // here tx = [0:15]
    int ty = threadIdx.y;                    // here ty = [0:15]
    int bx = blockIdx.x*(BLOCK_SIZE_X-2)+1;  // bx = [1:1009]
    int by = blockIdx.y*(BLOCK_SIZE_Y-2)+1;  // by = [1:1009]
    int x = tx + bx;                         // x = [1:1024]
    int y = ty + by;                         // y = [1:1024]

    __shared__ float u_sh[BLOCK_SIZE_X][BLOCK_SIZE_Y];

    u_sh[tx][ty] = u_d[x+y*ArraySizeX]; // I don't understand here. Since x = [1:1024] and y = [1:1024],
                                        // if x = 1024 and y = 1024 and ArraySizeX = 1024, the index at this
                                        // moment exceeds the upper bound 1024*1024 of the matrix u_d!
    __syncthreads();

    if (tx>0 && tx<BLOCK_SIZE_X-1 && ty>0 && ty<BLOCK_SIZE_Y-1)
    {
        u_d[x+y*ArraySizeX] = (1.0/4.0)*(u_sh[tx+1][ty]+u_sh[tx-1][ty]+u_sh[tx][ty+1]+u_sh[tx][ty-1]-f_d[x+y*ArraySizeX]*h*h);
    }
}

int main()
{
    float *u_h;
    float *f_h;
    float *u_d;
    float *f_d;
    int ArraySizeX = 1024*16;
    int ArraySizeY = 1024*16;
    long size = (long)ArraySizeX * ArraySizeY * sizeof(float);
    long mysize = (long)ArraySizeX * ArraySizeY;
    struct timeval afterLoadtp1;

    // allocate arrays on host
    u_h = (float*)malloc(size);
    f_h = (float*)malloc(size);

    // allocate arrays on device
    cudaMalloc((void**)&u_d, size);
    cudaMalloc((void**)&f_d, size);

    long i = 0;
    for (i = 0; i < mysize; i++)  // matrix initialization
    {
        u_h[i] = 3;
        f_h[i] = 4;
    }

    cudaMemcpy(u_d, u_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(f_d, f_h, size, cudaMemcpyHostToDevice);

    gettimeofday(&afterLoadtp1, NULL);

    int nBlocksX = ArraySizeX/(BLOCK_SIZE_X-2);
    int nBlocksY = ArraySizeY/(BLOCK_SIZE_Y-2);

    dim3 dimBlock(BLOCK_SIZE_X, BLOCK_SIZE_Y);
    dim3 dimGrid(nBlocksX, nBlocksY);

    JacobiRelaxationGPU<<<dimGrid, dimBlock>>>(u_d, f_d, ArraySizeX, ArraySizeY, 1);

    cudaMemcpy(u_h, u_d, size, cudaMemcpyDeviceToHost);

    free(u_h);
    free(f_h);
    cudaFree(u_d);
    cudaFree(f_d);

    printf("**********Execution is done*********!!!\n");
    return 0;
}

What you must remember is that arrays and threads are 0-indexed, so x and y should go from 0 to 1023.

So for (x, y) = (0, 0) you get u_d[0+0*ArraySizeX] = u_d[0].

If you want the second column, or (1, 0), you get u_d[1+0*ArraySizeX] = u_d[1].

Now let's say you want the first column of the second row, or (0, 1): you get u_d[0+1*1024] = u_d[1024]. Each row has 1024 elements and the first row occupies indices 0 to 1023, so 1024 is the first element of the second row.

This continues all the way up to 1023+1023*1024, which is the last element.
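If it helps, here is a tiny host-side snippet (my own sketch, assuming ArraySizeX = 1024 as in the kernel comments) that prints exactly these offsets:

#include <stdio.h>

int main(void)
{
    int ArraySizeX = 1024;  // row length, as assumed in the kernel comments

    // the (x, y) probes discussed above, ending with the last element
    int probes[][2] = { {0, 0}, {1, 0}, {0, 1}, {1023, 1023} };
    for (int i = 0; i < 4; i++)
    {
        int x = probes[i][0];
        int y = probes[i][1];
        // row-major linear offset: x walks along a row, y jumps whole rows
        printf("(x=%4d, y=%4d) -> u_d[%d]\n", x, y, x + y*ArraySizeX);
    }
    return 0;
}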

Hope this helps.

Thank you for your reply. Yes, threadIdx.x and threadIdx.y both run from 0 to 15 in this case (sorry, my mistake, I wrote 1023 earlier). But in the code here, u_d[x+y*ArraySizeX] uses x = bx+tx and y = by+ty, whose indices start from 1 rather than 0.

No, threadIdx.x and threadIdx.y range from 0 to 16. Where you are mistaken is this: x and y resolve to indices within the spatial grid, and x + y*ArraySizeX is the offset in linear memory. Both should be correct.

Thank you for your reply. Sorry, my mistake earlier about the threadIdx.x and threadIdx.y range. But threadIdx runs over [0, 15]; it cannot be 16 in this case.
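For what it's worth, a quick test kernel like this (my own sketch, using device-side printf) shows the largest index a thread actually sees in a 16x16 block:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void PrintMaxThreadIdx(void)
{
    // the largest index any thread sees is blockDim - 1, i.e. 15 for a 16x16 block
    if (threadIdx.x == blockDim.x - 1 && threadIdx.y == blockDim.y - 1)
        printf("max threadIdx = (%d, %d)\n", threadIdx.x, threadIdx.y);
}

int main()
{
    dim3 dimBlock(16, 16);
    PrintMaxThreadIdx<<<1, dimBlock>>>();
    cudaDeviceSynchronize();  // flush device-side printf output
    return 0;
}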