jacobi relaxation indexing in two dimension


I am new to CUDA and trying to understand the recently published example from Brent Oster (NVIDIA): Programming The CUDA Architecture: A Look At GPU Computing (09/04/2009) on the calculation of a stencil for the Poisson equation in 2D.

Now, I implemented the idea almost 1:1 on my workstation (Quadro FX4200, SUSE Linux 11.1) and can run it without errors.

However, there are some things I don’t get and some help would be great.

The setup is done in …/projects/… folder and I am using a rather generic Makefile to compile everything.

Of particular interest to me is the way of indexing in 2D for implementing different equations in 2D, therefore the example from Brent Oster looks as a perfect candidate to study.

I understand the basics of CUDA. I worked thorough some examples like MatrixMul and understood things like cudaMalloc, cudaMemcpy, cutilCheckMsg, cutCreateTimer, etc., but still the following things are unclear:

Brent defines the grid:

int nBlockX = NX/(BLOCK_SIZE_X-2);

If NX=1024 and BLOCK_SIZE = 16, nBlockX = 1024/14 which is 73.xx rather than an “int” number. Sorry, I am still confused by this, I thought dividing an Array e.g. 1024 in Blocks should give a multiple of that?

In the Kernel, Brent defines

int tx = threadIdx.x;

		int ty = threadIdx.y;

		int bx = blockIdx.x*(BLOCK_SIZE_X-2)+1;

		int by = blockIdx.y*(BLOCK_SIZE_Y-2)+1;

		int x = tx + bx;

		int y = ty + by;

		__shared__	  float aa[BLOCK_SIZE_X][BLOCK_SIZE_Y];

		aa[tx][ty] = a_d[x+y*NX];


		if (tx > 0 && tx < BLOCK_SIZE_X-1 && ty > 0 && ty < BLOCK_SIZE_Y-1)


//	  a_d[x+y*NX] = 1;

		a_d[x+y*NX] = aa[tx+1][ty] + aa[tx-1][ty] + aa[tx][ty+1] + aa[tx][ty-1];


[i]Hint: I changed Brent’s ArraySizeX in NX and ArraySizeY in NY. Also the code presented above neglects the factor (1.0/4.0) multiplied to the stencil and also the subtraction of the function f*h^2, to be more simple even. Also the array aa was renamed from u_sh.





My setup for NX, NY:

NX = 32

NY = 32

Given that, the heart of the Kernel is the stencil operation, however before calculating a stencil I tried to set

a_d[x+y*NX] = aa[tx][ty] + 1;

and the results is somehow not understood by me, giving some 0’s and 1’s for the a_d array in rather “strange” order.

I understand what Brent tries to do with the idea of having an array of size NN and using an inlay array of size (N-2)(N-2) so that calculations like ± 1 can be done for the halos, however I am a bit lost with this nice example.

Just to clarify some of my understandings:

The big array NX*NY gets subdivided into blocks. Together these block fit the array in size. Comparable with MatrixMul example (see latest Programming Guide 2.x or …/projects/MatrixMul).

Next, an dynamic array of size BLOCK_SIZE_XBLOCK_SIZEY is created and loaded with values by the “big index” (x+y*NX) like shown in many other examples.

Within the if (…) {} clause, the stencil is calculated for tx 1…15 and ty 1…15

For the NX=32, NY=32 and BLOCK_SIZE_X=16 and BLOCK_SIZE_Y=16 example, there should be four blocks within a 32*32=1024 array? Is that right?

Then first,

int tx = threadIdx.x; // 0..15

		int ty = threadIdx.y; // 0..15

		int bx = blockIdx.x*(BLOCK_SIZE_X-2)+1; // 0*(16-2)+1 = 1

		int by = blockIdx.y*(BLOCK_SIZE_Y-2)+1; // 0*(16-2)+1 = 1

		int x = tx + bx; // 0..15 + 1

		int y = ty + by; // 0..15 + 1

Then second,

int tx = threadIdx.x; // 0..15

		int ty = threadIdx.y; // 0..15

		int bx = blockIdx.x*(BLOCK_SIZE_X-2)+1; // 1*(16-2)+1 = 15

		int by = blockIdx.y*(BLOCK_SIZE_Y-2)+1; // 1*(16-2)+1 = 15

		int x = tx + bx; // 0..15 + 15

		int y = ty + by; // 0..15 + 15


and (first)

aa[tx][ty] = a_d[x+y*NX]; // aa[0..15][0..15] = a_d[(0..15+1) + (0..15+1)*32]

Is that guessing right?

I am plotting results like

int MM = NX*NY;

		for (int i = 0; i<MM; ++i) {

		printf("GPU  a[%d] = %f\n",i, a_h[i]);


Some clarification would be highly appreciated! Thanks! :-D



Minor update:
OK I am working hard and the Gold sample stencil is now working in 2D with big index indexing (i + jNX) :-)
I decided for a 5
5 = 25 array outlay and NX-2 * NY-2 inlay array giving a real array of 3*3=9 and the rest for the halos.
Before allocating dynamic shared memory and creating (threading) blocks, I try to do it similar to the Gold sample but on GPU side…


Minor update 2:
Right, I am stepping forward and now also the GPU stencil is working, however not for shared array yet. I try to understand the simple one first. This is good news. :-)