jacobi relaxation indexing in two dimension

ludx · April 12, 2009, 4:23pm

Hi,

I am new to CUDA and trying to understand the recently published example from Brent Oster (NVIDIA): Programming The CUDA Architecture: A Look At GPU Computing (09/04/2009) on the calculation of a stencil for the Poisson equation in 2D.

Now, I implemented the idea almost 1:1 on my workstation (Quadro FX4200, SUSE Linux 11.1) and can run it without errors.

However, there are some things I don’t get and some help would be great.

The setup is done in …/projects/… folder and I am using a rather generic Makefile to compile everything.

Of particular interest to me is the way of indexing in 2D for implementing different equations in 2D, therefore the example from Brent Oster looks as a perfect candidate to study.

I understand the basics of CUDA. I worked thorough some examples like MatrixMul and understood things like cudaMalloc, cudaMemcpy, cutilCheckMsg, cutCreateTimer, etc., but still the following things are unclear:

Brent defines the grid:

int nBlockX = NX/(BLOCK_SIZE_X-2);

If NX=1024 and BLOCK_SIZE = 16, nBlockX = 1024/14 which is 73.xx rather than an “int” number. Sorry, I am still confused by this, I thought dividing an Array e.g. 1024 in Blocks should give a multiple of that?

In the Kernel, Brent defines

int tx = threadIdx.x;

		int ty = threadIdx.y;

		int bx = blockIdx.x*(BLOCK_SIZE_X-2)+1;

		int by = blockIdx.y*(BLOCK_SIZE_Y-2)+1;

		int x = tx + bx;

		int y = ty + by;

		__shared__	  float aa[BLOCK_SIZE_X][BLOCK_SIZE_Y];

		aa[tx][ty] = a_d[x+y*NX];

		__syncthreads();

		if (tx > 0 && tx < BLOCK_SIZE_X-1 && ty > 0 && ty < BLOCK_SIZE_Y-1)

		{

//	  a_d[x+y*NX] = 1;

		a_d[x+y*NX] = aa[tx+1][ty] + aa[tx-1][ty] + aa[tx][ty+1] + aa[tx][ty-1];

		}

[i]Hint: I changed Brent’s ArraySizeX in NX and ArraySizeY in NY. Also the code presented above neglects the factor (1.0/4.0) multiplied to the stencil and also the subtraction of the function f*h^2, to be more simple even. Also the array aa was renamed from u_sh.

Further,

BLOCK_SIZE_X = 16

BLOCK_SIZE_Y = 16

[/i]

My setup for NX, NY:

NX = 32

NY = 32

Given that, the heart of the Kernel is the stencil operation, however before calculating a stencil I tried to set

a_d[x+y*NX] = aa[tx][ty] + 1;

and the results is somehow not understood by me, giving some 0’s and 1’s for the a_d array in rather “strange” order.

I understand what Brent tries to do with the idea of having an array of size NN and using an inlay array of size (N-2)(N-2) so that calculations like ± 1 can be done for the halos, however I am a bit lost with this nice example.

Just to clarify some of my understandings:

The big array NX*NY gets subdivided into blocks. Together these block fit the array in size. Comparable with MatrixMul example (see latest Programming Guide 2.x or …/projects/MatrixMul).

Next, an dynamic array of size BLOCK_SIZE_XBLOCK_SIZEY is created and loaded with values by the “big index” (x+y*NX) like shown in many other examples.

Within the if (…) {} clause, the stencil is calculated for tx 1…15 and ty 1…15

For the NX=32, NY=32 and BLOCK_SIZE_X=16 and BLOCK_SIZE_Y=16 example, there should be four blocks within a 32*32=1024 array? Is that right?

Then first,

int tx = threadIdx.x; // 0..15

		int ty = threadIdx.y; // 0..15

		int bx = blockIdx.x*(BLOCK_SIZE_X-2)+1; // 0*(16-2)+1 = 1

		int by = blockIdx.y*(BLOCK_SIZE_Y-2)+1; // 0*(16-2)+1 = 1

		int x = tx + bx; // 0..15 + 1

		int y = ty + by; // 0..15 + 1

Then second,

int tx = threadIdx.x; // 0..15

		int ty = threadIdx.y; // 0..15

		int bx = blockIdx.x*(BLOCK_SIZE_X-2)+1; // 1*(16-2)+1 = 15

		int by = blockIdx.y*(BLOCK_SIZE_Y-2)+1; // 1*(16-2)+1 = 15

		int x = tx + bx; // 0..15 + 15

		int y = ty + by; // 0..15 + 15

etc.

and (first)

aa[tx][ty] = a_d[x+y*NX]; // aa[0..15][0..15] = a_d[(0..15+1) + (0..15+1)*32]

Is that guessing right?

I am plotting results like

int MM = NX*NY;

		for (int i = 0; i<MM; ++i) {

		printf("GPU  a[%d] = %f\n",i, a_h[i]);

		}

Some clarification would be highly appreciated! Thanks! :-D

Cheers,

Lx

ludx · April 13, 2009, 12:10am

Minor update:
OK I am working hard and the Gold sample stencil is now working in 2D with big index indexing (i + jNX) :-)
I decided for a 55 = 25 array outlay and NX-2 * NY-2 inlay array giving a real array of 3*3=9 and the rest for the halos.
Before allocating dynamic shared memory and creating (threading) blocks, I try to do it similar to the Gold sample but on GPU side…

Lx

ludx · April 13, 2009, 10:35am

Minor update 2:
Right, I am stepping forward and now also the GPU stencil is working, however not for shared array yet. I try to understand the simple one first. This is good news. :-)

Topic		Replies	Views
Urgent help with threads please! CUDA Programming and Performance	21	10787	March 6, 2008
Double loop, sum on j for each i. atomicAdd()? CUDA Programming and Performance	20	7023	March 25, 2012
3D arrays - where to start? CUDA Programming and Performance	12	20321	October 13, 2013
GPU/CPU precision comparison and Kernel instructions question CUDA Programming and Performance	5	679	April 4, 2017
Threads begginer question CUDA Programming and Performance	8	8122	July 16, 2007
Problems when using shared memory CUDA Programming and Performance cuda	1	312	May 15, 2024
Annoying problems with memory and/or syntax CUDA Programming and Performance	19	4769	April 8, 2008
3D grids CUDA Programming and Performance	13	10004	June 29, 2009
Newbie help on thread blocks CUDA Programming and Performance	22	10600	December 24, 2008
Is it possible to process multidimensional arrays inside the kernel? CUDA Programming and Performance	13	9047	March 31, 2015

jacobi relaxation indexing in two dimension

Related topics