General approach to non-parallelizable systems (Laplace Equation)

I’m working on a 1-D Laplace simulation as a toy project to prepare for a larger simulation. The array I’m working with is bigger than a single block can cover, and part of the computation needs to finish before the next part starts. The basic code is:

for(j=0;j<5000;j++) {

    // find the change to a (da)
    for(i=0;i<N;i++) {
        if (i==0) { da[i] = 0.0; }
        else if (i<(N-1)) {
            da[i] = (a[i-1] - 2.0*a[i] + a[i+1])*one_over_kappa;
        }
        else if (i==(N-1)) { da[i] = 0.0; }
    }

    // update a[i]
    for(i=0;i<N;i++) {
        a[i] = a[i] + da[i];
    }
}

The approach I’ve taken to writing this code with CUDA is to create copies of a and da on the device and then use two different kernels: one to compute the value of da, and a second to update a. This seems to work, but I’m a novice. Three questions:

(1) The device memory is persistent, right? I don’t have to copy the arrays back and forth after every kernel call?

(2) With a parallel library like MPI, a global barrier “MPI_Barrier()” is available that synchronizes all the processes. Do I understand correctly that CUDA doesn’t support such a call?

(3) I just don’t “get” the blocksize argument in the kernel call right now. Can somebody suggest a “good” size for a code like this and also provide a rationale for such a blocksize?

thanks!

Absolutely correct: device memory persists across kernel launches, so you only need to copy data between host and device when the host actually needs it.

Letting a kernel run to completion is an implicit barrier: the next kernel launched on the same stream won’t start until the previous one has finished. Because of the block execution model (blocks may run in any order and are not all resident at once), a grid-wide barrier by any other means is impossible.
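For illustration, a minimal host-side sketch of how that plays out for your loop (the kernel names computeDaKernel and updateAKernel are placeholders, not anything from your code): the arrays stay on the device for the whole run, and the back-to-back launches on the default stream give you the barrier between the two passes.

// Hypothetical host-side driver: allocate once, iterate, copy back once.
cudaMalloc((void **)&a_d,  N * sizeof(float));
cudaMalloc((void **)&da_d, N * sizeof(float));
cudaMemcpy(a_d, a, N * sizeof(float), cudaMemcpyHostToDevice);   // copy in once

for (j = 0; j < 5000; j++) {
    // each launch on the default stream waits for the previous one to finish
    computeDaKernel <<< nBlocks, blockSize >>> (a_d, da_d, N, one_over_kappa);
    updateAKernel   <<< nBlocks, blockSize >>> (a_d, da_d, N);
}

cudaMemcpy(a, a_d, N * sizeof(float), cudaMemcpyDeviceToHost);   // copy out once
cudaFree(a_d);
cudaFree(da_d);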

The best thing you can do is write your kernel so that it works with any block size you hand it dynamically. Then benchmark it at block sizes 32, 64, 96, … 512 and see which runs the fastest.
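A minimal timing sketch for that sweep using CUDA events (it assumes the device arrays are already allocated and stdio is included; computeDaKernel and updateAKernel again stand in for whatever your kernels end up being called):

// Sweep candidate block sizes and time a short, representative run of each.
for (int bs = 32; bs <= 512; bs += 32) {
    int blocks = (N + bs - 1) / bs;   // round up so every element is covered
    // note: very small block sizes may push the block count past the hardware grid limit

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int j = 0; j < 100; j++) {
        computeDaKernel <<< blocks, bs >>> (a_d, da_d, N, one_over_kappa);
        updateAKernel   <<< blocks, bs >>> (a_d, da_d, N);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("block size %3d: %8.3f ms\n", bs, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}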

I want to understand this better. My present kernels look like the following. Does the programming guide say anything about the theory behind determining an “optimal” block size?

// Kernel definition
__global__ void smoothDataOnDevice (float *a, float *da, int N, float one_over_kappa, int repeat)
{
  // repeat is currently unused; idx is this thread's element index
  int idx = blockIdx.x * blockDim.x + threadIdx.x;

  if (idx == 0)
	{
	  // fixed lower boundary: no change
	  da[idx] = 0.0f;
	}
  else if (idx < (N - 1))
	{
	  da[idx] = (a[idx - 1] - 2.0f * a[idx] + a[idx + 1]) * one_over_kappa;
	}
  else if (idx == (N - 1))
	{
	  // fixed upper boundary: no change
	  da[idx] = 0.0f;
	}
  // threads with idx >= N fall through and do nothing
}

__global__ void updateDataOnDevice (float *a, float *da, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;

  if (idx < N)   // guard the extra threads in the last (partial) block
	{
	  a[idx] = a[idx] + da[idx];
	}
}

// Kernel usage
  int blockSize = 512;
  int nBlocks = N / blockSize;
  if (N % blockSize != 0) { nBlocks += 1; }

  smoothDataOnDevice <<< nBlocks, blockSize >>> (a_d, da_d, N, one_over_kappa,
                                                 num_steps_to_simulate);
  updateDataOnDevice <<< nBlocks, blockSize >>> (a_d, da_d, N);

Warps are 32 threads wide, so block sizes should be multiples of 32. That’s the extent of the easy part; from there it gets into hiding memory latency, how much shared memory you use, how many blocks you can fit on an SM, and so on.
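As an aside, and purely as an assumption about your setup rather than something from your post: newer CUDA toolkits ship an occupancy API that will propose a block size for a given kernel, which makes a reasonable starting point for the benchmark sweep. A minimal sketch:

// Ask the runtime for a block size that maximizes occupancy for this kernel
// (cudaOccupancyMaxPotentialBlockSize is only available in newer toolkits).
int minGridSize = 0;          // smallest grid that reaches full occupancy
int suggestedBlockSize = 0;

cudaOccupancyMaxPotentialBlockSize(&minGridSize, &suggestedBlockSize,
                                   updateDataOnDevice, 0, 0);

int nBlocks = (N + suggestedBlockSize - 1) / suggestedBlockSize;
updateDataOnDevice <<< nBlocks, suggestedBlockSize >>> (a_d, da_d, N);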

Ok, so I tried running the code with various values of blockSize (I’m referring to the second argument inside the <<< >>> of the kernel launch; please correct my language if there’s a better term):

 updateDataOnDevice <<< nBlocks, blockSize >>> (a_d, da_d, N);

When blockSize is 512 (19532 blocks), the code runs fine, and 256 (39063 blocks) also runs fine. When blockSize drops to 128 (78125 blocks) or below, the program immediately dies and the results are garbage. For reference, the code uses two linear float arrays that are each 10M elements long (so each array is about 40MB). Am I running into hardware limits with this block size? Naively, since the card has something like 256MB of memory, I figured I’d be OK with this data size.

The max number of blocks in one grid dimension is 65535, so 78125 blocks in a 1-D grid exceeds the limit.
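A launch like that fails with an invalid-configuration error, so checking the launch return makes the failure obvious instead of leaving garbage. A sketch using the standard runtime error checks (names otherwise as in the code above):

// Check for launch/configuration errors right after the launch,
// and for execution errors after the kernel has run.
updateDataOnDevice <<< nBlocks, blockSize >>> (a_d, da_d, N);

cudaError_t err = cudaGetLastError();              // catches bad launch configurations
if (err != cudaSuccess) {
    printf("launch failed: %s\n", cudaGetErrorString(err));
}

err = cudaDeviceSynchronize();                     // catches errors during execution
if (err != cudaSuccess) {
    printf("kernel failed: %s\n", cudaGetErrorString(err));
}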

Hi,
make the grid of blocks 2-dimensional. Each grid dimension can hold up to that 65k blocks, so two dimensions are more than enough; see the sketch below.
Philipp.
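A sketch of what that could look like, reusing the kernel and variable names from the code above (treat it as illustrative rather than tested): the kernel flattens the 2-D block index back into a 1-D element index, and the launch splits the blocks across x and y. The same index calculation would go into smoothDataOnDevice; its boundary checks already let threads past N-1 fall through.

__global__ void updateDataOnDevice (float *a, float *da, int N)
{
  // flatten the 2-D grid back into a 1-D element index
  int blockId = blockIdx.y * gridDim.x + blockIdx.x;
  int idx = blockId * blockDim.x + threadIdx.x;

  if (idx < N)   // extra threads from the padded grid do nothing
	{
	  a[idx] = a[idx] + da[idx];
	}
}

// Launch with a 2-D grid so no single grid dimension exceeds 65535.
  int blockSize = 128;
  int totalBlocks = (N + blockSize - 1) / blockSize;
  int gridX = 65535;
  int gridY = (totalBlocks + gridX - 1) / gridX;
  dim3 grid(gridX, gridY);

  updateDataOnDevice <<< grid, blockSize >>> (a_d, da_d, N);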