Need help implementing a simple kernel! I need to write a kernel for a simple euler forward meth

Hello everybody! I am fairly new to CUDA and I was writing a program to solve a heat equation using a simple Euler Method. I wrote the function in C++ and it seems to work fine but when I tried to implement the kernel the results are slightly different. e.g:

GPU CPU

1 302.0542908 302.0537720

2 -15.1334448 -15.1327314

3 -15.0020161 -15.0026360

The difference is usually greater when I increment the number of time steps for the Euler method. Here’s both the kernel and the function.

[codebox]void fwdEuler(float* u, const float dt, const float dx)

{

// Function for taking one Forward-Euler time-step. For all

// interior grid points, it replaces U[j] with U[j] at the next time

// step. Values at the boundaries j=0 and j=nj are unchanged. The

// array next_u is necessary for temporary storage.

const float nu = dt / (dx * dx);

float* next_u;



next_u = new float[N + 1];

for (int j = 1; j < N; j++)

	next_u[j] = u[j] + nu * (u[j - 1] - 2. * u[j] + u[j + 1]);

for (int j = 1; j < N; j++)

	u[j] = next_u[j];

}[/codebox]

When I call the function in the main:

[codebox]for (int time_step = 0; time_step < nsteps; time_step++)

	fwdEuler(h_b, dt, dx);

}[/codebox]

Here’s the kernel:

[codebox]global void fwdEulerKernel(float u, float next_u, const float dx,

					const float dt, int nsteps, int N)

{

int idx = blockDim.x * blockIdx.x + threadIdx.x;

float nu = dt / (dx * dx);



    // Skip the first element in u

if(blockIdx.x < 1)

	idx +=1; 

	

for (int i = 0; i < nsteps; i++)

{

	if(idx < N)

	{	

		next_u[idx] = u[idx] + nu * (u[idx - 1] - 2. * u[idx] + u[idx + 1]);

		u[idx] = next_u[idx];

	}// end if

}// end for i

}[/codebox]

Any suggestions??? Thanks!

Doesn’t look like anything to worry about. You have about 6 significant figures of agreement in each result you quote, which is about what you would expect for single precision floating point, I would have thought.

And this is completely normal behavior for floating point. The longer the chain of operations you perform, the more accumulation of error. You would see the same thing on the CPU if you had an even more precise standard to compare to.

Thanks for the quick reply guys!

I think that there is a logical error somewhere else in my code. For large numbers of N, the difference between both values is greater, but not in a consistent way. It varies a lot, even though the “next” value is calculated using the “previous” one. For example, I get a -100 in the CPU and 100 in the GPU but for the next value I get something like 250 and 240 respectively which makes no sense… I will go home and check it again later. If I can’t find the error I will post the code with a more detailed explanation.