help convert to a kernel pleeeeease

Hi there

I’m completely new to CUDA programming and am having a hard time grasping the concepts. I’ve written code in C and am now trying to convert the main computation part to CUDA. I’ve attempted it, but it returns the wrong results. Can anyone please help me? I attached the C version of the computation part that needs to be converted to a kernel. I have been trying for the past 2 days but it keeps returning the wrong results. Any ideas will be much appreciated.
fdtd.cpp (2.05 KB)

Your code is incomplete, so it is really hard to help. You have 2 for loops within the first loop that seem independent of the outer loop (n). INDEX is not defined, and IB is also unknown.

What part of your algorithm are you trying to parallelize? Or where do you see the parallelism possible?

Oh, my bad, I guess I should have explained the code. That is the part of my code I want to convert to a kernel. It gets called from the driver program. That piece of code currently runs on the CPU; I want to write a kernel based on it, run it on the GPU, and do performance tests.

MAX - length of the source matrix.

INDEX - a macro that I defined so that I could access the array in a 2D way for convenience. INDEX = i*width+j.

IE and IB are just the widths for the macro so that it can go to the next row.

dd is just a value that n must be less than.
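For reference, a macro matching that description might look like the sketch below. The name INDEX and its three arguments come from the posted code; the exact definition was not shown in the thread, so the parenthesization here is an assumption, and `flat_index` is a hypothetical helper added only for illustration. The parentheses matter because the call sites pass expressions such as `i-1`: a bare `i*width+j` would expand `INDEX(i-1,j,IB)` to `i-1*IB+j`, which silently reads the wrong element.

```cpp
// Hypothetical definition: the thread only states "INDEX = i*width+j".
// Parenthesizing every argument keeps INDEX(i-1, j, IB) from expanding
// to i-1*IB+j.
#define INDEX(i, j, width) ((i) * (width) + (j))

// Illustrative helper (not from the thread): row-major mapping of (i, j)
// into a flat array whose rows are `width` elements long.
inline int flat_index(int i, int j, int width) {
    return INDEX(i, j, width);
}
```

With this form, `INDEX(2-1, 3, 10)` refers to row 1, column 3 of a 10-wide array.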

I know this code doesn’t make sense on its own, but it’s basically the part of the finite-difference time-domain method where it steps through time, calculating the velocity and stress field vectors.

The part I want to make parallel is the two nested for loops.

okay, so you want to do something like:

for(n=0;n<MAX;n++) {

kernel_V<<<>>>()

kernel_T<<<>>>()

}

Then my advice is:

  • if your c’s fit in constant memory, put them in there

  • put your V’s and T’s in cudaArrays and bind 2D textures to them

  • make a 2D grid and block, i = blockIdx.x * blockDim.x + threadIdx.x; j = blockIdx.y * blockDim.y + threadIdx.y;

then

if ((i <= IE) && (j <= JE))

Vx[INDEX(i,j,IE)] += c1[INDEX(i,j,IB)] * ( tex2D(Txx_ref, i, j) - tex2D(Txx_ref, i-1, j) + tex2D(Txz_ref, i, j) - tex2D(Txz_ref, i, j-1) );

I hope you get the idea ;)
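The 2D grid suggested above has to contain enough blocks to cover the whole field even when the field size is not a multiple of the block size, which is why the bounds check is needed. Here is a minimal host-side sketch of that arithmetic in plain C++, so it can be checked without a GPU. The helper names `blocks_for`, `global_index` and `threads_in_range`, and the block size of 16 used in the note below, are assumptions for illustration and do not come from the thread.

```cpp
// Number of blocks needed to cover n elements with `threads` threads per
// block. Rounds up, so the last block may be partly idle.
inline int blocks_for(int n, int threads) {
    return (n + threads - 1) / threads;
}

// The same arithmetic each thread performs along one axis in the kernel:
// blockIdx * blockDim + threadIdx.
inline int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Count how many threads of a rounded-up launch pass the bounds check
// (index < n). The remaining threads must skip the update.
inline int threads_in_range(int n, int threads) {
    int count = 0;
    for (int b = 0; b < blocks_for(n, threads); ++b)
        for (int t = 0; t < threads; ++t)
            if (global_index(b, threads, t) < n) ++count;
    return count;
}
```

For a 100-wide field and 16 threads per block this gives 7 blocks (112 threads) per axis, of which 100 pass the check. In CUDA the two per-axis block counts would go into the `dim3` grid passed to the kernel launch.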

Not quite. What I tried was (this is in the kernel):

j = blockIdx.x * blockDim.x + threadIdx.x;
i = blockIdx.y * blockDim.y + threadIdx.y;

for (int n = 0; n < MAX; n++)
{
    if (j >= 2 && j <= IE && i >= 2 && i <= IE)
    {
        Vx[INDEX(i,j,IE)] = Vx[INDEX(i,j,IE)] + c1[INDEX(i,j,IB)] * ( (Txx[INDEX(i,j,IB)]-Txx[INDEX(i-1,j,IB)]) + (Txz[INDEX(i,j,IB)]-Txz[INDEX(i,j-1,IB)]) );
        Vz[INDEX(i,j,IE)] = Vz[INDEX(i,j,IE)] + c1[INDEX(i,j,IB)] * ( (Txz[INDEX(i,j,IB)]-Txz[INDEX(i-1,j,IB)]) + (Tzz[INDEX(i,j,IB)]-Tzz[INDEX(i,j-1,IB)]) );
    }
    if (j >= 2 && j <= IE-1 && i >= 2 && i <= IE-1)
    {
        Txx[INDEX(i,j,IB)] = Txx[INDEX(i,j,IB)] + c2[INDEX(i,j,IB)] * (Vx[INDEX(i+1,j,IE)]-Vx[INDEX(i,j,IE)]) + c3[INDEX(i,j,IB)] * (Vz[INDEX(i,j+1,IE)]-Vz[INDEX(i,j,IE)]);
        Tzz[INDEX(i,j,IB)] = Tzz[INDEX(i,j,IB)] + c2[INDEX(i,j,IB)] * (Vz[INDEX(i,j+1,IE)]-Vz[INDEX(i,j,IE)]) + c3[INDEX(i,j,IB)] * (Vx[INDEX(i+1,j,IE)]-Vx[INDEX(i,j,IE)]);
        Txz[INDEX(i,j,IB)] = Txz[INDEX(i,j,IB)] + c4[INDEX(i,j,IB)] * ( (Vx[INDEX(i,j+1,IE)]-Vx[INDEX(i,j,IE)]) + (Vz[INDEX(i+1,j,IE)]-Vz[INDEX(i,j,IE)]) );
    }

    if (n < dd)
    {
        Tzz[INDEX((IE/2),(JE/2),IB)] = source[n];
        Txx[INDEX((IE/2),(JE/2),IB)] = source[n];
    }
}

This was how I implemented my kernel, and it gave me the wrong results when I compared it to the CPU results.

Have you tried this series of articles? It’s quite useful:
http://www.ddj.com/architect/207200659

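That is to be expected. You are reading i-1,j and i+1,j to update i,j, but threads in different blocks are not synchronized with each other inside a kernel. So it might be that i-1,j is already at n = 10 while i+1,j is still at n = 2. That is why you need to do the for (n…) loop on the host: launches on the same stream run in order, so every thread finishes step n before any thread starts step n+1.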
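To make that ordering concrete, here is a CPU-only sketch in plain C++ of a host-driven time loop with two separate passes per step. It is not the poster's fdtd.cpp. The `Fields` struct, the function names, the single shared width (the thread uses separate IE and IB widths) and the scalar coefficients (the thread uses per-cell arrays c1 to c4) are all assumptions made to keep the example short. Each pass reads only the fields the other pass writes, so the cells within a pass can be updated in any order. That is the property two separate kernel launches would rely on.

```cpp
#include <cstddef>
#include <vector>

// Assumption: all fields share one row-major n x n layout.
struct Fields {
    int n;
    std::vector<float> Vx, Vz, Txx, Tzz, Txz;
    static std::size_t cells(int size) {
        return static_cast<std::size_t>(size) * static_cast<std::size_t>(size);
    }
    explicit Fields(int size)
        : n(size), Vx(cells(size), 0.0f), Vz(cells(size), 0.0f),
          Txx(cells(size), 0.0f), Tzz(cells(size), 0.0f), Txz(cells(size), 0.0f) {}
    int at(int i, int j) const { return i * n + j; }
};

// Pass 1: velocities read only stresses. Stands in for one kernel launch.
void update_velocity(Fields& f, float c1) {
    for (int i = 1; i < f.n; ++i)
        for (int j = 1; j < f.n; ++j) {
            int k = f.at(i, j);
            f.Vx[k] += c1 * ((f.Txx[k] - f.Txx[f.at(i - 1, j)]) + (f.Txz[k] - f.Txz[f.at(i, j - 1)]));
            f.Vz[k] += c1 * ((f.Txz[k] - f.Txz[f.at(i - 1, j)]) + (f.Tzz[k] - f.Tzz[f.at(i, j - 1)]));
        }
}

// Pass 2: stresses read only velocities, all of which pass 1 has finished.
void update_stress(Fields& f, float c2, float c3, float c4) {
    for (int i = 1; i < f.n - 1; ++i)
        for (int j = 1; j < f.n - 1; ++j) {
            int k = f.at(i, j);
            float dVx_i = f.Vx[f.at(i + 1, j)] - f.Vx[k];
            float dVz_j = f.Vz[f.at(i, j + 1)] - f.Vz[k];
            f.Txx[k] += c2 * dVx_i + c3 * dVz_j;
            f.Tzz[k] += c2 * dVz_j + c3 * dVx_i;
            f.Txz[k] += c4 * ((f.Vx[f.at(i, j + 1)] - f.Vx[k]) + (f.Vz[f.at(i + 1, j)] - f.Vz[k]));
        }
}

// The time loop stays on the host; each call stands in for one launch.
void run_fdtd(Fields& f, const std::vector<float>& source, int steps,
              float c1, float c2, float c3, float c4) {
    int center = f.at(f.n / 2, f.n / 2);
    for (int n = 0; n < steps; ++n) {
        update_velocity(f, c1);
        update_stress(f, c2, c3, c4);
        if (n < static_cast<int>(source.size())) {  // stands in for `n < dd`
            f.Tzz[center] = source[n];
            f.Txx[center] = source[n];
        }
    }
}
```

On the GPU, the bodies of the two inner loops would become the bodies of the two kernels, with i and j computed from the block and thread indices instead of loop counters. The source injection could be a tiny third kernel or a small host-to-device copy.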

Thanks, dude. I will try that out. Thanks for taking the time to try and help, I really do appreciate it.