Converting from Loops to ND Ranges
Question about converting a main loop to an NDRange for a kernel

HolyGeneralK · October 17, 2010, 11:43pm

Hello,

I am trying to finish up my thesis code this week, and I’m completely stumped on one aspect that I thought was pretty intuitive. My code solves the 2D compressible Navier-Stokes equations with an explicit method, using OpenCL. My first step was translating a Fortran code (which used 2D arrays) into a C++ code (using 1D arrays). The main solution step involves several thousand iterations of 8 different functions. I have been able to convert 7 of these 8 with no problem. It is this last one that I am having problems with, and unfortunately it is also the most important one in the code, and the one where the largest benefit from parallelization could be seen.

My problem domain is a rectangular grid of size 181x65, i.e., there are 181 columns and 65 rows of data. These are termed imax and jmax, respectively.

This function really involves two main loops.

The first loop goes over the rows:

for(int j=2;j<=jmax;j++)
{
    ...Calculate Stuff On Rows...
}

and the second goes over the columns:

for(int i=2; i<=imax;i++)
{
    ...Calculate More stuff, on columns, here...
}

I was able to basically copy-paste all of my functions from the C++ code into kernels, launch them through clEnqueueTask, and obtain exactly the same results as the serial CPU code. I then converted the functions one by one into kernels launched through clEnqueueNDRangeKernel with a 2D range. After each one, I tested thoroughly, and was able to get 7 of the 8 done.

I broke the remaining kernel into two kernels, which still work as tasks so long as the first one finishes before the second one starts. [My goal is eventually to get them running side by side and then combine the results later, but I need this first step to work before I attempt that]. So, for example, my first task looked something like:

__kernel void calcFirstThing(arguments)
{
    for(int j=2;j<=jmax;j++)
    {
        ...Calculate Stuff on rows here...
    }
}

This kernel has the outer loop over the rows, and then has a nested loop that goes over the columns. I have a function that takes my two indices and converts them into a memory address. Again, this works fine as a task. The math only cares about things on the same row, and doesn’t jump to different rows.
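For readers following along, the index-to-address function mentioned above might look roughly like the sketch below. The name `cell_index` and the row-major layout with `imax + 1` entries per row are assumptions made for illustration; the actual function in the thesis code is not shown in this post.

```c
#include <assert.h>

/* Hypothetical helper: map 2D grid indices (i = column, j = row) to an
   offset in a flat 1D array. Assumes row-major storage with imax + 1
   entries per row, so that column indices 0..imax are all addressable. */
static int cell_index(int i, int j, int imax)
{
    assert(i >= 0 && i <= imax && j >= 0);  /* guard against stepping out of a row */
    return j * (imax + 1) + i;
}
```

With this layout, all cells of one row are contiguous in memory, which matches the statement that the row calculation never touches other rows.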

My problem arose when I tried to remove the outer loop from the kernels, and use a 1D NDRange to replace the loop. So, for example, my kernel would look like:

__kernel void calcFirstThing(arguments)
{
    int j = get_global_id(0);
    if(j < 2) return;    // Maintain lower bound of loop
    ...Calculate Stuff on rows here...
}

and I would enqueue the kernel via:

oclError = clEnqueueNDRangeKernel(queue, kernel, 1, offset, globalworksize, localworksize, 0, NULL, &event);
clEnqueueBarrier(queue);
clFlush(queue);
clFinish(queue);
// Check error...

where offset = 0, globalworksize=jmax+1, localworksize=1.

From my understanding, this should launch one kernel at a time until the global work size is reached, so it should launch kernels with global IDs from 0 through 65, and because of the conditional inside the kernel, each one should return immediately if its global ID is 0 or 1 (I have boundary cell references and I don’t want to step out of bounds on them). However, what I notice is that the results are nowhere near the same. After anywhere from one or two passes through the main loop to several hundred, depending on my input conditions, my answer “blows up” and starts dividing by zero. This is confounding me, because other than removing the outer loop of the kernel, I have not changed anything else.
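The index arithmetic in that description can be checked on the host with a small stand-in for the guard. The sketch below is plain C, not OpenCL; the function name `rows_processed` is hypothetical, and the serial loop only models which global IDs pass the guard. On a real device the work-items are not guaranteed to run one at a time or in this order.

```c
#include <assert.h>

/* Host-side model of a 1D NDRange: gid plays the role of get_global_id(0).
   Counts the IDs that pass the lower-bound guard and records the first
   and last such ID. */
static int rows_processed(int global_size, int lower, int *first, int *last)
{
    assert(global_size > 0 && first && last);
    int count = 0;
    *first = -1;
    *last = -1;
    for (int gid = 0; gid < global_size; gid++) {
        if (gid < lower) continue;      /* same effect as the early return in the kernel */
        if (count == 0) *first = gid;
        *last = gid;
        count++;
    }
    return count;
}
```

Called with a global size of jmax + 1 = 66 and a lower bound of 2, this visits IDs 2 through 65, the same 64 rows as the original `for(int j=2;j<=jmax;j++)` loop.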

So, now that I have described my problem, I wonder the following things:

The counters for the NDRange start at zero and go up to size-1, right? So if I want to run up to, and include, imax, my global work size should be imax+1?
Is my logic even remotely sound on switching out loops for an NDRange? I’m not a computer scientist, and most of my programming experience is Fortran/basic C++, and I’m still getting my head around OpenCL.
Lastly, I have five arrays in the OpenCL kernel, declared as:

__local float4 MyArray[200]; (similarly for the other arrays).

I know local memory is limited, and is probably hardware dependent, but if I am launching with a local work size of 1, this array should only be present once, correct? I’m using this array to store some temporary values from each row or column, depending on which kernel I am in.
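As a rough size check: in OpenCL, `__local` storage is allocated per work-group, so every work-group carries its own copy of these arrays even when the local work size is 1. The sketch below is plain C with a hypothetical helper name; it only does the arithmetic, relying on the fact that a `float4` occupies 16 bytes.

```c
#include <assert.h>
#include <stddef.h>

enum { FLOAT4_BYTES = 16 };  /* an OpenCL float4 is four 4-byte floats */

/* Bytes of __local storage one work-group needs for n_arrays arrays
   of n_elems float4 elements each. */
static size_t local_bytes(size_t n_arrays, size_t n_elems)
{
    assert(n_arrays > 0 && n_elems > 0);
    return n_arrays * n_elems * FLOAT4_BYTES;
}
```

For five arrays of 200 elements this comes to 16000 bytes per work-group, which can be compared against the value the device reports for `CL_DEVICE_LOCAL_MEM_SIZE` through `clGetDeviceInfo`.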

So I’m hoping that someone might have some advice on these matters. I’d be extremely grateful if someone could help me out! Thank you so much!

Update: I got some help at the Khronos OpenCL forums, on a topic of the same title. There were a few things I was confused about without realizing it, so I was doing it wrong.

karbous · October 21, 2010, 8:26pm

Have you found answers to all your questions, or are there still some confusing topics left?
