Strange Division by Zero on Large Problem Sizes

Hello all!

I am hoping that someone might be able to give me a bit of help with a problem I am having. I am writing a computational fluid dynamics (CFD) code using OpenCL, but I am running into trouble when the grid/mesh gets large.

The mesh I have is 2D, but I am storing it as a large 1D array of cl_float2.

My difficulty is that my program runs happily up to about 15,000 points in the array. Beyond 15,000 points, the program crashes, and Visual Studio 2008 reports an "Integer Division by Zero" error. I am using the C++ bindings for OpenCL, and I have not tried compiling with a different compiler.

My investigation so far has led me to:

  • Ensuring that the work group size is not larger than either the device or the kernel can handle (the device reports 512, the kernel reports 320; for a 20,000-point grid I might set it to 100 or 200)
  • Attempting to use a 2D work item size (the device reports 512x512x64; I am not sure how to get this from the kernel - see the query sketch after this list - and I have tried setting it manually to 200x100x1)
  • Altering the grids being used (as mentioned, they work up to about 15,000 points and give me problems after that)
  • Running on the CPU (a dual-core Core 2 Duo and a quad-core i7, which report a work group size of 1024 and work item sizes of 1024x1024x1024)
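
For reference, here is roughly how I am getting those numbers with the C++ bindings (the kernel query is my best guess from the cl.hpp header; device and kernel here stand for my already-created cl::Device and cl::Kernel):

[codebox]#include <CL/cl.hpp>
#include <iostream>
#include <vector>

void PrintWorkGroupLimits(const cl::Device& device, const cl::Kernel& kernel)
{
	// Maximum total work-items in a work group on this device
	::size_t devMax = device.getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>();

	// Per-dimension maxima (e.g. 512x512x64 on my GPU)
	std::vector<::size_t> dimMax = device.getInfo<CL_DEVICE_MAX_WORK_ITEM_SIZES>();

	// Maximum work-group size for this particular kernel on this device
	::size_t kernMax = kernel.getWorkGroupInfo<CL_KERNEL_WORK_GROUP_SIZE>(device);

	std::cout << "Device max work-group size: " << devMax << "\n"
	          << "Max work-item sizes: " << dimMax[0] << "x" << dimMax[1]
	          << "x" << dimMax[2] << "\n"
	          << "Kernel max work-group size: " << kernMax << std::endl;
}[/codebox]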

I have validated the inputs and outputs from the device, and my kernel works fine when the grids are small enough. Unfortunately, for the problems encountered in CFD, my grids need to be larger and more refined.

My hardware is:
Intel Core 2 Duo CPU @ 3 GHz using ATI-Stream OpenCL 1.0 Driver
3.25 GB RAM
Windows XP 32-bit SP3
nVIDIA GeForce 9500 GT, 1GB DDR2, with 32 GPU cores, using nVIDIA OpenCL 1.1 developer driver 258.19.

So, in short, is there anything obvious that I am doing wrong? I am, at heart, an engineer, not a computer programmer, but I am trying :)

Thanks in advance!

I know the documentation is a bit confusing here. The work group size (localWorkSize) can be at most 512 in x and y and 64 in z. However, it cannot be 512x512x64, because the product of the three must also not exceed 512. So a localWorkSize of 200x100x1 is too large and the kernel will not launch.

Reduce localWorkSize so that x*y*z <= 512, and increase globalWorkSize instead to match your problem size.
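
For example, something along these lines (just a sketch with my own variable names; note that in OpenCL 1.0/1.1 the global size must be a multiple of the local size, so round it up and skip the excess work-items inside the kernel):

[codebox]// Pick a legal work-group size for a 20,000-point 1D problem and
// round the global size up to the next multiple of it.
::size_t local  = 256;                                   // 256 <= 512, OK
::size_t global = ((20000 + local - 1) / local) * local; // = 20224

cl::NDRange localNDRange(local);
cl::NDRange globalNDRange(global);
// Work-items with get_global_id(0) >= 20000 must do nothing in the kernel.[/codebox]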

If OpenCL doesn’t like your work group sizes, it should be giving you an error, not crashing (you are checking the return codes of your OpenCL calls, right?). I suspect your kernel(s) may be accessing out-of-bounds memory (either reads or writes), which can lead to seemingly random failures later on (but it doesn’t have to, so if this is in fact your error, you may not see it for smaller grid sizes).
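
A cheap way to rule out the simplest case in a 1D kernel is to pass the real element count as an argument and bail out early on any extra work-items - a sketch (npoints and the kernel name are made up for the example):

[codebox]// OpenCL C: guard so padded work-items never touch memory past the mesh
__kernel void update(__global float2* mesh, const int npoints)
{
	int gid = get_global_id(0);
	if (gid >= npoints)
		return;            // padded work-item: nothing to do

	/* ... reads/writes of mesh[gid] are now in bounds ... */
}[/codebox]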

I am currently using the C++ bindings for OpenCL. They do not appear to pass back a return code from the

[codebox]cl::NDRange::NDRange(::size_t size0)[/codebox]

constructor, so there is no way to check it there. To verify this, I manually set the local size to 20e8 and passed that value to the constructor - it got through the constructor fine (but the program died during the calculation).

I have verified that my workgroup dimensions are well within the limits Tera stated (I have a local size of 100x0x0) - should this be 100x1x1? I am only passing in a 1D array of data, so I am unsure about the x0 or x1 at the end of the dimensioning.

Here is my code to launch the kernel and wait for it to complete. Is this the correct way to check for errors? It seems to be the way ATI does it in their examples, which actually use the C++ bindings (I cannot find an nVidia OpenCL example that uses them).

[codebox]void CalcKernel(int nKernel)
{
	cl::Event evt;

	try
	{
		// Launch kernel nKernel over the configured global/local ranges
		clHandles.clQueue.enqueueNDRangeKernel(clHandles.clKernels[nKernel],
			cl::NullRange,           // no global offset
			clParams.globalNDRange,  // total number of work-items
			clParams.localNDRange,   // work-group size
			0,                       // no event wait list
			&evt);

		// Block until the kernel has finished
		evt.wait();
	}
	catch (const cl::Error& clErr)
	{
		cerr << "cl::Error: " << clErr.what()
		     << " (" << clErr.err() << ")" << endl;
		ReleaseHostBuffers();
	}
}[/codebox]
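
One thing worth mentioning: if I understand cl.hpp correctly, the catch block above only works because exceptions are enabled before the header is included - without that define, the bindings return error codes instead of throwing cl::Error. I have this at the top of my source:

[codebox]#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>[/codebox]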

My feeling is leaning towards an out-of-bounds memory read/write - in my basic C++ classes I often dealt with corrupted memory due to bounds issues. I have double-checked all of my bounds (array sizes, limits, etc.), and they are consistent.

Thank you both for the prompt feedback - the forums here are much more helpful than the Khronos.org message boards!

Yes, they should be 100x1x1.
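
In the C++ bindings, either of these should do (the single-argument constructor gives a one-dimensional range, which matches your 1D data):

[codebox]cl::NDRange local(100);         // one-dimensional: 100 work-items per group
cl::NDRange local3(100, 1, 1);  // same group size, expressed as a 3D range[/codebox]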