Mandelbrot Fractal Issues (pls help)

Hello everyone! I’m currently a student at Edinboro University of PA working on a senior project. I’m trying to draw a Mandelbrot fractal using OpenCL. It works on the first zoom; however, on the second zoom, or even when I do nothing after the first, the display driver crashes. I have the newest NVIDIA drivers for my card (8800 GTS).

Here is my kernel code. I’m fairly certain it’s not my c code causing the issue. Am I using too much memory? I must be overwriting some memory somewhere.

const char* CreateImageSource[] = {
"__kernel void CreateImage(__global float* image, __global float* fractalProperties, __global unsigned int* fpOperations)",
"{",
	"unsigned int n = get_global_id(0);",
	"int iterations = 0;",
	"float realC = fractalProperties[0] + (n % 800) * fractalProperties[4];", //2
	"float imaginaryC = fractalProperties[2] + (int)(n / 800) * fractalProperties[5];", // 2
	"float realA = 0;",
	"float imaginaryA = 0;",
	"float magnitude = sqrt(pow(realC, 2) + pow(imaginaryC, 2));", //5
	"float hue = 0;",
	"float value = .556789;",
	"float saturation = 1;",
	"float red = 0;",
	"float green = 0;",
	"float blue = 0;",
	"int i = 0;",
	"float f = 0;",
	"float p = 0;",
	"float q = 0;",
	"float t = 0;",
	"fpOperations[n] = 0;", // 9 up to this point
	"while(magnitude >  fractalProperties[7] && magnitude < fractalProperties[8] && iterations < fractalProperties[6])", // 3
	"{",
		"float tempRealA = (realA * realA) - (imaginaryA * imaginaryA);", // 3
		"imaginaryA = (realA * imaginaryA) + (imaginaryA * realA);", // 3
		"realA = tempRealA + realC;", // 1
		"imaginaryA = imaginaryA + imaginaryC;", // 1
		"magnitude = sqrt(pow(realA, 2) + pow(imaginaryA, 2));", // 3
		"iterations = iterations + 1;",
		"fpOperations[n] += 14;", // add this iteration's operations to this work item's counter
	"}",
	"if(iterations < 20 || (int)(iterations / fractalProperties[6]) == 1)",
	"{",
		"image[n*3] = 0.0;",
		"image[n*3+1] = 0.0;",
		"image[n*3+2] = 0.0;",
	"}",
	"else",
	"{",
		"hue = 360 * ((float)iterations / (float)fractalProperties[6]);", // 2
		"hue = hue / (float)60;", // 1
		"i = (int)floor(hue);", // 1
		"f = hue - i;", // 1
		"p = value * ( 1 - saturation );", // 2
		"q = value * ( 1 - saturation * f );", // 3
		"t = value * ( 1 - saturation * ( 1 - f ) );", //4
		"switch( i )",
		"{",
			"case 0:",
			"{",
				"red = value;",
				"green = t;",
				"blue = p;",
				"break;",
			"}",
			"case 1:",
			"{",
				"red = q;",
				"green = value;",
				"blue = p;",
				"break;",
			"}",
			"case 2:",
			"{",
				"red = p;",
				"green = value;",
				"blue = t;",
				"break;",
			"}",
			"case 3:",
			"{",
				"red = p;",
				"green = q;",
				"blue = value;",
				"break;",
			"}",
			"case 4:",
			"{",
				"red = t;",
				"green = p;",
				"blue = value;",
				"break;",
			"}",
			"default:",
			"{",
				"red = value;",
				"green = p;",
				"blue = q;",
				"break;",
			"}",
		"}",
		"image[n*3] = red;",
		"image[n*3+1] = green;",
		"image[n*3+2] = blue;",
	"}",
"}"
};

image is a float[800*800*3], fractalProperties is a float[9], and fpOperations is a float[800*800].
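To put the "too much memory" worry into numbers, here is a small C++ sketch that adds up the three buffers described above. It assumes 4-byte elements (OpenCL's cl_float and cl_uint are both 32-bit), three floats per pixel for the image as the kernel's image[n*3+2] indexing suggests, and a made-up helper name.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: total bytes for the image, fractalProperties and
// fpOperations buffers of a side x side render, assuming 4-byte elements.
std::size_t totalBufferBytes(std::size_t side)
{
    assert(side > 0);
    const std::size_t elemBytes  = 4;                       // assumed sizeof(cl_float)
    const std::size_t pixels     = side * side;
    const std::size_t imageBytes = pixels * 3 * elemBytes;  // R, G, B per pixel
    const std::size_t propsBytes = 9 * elemBytes;           // fractalProperties[9]
    const std::size_t opsBytes   = pixels * elemBytes;      // one counter per pixel
    return imageBytes + propsBytes + opsBytes;
}
```

For side = 800 this comes to 10,240,036 bytes, a little under 10 MiB, so the buffers themselves are unlikely to be what exhausts the card.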

So after some searching I think I am enqueueing my NDRange wrong. I enqueue a range of 800*800, the size of my image. However, the kernel work-group info reported a max work size of around 400-something.

If I can’t use a number greater than this for my work size, and I can’t tell it what number to start my threads at, how can I get the correct global ID to access the correct spot in memory?

I’m just confused how all this work group stuff works out, how does one break up 800*800 so that it can be worked on?
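As a rough illustration of how one flat range of 800*800 work items covers the image, the sketch below mimics on the CPU what the kernel above does with n % 800 and n / 800: every global ID maps to exactly one pixel and one slot in the packed RGB buffer. It is plain C++ rather than OpenCL, and Pixel, pixelFromGlobalId and redIndex are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

struct Pixel { std::size_t x; std::size_t y; };

// Map a flat global ID (what get_global_id(0) would return) to a pixel
// coordinate in an image that is `width` pixels wide.
Pixel pixelFromGlobalId(std::size_t globalId, std::size_t width)
{
    assert(width > 0);
    Pixel p;
    p.x = globalId % width;  // column
    p.y = globalId / width;  // row
    return p;
}

// Index of the red channel for that work item in the packed RGB float buffer;
// green and blue follow at +1 and +2.
std::size_t redIndex(std::size_t globalId)
{
    return globalId * 3;
}
```

The global size is the total number of work items; the work-group size only says how many of them are grouped together at a time, so it does not limit how large the global range can be.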

So I did some research and found out that you can indeed offset the global work index. I have since fixed my error but run into another problem (not error).

I get the max work group size with

error = clGetKernelWorkGroupInfo(createImageKernelCL, platformDevices[0]->deviceIds[0], CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &maxWorkgroupSize, NULL);

I then use this in a loop to enqueue kernels to run. Before, I was using 800*800 as the work size with a 0 offset, which was crashing the display driver. This is the loop.

unsigned int iterations = (xyPixels * xyPixels) / (unsigned int)maxWorkgroupSize;
	size_t leftOverWorkgroupSize = (xyPixels * xyPixels) % (unsigned int)maxWorkgroupSize;
	cout << "Max Kernel Work Group Size " << maxWorkgroupSize << " , iterations " << iterations << ", left over work size " << leftOverWorkgroupSize << endl;

	// Enqueue every full-size chunk: i runs from 0 to iterations - 1.
	for(unsigned int i = 0 ; i < iterations; i++)
	{
		globalIDOffset = i * maxWorkgroupSize;
		clEnqueueNDRangeKernel(commandQueueIds[0], createImageKernelCL, 1, &globalIDOffset, &maxWorkgroupSize, NULL, 0, NULL, NULL);
	}

	// Enqueue the remaining work items; skip the launch when nothing is left over.
	if(leftOverWorkgroupSize > 0)
	{
		globalIDOffset = iterations * maxWorkgroupSize;
		clEnqueueNDRangeKernel(commandQueueIds[0], createImageKernelCL, 1, &globalIDOffset, &leftOverWorkgroupSize, NULL, 0, NULL, NULL);
	}

I divide my workspace by my max work-group size, which gives me the number of full-size launches (iterations). The loop index goes up to iterations - 1, and for the final launch, at index iterations, I just change my work size to the leftover amount.
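The chunking described above can be checked on the CPU before any kernel is enqueued. The sketch below is plain C++ with hypothetical names (Launch, planLaunches); it lists the offset and size of every launch, including the final partial one, so it is easy to confirm that no range of work items is skipped and that no launch has size zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Launch { std::size_t offset; std::size_t size; };

// Split `total` work items into launches of at most `chunk` items each.
// The last launch is shorter when `total` is not a multiple of `chunk`.
std::vector<Launch> planLaunches(std::size_t total, std::size_t chunk)
{
    assert(chunk > 0);
    std::vector<Launch> launches;
    for (std::size_t offset = 0; offset < total; offset += chunk) {
        const std::size_t remaining = total - offset;
        Launch l;
        l.offset = offset;
        l.size = remaining < chunk ? remaining : chunk;
        launches.push_back(l);
    }
    return launches;
}
```

As an example, with 640000 work items and an assumed chunk of 448 this plans 1428 full launches plus one final launch of 256 items starting at offset 639744.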

The problem I’m having is that, even though this solves my crash, it runs much, MUCH slower. clEnqueueNDRangeKernel seems to behave like a blocking command, so I can’t rapidly enqueue kernel executions; I have to wait for one to finish before even enqueueing the next.

So my question is, am I doing this wrong? I’m just having so much trouble trying to figure out workspaces and indexes. I understand I’m just a senior in college and probably can’t hang with most of you, but I could really use your help. OpenCL is really cool and can really rock out some flops when done right.

Thanks to anyone that can help me.

I have resolved this issue. I am only posting in case anyone does a search and sees this forum post.

Basically, Windows has a watchdog timer, and when the while loop in my kernel (which is a sequential operation) runs for too long, Windows resets and recovers the display driver. So I need to get a new card to use only for OpenCL, break up my kernel code into smaller kernels, lower the number of iterations I do in my while loop, or use Linux, which does not have a watchdog timer.
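One way to read "break up my kernel code into smaller kernels" is to run the escape-time loop in slices and save each pixel's state between launches, so that no single launch runs long enough to trip the watchdog. The sketch below emulates that idea on the CPU in plain C++. PixelState, iterateSlice and iterateInSlices are hypothetical names, and the escape radius of 2 is the usual Mandelbrot bound, assumed here in place of the fractalProperties[7] and fractalProperties[8] limits used in the kernel above.

```cpp
#include <cassert>

// Per-pixel state that would have to survive between launches.
struct PixelState {
    float realA = 0.0f;
    float imaginaryA = 0.0f;
    int iterations = 0;
    bool escaped = false;
};

// Advance one pixel by at most sliceLength iterations of z = z*z + c.
void iterateSlice(PixelState& s, float realC, float imaginaryC,
                  int sliceLength, int maxIterations)
{
    assert(sliceLength > 0);
    for (int k = 0; k < sliceLength && !s.escaped && s.iterations < maxIterations; ++k) {
        const float tempRealA = s.realA * s.realA - s.imaginaryA * s.imaginaryA;
        s.imaginaryA = 2.0f * s.realA * s.imaginaryA + imaginaryC;
        s.realA = tempRealA + realC;
        s.iterations += 1;
        // |z| > 2 is tested as |z|^2 > 4 to avoid the square root.
        if (s.realA * s.realA + s.imaginaryA * s.imaginaryA > 4.0f)
            s.escaped = true;
    }
}

// Run a pixel to completion one slice at a time; returns the iteration count.
int iterateInSlices(float realC, float imaginaryC, int sliceLength, int maxIterations)
{
    PixelState s;
    while (!s.escaped && s.iterations < maxIterations)
        iterateSlice(s, realC, imaginaryC, sliceLength, maxIterations);
    return s.iterations;
}
```

In an actual OpenCL version, the PixelState fields would presumably live in a global buffer that each launch reads and writes back, with the host enqueueing the slice kernel repeatedly. The slice length then trades launch overhead against how long each launch occupies the GPU.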

Sorry for the late reply.

CL_KERNEL_WORK_GROUP_SIZE refers to the maximum allowed local size (which you let the driver choose by passing NULL), not to the maximum global size (which is unbounded in practice). The developer should never have to divide the run into pieces by hand like that, so this looks like a driver bug. Have you tried passing the local size explicitly? Remember that the global size has to be evenly divisible by the local size.
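When the local size is passed explicitly, a common pattern is to round the global size up to the next multiple of the local size and let the kernel ignore the extra IDs. Below is a minimal C++ sketch of the rounding only; the helper name is hypothetical and the local sizes in the example are assumptions, not values reported by any particular device.

```cpp
#include <cassert>
#include <cstddef>

// Round globalSize up to the nearest multiple of localSize.
std::size_t roundUpToMultiple(std::size_t globalSize, std::size_t localSize)
{
    assert(localSize > 0);
    const std::size_t remainder = globalSize % localSize;
    return remainder == 0 ? globalSize : globalSize + (localSize - remainder);
}
```

For 800*800 = 640000 work items, a local size of 256 already divides evenly, while a local size of 448 would need the global size padded to 640192. With a padded global size the kernel needs a guard at the top, for example `if (n >= 800 * 800) return;`, so the extra work items do not write past the end of image.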