Goto in device code Is goto supported in device code?

Are goto statements supported in device code in opencl?

What I want to do iterate through a network of functions, in random order as determined by the weights of each link in the network. Basically, it’s an extension of the flame IFS algorithm, where at every iteration, you choose a function from a list of functions specific to the last function called. e.g.

void iterate()

{

	int iteration = 0;

xform1:

	doXform1();

	iteration++;

	if (iteration > MAX_ITERATIONS)

		return;

	float r = rand();

	if (r < xform1_1_weight)

		goto xform1;

	else if (r < xform1_2_weight)

		goto xform2;

	.....

xform2:

	doXform2();

	....

xform3:

....

}

The actual code is dynamically generated, so there could be any number of xforms, each with arbitrary code, as well as an arbitrary linkage to the next xform.

I checked the spec and found no mention of the flow control allowed, but I am almost certain that ‘switch’ at least is NOT supported.

In order to use a parallel programing technology effectively, the most important task to do first is identify what can be done in parallel. Looking at your example, unless there is something above this level, there is no parallelism. No parallelism == no faster than a cpu, generally. Something like writing a kernel where it does a rand(), and just goes thru a series of if-than-else’s, calling the functions directly sounds parallel. Each work unit handles a different iteration. A problem might be you have to write rand(), since I do not see it listed. The traditional rand is based off a seed value. You might use get_global_id(), so that each iteration has a different seed.

One thing i did see on flow control is recursion is NOT supported.

Iterate operates on a large pool of points, so there’s no shortage of parallelism. In addition, rand() actually uses an internal seed which is defined on a per-warp basis, so there’s no divergence. The details of all that are out of the scope of the example, I just wanted to put the bit that is relevant to the question at hand.

I ran a test kernel with a goto on OSX. It compiled & ran and produced the expected result. I change it slightly to also use the rand() function, but it cannot be found.

[codebox] 0 ( 0 - 21) kernel void gotoTest(

1 ( 21 - 49) const int intParm

2 ( 49 - 78) , global float *output)

3 ( 78 - 79) {

4 ( 79 - 111) if(intParm > 6) goto label1;

5 ( 111 - 127) *output = 3;

6 ( 127 - 138) return;

7 ( 138 - 145) label1:

8 ( 145 - 171) *output = 84 * rand();

9 ( 171 - 172) }


Error occurred on Context: 0 (GeForce 9400M)

[CL_BUILD_ERROR] : OpenCL Build Error : Compiler build log:

kernel referenced an external function rand, that could not be found.

[/codebox]

Rand() also cannot be found in Nvidia’s OpenCL Programmer’s Guide. It may indeed exist, at the Cuda level. How you would get to it, I do not know, but it is not included in OpenCL universally.

While I was at it, I tried ‘switch’ and it also works. You never asked, but threads should not have mis-info. Never tried it myself, but could have swore there was a thread where someone said it did not work. A search brought up nothing like though.

rand() is a host function from the C library (cstdlib IIRC). It’s not a language feature, it’s not supposed to work unless you actually have an implementation somewhere.

As far as I know OpenCL doesn’t define implementations of any of the popular C functions for use within kernels. AFAIR CUDA has a device clock() implementation that does the equivalent of popular host clock() from the host ctime library, but there’s no rand() either.

Excellent! That means I don’t have to go to bizarre lengths to emulate gotos with normal function calls!

As for rand(), I know that it’s not built in, so I have my own implementation. Here’s the code from my old Cuda version if anyone’s interested. It’s a Combined Tausworthe Generator, adjusted to use partial state sharing across a warp to allow greater performance and smaller total state while still retaining good quality.

__constant__ unsigned int shift1[4] = {6, 2, 13, 3};

__constant__ unsigned int shift2[4] = {13, 27, 21, 12};

__constant__ unsigned int shift3[4] = {18, 2, 7, 13};

__constant__ unsigned int offset[4] = {4294967294, 4294967288, 4294967280, 4294967168};

__shared__ unsigned int randStates[32];

__device__ unsigned int TausStep(unsigned int &z, int S1, int S2, int S3, unsigned int M)

{

	unsigned int b = (((z << S1) ^ z) >> S2);

	return z = (((z &M) << S3) ^ b);

}

__device__ unsigned int randInt()

{

	TausStep(randStates[threadIdx.x&31], shift1[threadIdx.x&3], shift2[threadIdx.x&3],shift3[threadIdx.x&3],offset[threadIdx.x&3]);

	return (randStates[(threadIdx.x)&31]^randStates[(threadIdx.x+1)&31]^randStates[(threadIdx.x+2)&31]^randStates[(threadIdx.x+3)&31]);

}

__device__ float randFloat()

//This function returns a random float in [0,1] and updates seed

{

	unsigned int y = randInt();

	return __int_as_float((y&0x007FFFFF)|0x3F800000)-1.0f;

}

__device__ float randFloatWarp()

//This function is a workaround for getting a warp wide rand number

{

	unsigned int y = randInt();

	return __int_as_float((randStates[(threadIdx.x&31)]&0x007FFFFF)|0x3F800000)-1.0f;

}

...

		w=randFloatWarp();

		if ((threadIdx.x&31)!=0)	//workaround for strange bug with compute 1.0 hardware where device crashes if all threads in a warp try to read from the same address

			w = __int_as_float((randStates[0]&0x007FFFFF)|0x3F800000)-1.0f;

...

You might notice the strange workaround in randFloatWarp. This is there because of an apparent bug in compute 1.0 hardware where having an entire warp reading from a single location in shared memory causes a crash (note that 1.3 hardware works as expected in this case).