Odd behavior. Bug in opencl implementation?

The below kernel behavior makes absolutely no sense to me. Are there a bug in the OpenCL implementation?

(it is a boiled-down bug-reproducing version of a larger and more meaningful kernel)

The kernel basically initializes a 16x16 array with non-zero integers. Also, it contains a dummy for-loop that, in this version, is irrelevant for the output, but… when the for-loop in the kernel has many iterations (e.g. 1000) the kernel seems to crash, the screen flickers and the output array is not initialized with non-zeros as it should be. With few iterations (e.g. 10) the output array is correctly initialized with non-zeros.

Also, decreasing the outer variable, “rows”, from 3344 to e.g. 144 also makes the code work as expected.

As you can see, I use the python interface, “pyopencl”, as a wrapper because of its syntactical simplicity. Here is a complete bug-reproducing boiled-down example:

(If the program prints “0” you have verified the bug).

import sys

import struct

import pyopencl as cl

import numpy

block_size = 16

matrixLength = 3101104

rows = 3344

row2width = numpy.zeros(rows, numpy.int32)

row2startIdx = numpy.zeros(rows, numpy.int32)

matrix = numpy.zeros(matrixLength, numpy.int32)

pl = cl.get_platforms()

devs = pl[0].get_devices(cl.device_type.GPU)

if(block_size > devs[0].get_info(cl.device_info.MAX_WORK_GROUP_SIZE)):

   print "Error: block_size is larger than MAX_WORK_GROUP_SIZE..."

   exit(1)

ctx = cl.Context(devs)

queue = cl.CommandQueue(ctx)

mf = cl.mem_flags

src = """

// Thread block size

#define BLOCK_SIZE 16

__kernel void test(__global int* C, __global int* A, __global int* rowWidths, __global int* rowStartIdxs)

{

	int bi = get_group_id(0);

	int bj = get_group_id(1);

	int ti = get_local_id(0);

	int tj = get_local_id(1);

	int rowAIdx =  bi * BLOCK_SIZE + ti;

	int rowBIdx =  bj * BLOCK_SIZE + tj;

	int cOut = 1;

	for(int x=0; x<1000; x++) {

	  __local int As[BLOCK_SIZE][BLOCK_SIZE];

	  __local int Bs[BLOCK_SIZE][BLOCK_SIZE];

	  As[ti][tj] = 1;

	  Bs[ti][tj] = 1;

	  barrier(CLK_LOCAL_MEM_FENCE);

	}

	

	if(cOut>0) C[ti*BLOCK_SIZE + tj] += 1;

}

""";

prg = cl.Program(ctx, src).build();

matrix_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=numpy.array(matrix).astype(numpy.int32))

row2width_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=numpy.array(row2width).astype(numpy.int32))

row2startIdx_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=numpy.array(row2startIdx).astype(numpy.int32))

o = numpy.zeros(block_size * block_size).astype(numpy.int32)

o_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=o)

prg.test(queue, [rows, rows], o_buf, matrix_buf, row2width_buf, row2startIdx_buf, local_size=(block_size, block_size))

cl.enqueue_read_buffer(queue, o_buf, o).wait()

print sum(o)

I use macOS 10.6.3, pyopencl-0.91.4 and have just installed gpucomputingsdk_2.3a_macos_32 from NVidia.

My machine is a macbook pro, and get_devices(cl.device_type.GPU) returns

[<pyopencl.Device 'GeForce 9400M' at 0x2022600>, <pyopencl.Device 'GeForce 9600M GT' at 0x1022600>]
for(int x=0; x<1000; x++) {

	  __local int As[BLOCK_SIZE][BLOCK_SIZE];

	  __local int Bs[BLOCK_SIZE][BLOCK_SIZE];

	  As[ti][tj] = 1;

	  Bs[ti][tj] = 1;

	  barrier(CLK_LOCAL_MEM_FENCE);

	}

As and Bs arrays are declared in loop scope. This shouldn’t ever work, you’re effectively creating a local array, writing to it and then discarding it as soon as the iteration is complete.

Have you tried moving those declarations before the loop?

As I wrote in my first post, this is a boil-down of a larger and more meaningful kernel, so As and Bs are not supposed to do anything useful in this example (except for proving the existence of the bug). This is just proof-of-bug code. :-)

Btw, has your OS gpu reset timer? sometimes gpu programs are not allowed to run long by os. And I am not sure about barrier inside the loop.

Problem solved! It seems that Nvidia hardware that is also used for display has a hard limit of 5 seconds of kernel execution time. After that, the kernel is aborted and an error returned. This hard limit could have been far better documented, IMHO, since it is pretty essential for computation intensive applications (most OpenCL programs I guess).

I solved the issue by splitting the problem into smaller subproblems and constructing the final solution from the sub results.

This is not about nvidia’s hardware, you ran into the OS’s watchdog timer. This is well documented, search the forum for “watchdog timer”.