The kernel behavior below makes no sense to me. Is there a bug in the OpenCL implementation?
(It is a boiled-down, bug-reproducing version of a larger, more meaningful kernel.)
The kernel initializes a 16x16 array with non-zero integers. It also contains a dummy for-loop that, in this version, is irrelevant to the output. However, when that loop runs many iterations (e.g. 1000), the kernel appears to crash: the screen flickers, and the output array is not initialized with non-zeros as it should be. With few iterations (e.g. 10), the output array is correctly filled with non-zeros.
Decreasing the variable "rows" from 3344 to e.g. 144 also makes the code work as expected.
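For scale, here is a quick back-of-the-envelope check (pure arithmetic, using the constants from the script below) showing how much larger the failing configuration's workload is than the working one's. Whether that size difference is actually what triggers the crash is only my speculation.

```python
# Back-of-the-envelope check on the launch sizes used in the script below.
# All constants are taken from this question's code.
block_size = 16
iterations = 1000

for rows in (3344, 144):
    work_items = rows * rows                 # global size is [rows, rows]
    work_groups = (rows // block_size) ** 2  # 16x16 work-items per group
    loop_steps = work_items * iterations     # total dummy-loop executions
    print("rows=%4d: %8d work-items in %5d groups, %.1e loop steps"
          % (rows, work_items, work_groups, loop_steps))
```

With rows=3344 the launch has over 11 million work-items, each spinning 1000 times; with rows=144 it is roughly 500 times smaller.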
As you can see, I use the Python interface, "pyopencl", as a wrapper because of its syntactic simplicity. Here is a complete, boiled-down, bug-reproducing example:
(If the program prints “0” you have verified the bug).
import sys
import struct
import pyopencl as cl
import numpy
block_size = 16
matrixLength = 3101104
rows = 3344
row2width = numpy.zeros(rows, numpy.int32)
row2startIdx = numpy.zeros(rows, numpy.int32)
matrix = numpy.zeros(matrixLength, numpy.int32)
pl = cl.get_platforms()
devs = pl[0].get_devices(cl.device_type.GPU)
if block_size > devs[0].get_info(cl.device_info.MAX_WORK_GROUP_SIZE):
    print "Error: block_size is larger than MAX_WORK_GROUP_SIZE..."
    exit(1)
ctx = cl.Context(devs)
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
src = """
// Thread block size
#define BLOCK_SIZE 16
__kernel void test(__global int* C, __global int* A,
                   __global int* rowWidths, __global int* rowStartIdxs)
{
    int bi = get_group_id(0);
    int bj = get_group_id(1);
    int ti = get_local_id(0);
    int tj = get_local_id(1);

    int rowAIdx = bi * BLOCK_SIZE + ti;
    int rowBIdx = bj * BLOCK_SIZE + tj;
    int cOut = 1;

    for (int x = 0; x < 1000; x++) {
        __local int As[BLOCK_SIZE][BLOCK_SIZE];
        __local int Bs[BLOCK_SIZE][BLOCK_SIZE];
        As[ti][tj] = 1;
        Bs[ti][tj] = 1;
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (cOut > 0)
        C[ti*BLOCK_SIZE + tj] += 1;
}
"""
prg = cl.Program(ctx, src).build()
matrix_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=numpy.array(matrix).astype(numpy.int32))
row2width_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=numpy.array(row2width).astype(numpy.int32))
row2startIdx_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=numpy.array(row2startIdx).astype(numpy.int32))
o = numpy.zeros(block_size * block_size).astype(numpy.int32)
o_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=o)
prg.test(queue, [rows, rows], o_buf, matrix_buf, row2width_buf, row2startIdx_buf, local_size=(block_size, block_size))
cl.enqueue_read_buffer(queue, o_buf, o).wait()
print sum(o)
I use Mac OS X 10.6.3, pyopencl-0.91.4, and have just installed gpucomputingsdk_2.3a_macos_32 from NVIDIA.
My machine is a MacBook Pro, and get_devices(cl.device_type.GPU) returns
[<pyopencl.Device 'GeForce 9400M' at 0x2022600>, <pyopencl.Device 'GeForce 9600M GT' at 0x1022600>]