clUtil - making OpenCL as easy to use as CUDA

I’ve written a library that abstracts a lot of the tedium away from OpenCL. It makes assumptions about devices, contexts, etc. that significantly reduce the number of handles you have floating around, as well as the number of function calls you need to do anything. Consider the following code that initializes an array with 20 in each element:

#include <CL/cl.h>

char const* kernelSource =
    "__kernel void fill(__global float* array, unsigned int arrayLength, float val)"
    "{"
    "    if(get_global_id(0) < arrayLength)"
    "    {"
    "        array[get_global_id(0)] = val;"
    "    }"
    "}";


int main(int argc, char** argv)
{
    float val = 20.0f;
    float array[2000];
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;
    cl_context context;
    cl_command_queue commandQueue;
    cl_mem buffer;
    cl_program program;
    cl_kernel kernel;
    unsigned int length = 2000;


    err = clGetPlatformIDs(1, &platform, NULL);
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
    context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    commandQueue = clCreateCommandQueue(context, device, 0, &err);
    program = clCreateProgramWithSource(context, 1, &kernelSource, 0, &err);
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    kernel = clCreateKernel(program, "fill", &err);

//Allocate memory
    buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(array), NULL, &err);

//Actually call the kernel
    err = clSetKernelArg(kernel, 0, sizeof(buffer), &buffer);
    err = clSetKernelArg(kernel, 1, sizeof(length), &length);
    err = clSetKernelArg(kernel, 2, sizeof(val), &val);

    size_t global;
    size_t local = 64;

    global = length % local == 0 ? length : (length / local + 1) * local;

    err = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

//Copy data back
    err = clEnqueueReadBuffer(commandQueue, buffer, CL_TRUE, 0, sizeof(array), array, 0, NULL, NULL);

//Free the data on the GPU
    clReleaseMemObject(buffer);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(commandQueue);
    clReleaseContext(context);

    return 0;
}
Here’s the same program written with clUtil. The kernel moves into its own .cl file:

__kernel void fill(__global float* array, unsigned int arrayLength, float val)
{
    if(get_global_id(0) < arrayLength)
    {
        array[get_global_id(0)] = val;
    }
}

And the host code shrinks to:

#include <clUtil.h>

int main(int argc, char** argv)
{
    const char* kernelFiles[] = {""};
    cl_mem buffer;
    float array[2000];
    unsigned int length = 2000;
    float val = 20.0f;

    clUtilInitialize(kernelFiles, 1);
    clUtilAlloc(sizeof(array), &buffer);
    clUtilEnqueueKernel("fill", clUtilGrid(length, 64), buffer, length, val);
    clUtilDeviceGet(array, sizeof(array), buffer);

    return 0;
}

It uses lots of C++0x features, so you’ll need at least gcc 4.3. It presently targets Linux, and you’ll have to build it from source. Enjoy.



This looks like great work, and should ease the porting of CUDA applications to OpenCL. A few questions:


1. Is this a long term research project, i.e., is this something that you’re likely to maintain for the foreseeable future?

2. From looking at the documentation, the missing features that I require before using this are 1d texture support with 16-bit integer to float conversion and asynchronous memory copying, with the former being more of a show stopper than the latter. Any ETA on these?

I’m one of the maintainers of the QUDA library, which is a collection of optimized linear solvers for lattice QCD (particle physics done on a finite 4-D spacetime grid). We’ve been thinking about OpenCL for some time, but our inertia has been significant. This could really ease the porting for us (it doesn’t help with the other major issue we have with OpenCL, which is the lack of C++ template support…).

I do quite a bit of research using OpenCL, so I’ll probably continue maintaining it for the foreseeable future.

As for 1-D textures, they aren’t inherently supported by OpenCL. Depending on what you’re trying to do, you can probably emulate them with 2-D textures declared with CL_UNORM_INT16 and CL_A, and write wrapper functions that convert a 1-D index into a 2-D index when sampling. These would probably be macros or functions that would go in one of your .cl files.

As for asynchronous execution, I plan to add this in the next few months. It will probably be implemented using lambdas, so you don’t need a bunch of callback functions cluttering up your code.

Looks interesting, but I’m just wondering: what was your motivation for creating this in addition to the official OpenCL C++ wrapper that’s already available, which also cuts down the amount of code you have to write quite a bit?

From a cursory glance, it looked like it put a band-aid on a shotgun wound: you still have to manage command queues, devices, programs, kernels, and all that junk.

Just an update, clUtil now supports asynchronous execution as well as 1-D textures. An example of each of these is included in the examples directory.

Update (2/25): I’ve started adding functions callable from kernels for doing simple tasks within a work group (such as sort, sum, max, min, scan, etc.). Most existing implementations of these are standalone kernels that operate on large arrays, which isn’t as useful if, for example, you want to sum a bunch of small arrays in parallel. These helpers fill that gap by providing routines that work at work-group granularity on smaller arrays, so you can run lots of them in parallel alongside other operations in a single kernel.

Also, I’ve added out-of-order execution mode.

If you have any requests, send them to the clUtil users group: