Is task parallel programming (aka "concurrent kernels") in OpenCL supported?

According to the OpenCL spec, section 3.4.2, task parallel programming (TPP) (aka “concurrent kernels”, as in the Fermi white paper) should be possible by executing a single work-item per work-group with an NDSpace of {1,1,1}. To check this, I wrote a program that tries to execute TPP, and used it to check various platforms and devices for support.

What I’m finding is that NVIDIA’s OpenCL does not seem to support TPP, whereas AMD’s OpenCL does on my quad-core. In contrast, TPP is supported in the CUDA Runtime API. (My tests use the latest driver, 280.19, for OpenCL 1.1 on Windows 7.)

My question is: Do you have code that actually has concurrent kernel execution working in the NVIDIA OpenCL on a Fermi? Or, is task parallel programming just not supported?

Ken

My test code:

The kernel for this test simply increments a location in global memory in a for-loop, at a “position” that is task dependent.

__kernel void aaa(__global int * v, int times, int position)
{
    for (int i = 0; i < times; ++i)
    {
        v[position]++;
    }
}

I didn’t want to use atomics or synchronization because the OpenCL memory model relaxes consistency between work-items in different work-groups. This code spins for a long, but finite, time. (Of course, different kernels could be executed to make this a better TPP example. But clearly I’m not trying to do SIMD data parallelism here; otherwise I would have used an {n, 1, 1} NDSpace with a single kernel.)

In my program, the kernel is called two ways: “sequential” and “concurrent”.

The sequential code follows:

void sequential(cl_platform_id platform, cl_device_id device, cl_program program, cl_context context,
    int write_distance)
{
    cl_int err;
    struct _timeb t1;
    struct _timeb t2;
    std::cout << "Starting sequential...";
    _ftime_s(&t1);

    // Create memory for the writes in the kernels.
    size_t asize = sizeof(cl_int) * write_distance * opt->children;
    cl_int * a = (cl_int*) malloc(asize);
    memset(a, 0, asize);
    cl_int r1;
    cl_mem da = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, asize, a, &r1);
    CHECK(r1);

    // Set up the NDSpace: one work-item in the entire NDSpace.
    size_t tile[3] = {1, 1, 1};
    size_t tiles[3] = {1, 1, 1};
    int max_dimensionality = 3;

    // Create kernels with arguments, command queue, queue kernel, flush, and wait, ALL SEQUENTIALLY.
    cl_kernel * kernels = (cl_kernel *)malloc(opt->children * sizeof(cl_kernel));
    cl_command_queue * queues = (cl_command_queue *)malloc(opt->children * sizeof(cl_command_queue));
    cl_event * events = (cl_event *)malloc(opt->children * sizeof(cl_event));
    for (int i = 0; i < opt->children; ++i)
    {
        kernels[i] = clCreateKernel(program, "aaa", &err);
        CHECK(err);
        err = clSetKernelArg(kernels[i], 0, sizeof(cl_mem), (void *) &da);
        CHECK(err);
        err = clSetKernelArg(kernels[i], 1, sizeof(cl_int), (void *) &opt->times);
        CHECK(err);
        int position = i * write_distance;
        err = clSetKernelArg(kernels[i], 2, sizeof(cl_int), (void *) &position);
        CHECK(err);
        queues[i] = clCreateCommandQueue(context, device, 0, &err);
        CHECK(err);
        err = clEnqueueNDRangeKernel(queues[i], kernels[i], max_dimensionality, NULL, tiles, tile, 0, NULL, &events[i]);
        CHECK(err);
        err = clFlush(queues[i]);
        CHECK(err);
        err = clWaitForEvents(1, &events[i]);
        CHECK(err);
        err = clReleaseCommandQueue(queues[i]);
        CHECK(err);
        err = clReleaseKernel(kernels[i]);
        CHECK(err);
    }

    // Read back the output array.
    cl_command_queue cmd_queue = clCreateCommandQueue(context, device, 0, &err);
    CHECK(err);
    err = clEnqueueReadBuffer(cmd_queue, da, CL_TRUE, 0, asize, a, 0, NULL, NULL);
    CHECK(err);
    err = clReleaseCommandQueue(cmd_queue);
    CHECK(err);
    err = clReleaseMemObject(da);
    CHECK(err);

    _ftime_s(&t2);
    std::cout << (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))/1000) << " s.\n";
}

In essence, this routine creates “children” tasks, each on its own cl_command_queue, which is clFlush’ed immediately. This code forces serialization of the tasks because each task is not created and enqueued until the previous one has finished executing on the device.

The concurrent code follows:

void concurrent(cl_platform_id platform, cl_device_id device, cl_program program, cl_context context,
    int write_distance)
{
    cl_int err;
    struct _timeb t1;
    struct _timeb t2;
    std::cout << "Starting concurrent...";
    _ftime_s(&t1);

    // Create memory for the writes in the kernels.
    size_t asize = sizeof(cl_int) * write_distance * opt->children;
    cl_int * a = (cl_int*) malloc(asize);
    memset(a, 0, asize);
    cl_int r1;
    cl_mem da = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, asize, a, &r1);
    CHECK(r1);

    // Set up the NDSpace: one work-item in the entire NDSpace.
    size_t tile[3] = {1, 1, 1};
    size_t tiles[3] = {1, 1, 1};
    int max_dimensionality = 3;

    // Create kernels with arguments, command queue, queue kernel, flush, and wait, ALL CONCURRENTLY.
    cl_kernel * kernels = (cl_kernel *)malloc(opt->children * sizeof(cl_kernel));
    cl_command_queue * queues = (cl_command_queue *)malloc(opt->children * sizeof(cl_command_queue));
    cl_event * events = (cl_event *)malloc(opt->children * sizeof(cl_event));
    for (int i = 0; i < opt->children; ++i)
    {
        kernels[i] = clCreateKernel(program, "aaa", &err);
        CHECK(err);
        err = clSetKernelArg(kernels[i], 0, sizeof(cl_mem), (void *) &da);
        CHECK(err);
        err = clSetKernelArg(kernels[i], 1, sizeof(cl_int), (void *) &opt->times);
        CHECK(err);
        int position = i * write_distance;
        err = clSetKernelArg(kernels[i], 2, sizeof(cl_int), (void *) &position);
        CHECK(err);
        queues[i] = clCreateCommandQueue(context, device, 0, &err);
        CHECK(err);
    }
    for (int i = 0; i < opt->children; ++i)
    {
        err = clEnqueueNDRangeKernel(queues[i], kernels[i], max_dimensionality, NULL, tiles, tile, 0, NULL, &events[i]);
        CHECK(err);
    }
    for (int i = 0; i < opt->children; ++i)
    {
        err = clFlush(queues[i]);
        CHECK(err);
    }
    for (int i = 0; i < opt->children; ++i)
    {
        err = clWaitForEvents(1, &events[i]);
        CHECK(err);
    }
    for (int i = 0; i < opt->children; ++i)
    {
        err = clReleaseCommandQueue(queues[i]);
        CHECK(err);
        err = clReleaseKernel(kernels[i]);
        CHECK(err);
    }

    // Read back the output array.
    cl_command_queue cmd_queue = clCreateCommandQueue(context, device, 0, &err);
    CHECK(err);
    err = clEnqueueReadBuffer(cmd_queue, da, CL_TRUE, 0, asize, a, 0, NULL, NULL);
    CHECK(err);
    err = clReleaseCommandQueue(cmd_queue);
    CHECK(err);
    err = clReleaseMemObject(da);
    CHECK(err);

    _ftime_s(&t2);
    std::cout << (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))/1000) << " s.\n";
}

This code creates “children” tasks all at once. Each task is placed on its own queue, and then each queue is clFlush’ed. The tasks execute concurrently because each task runs for much longer than the time it takes to place it on a command queue.

My program accepts command-line parameters to choose the platform, the device, the number of tasks to create, the number of iterations of the kernel’s for-loop, and how far apart to spread the memory writes of one task from the next.

The complete MSVC++ 2010 solution is here.

Results:

When I run this on my quad-core using the AMD OpenCL platform, I get a nice speedup of around 4x, which is what I would expect:

$ ocl-task-parallel.exe --platform 1 --children 10 --times 123456789
Number of platforms = 2
Platform profile: FULL_PROFILE
Platform version: OpenCL 1.1 AMD-APP-SDK-v2.5 (684.212)
Platform name: AMD Accelerated Parallel Processing
Platform vendor: Advanced Micro Devices, Inc.
Platform extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing
devices = 1
            Device [0]
                type                          = CL_DEVICE_TYPE_CPU
                name                          = Intel(R) Core(TM)2 Quad CPU @ 2.40GHz
Starting sequential...3.459 s.
Starting sequential...3.459 s.
Starting sequential...3.456 s.
Starting sequential...3.465 s.
Starting sequential...3.461 s.
Starting sequential...3.462 s.
Starting concurrent...0.924 s.
Starting concurrent...0.972 s.
Starting concurrent...0.94 s.
Starting concurrent...0.923 s.
Starting concurrent...0.926 s.
Starting concurrent...0.929 s.

However, on an NVIDIA GTX 470, the concurrent tasks run in the same time as the sequential tasks:

$ ocl-task-parallel.exe --platform 0 --children 4 --times 1234567
Number of platforms = 2
Platform profile: FULL_PROFILE
Platform version: OpenCL 1.1 CUDA 4.0.1
Platform name: NVIDIA CUDA
Platform vendor: NVIDIA Corporation
Platform extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll
devices = 1
            Device [0]
                type                          = CL_DEVICE_TYPE_GPU
                name                          = GeForce GTX 470
Starting sequential...1.988 s.
Starting sequential...1.987 s.
Starting sequential...1.988 s.
Starting sequential...1.987 s.
Starting sequential...1.988 s.
Starting sequential...1.987 s.
Starting concurrent...1.986 s.
Starting concurrent...1.988 s.
Starting concurrent...1.987 s.
Starting concurrent...1.985 s.
Starting concurrent...1.988 s.
Starting concurrent...1.987 s.

Concurrent kernels should be possible on a Fermi in CUDA. To test that, I wrote a program in the CUDA Runtime API similar to the OpenCL solution. This code shows that concurrent kernels definitely work on a Fermi (see the CUDA Runtime API solution here).

$ cuda-r-task-parallel.exe --children 10 --times 1234567
devices = 1
Starting sequential...0.818 s.
Starting sequential...0.777 s.
Starting sequential...0.777 s.
Starting sequential...0.777 s.
Starting sequential...0.777 s.
Starting sequential...0.777 s.
Starting concurrent...0.079 s.
Starting concurrent...0.08 s.
Starting concurrent...0.079 s.
Starting concurrent...0.079 s.
Starting concurrent...0.08 s.
Starting concurrent...0.08 s.
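Since the CUDA version is only linked, not inlined, its concurrent path is presumably the standard one-stream-per-task pattern that enables kernel overlap on Fermi. A rough sketch of that pattern (assumed names and structure, not the actual code from the linked solution; the device pointer dv is assumed to be allocated and zeroed by the caller):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Same spinning kernel as the OpenCL version.
__global__ void aaa(int * v, int times, int position)
{
    for (int i = 0; i < times; ++i)
        v[position]++;
}

void concurrent(int * dv, int children, int times, int write_distance)
{
    // One stream per task; kernels launched into different streams
    // may execute concurrently on Fermi-class hardware.
    cudaStream_t * streams = (cudaStream_t *)malloc(children * sizeof(cudaStream_t));
    for (int i = 0; i < children; ++i)
        cudaStreamCreate(&streams[i]);

    // Launch every single-thread kernel before waiting on any of them,
    // mirroring the enqueue-all-then-flush-all OpenCL structure above.
    for (int i = 0; i < children; ++i)
        aaa<<<1, 1, 0, streams[i]>>>(dv, times, i * write_distance);

    // Wait for all streams to drain.
    cudaDeviceSynchronize();

    for (int i = 0; i < children; ++i)
        cudaStreamDestroy(streams[i]);
    free(streams);
}
```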

I’ve run into this same problem myself. I haven’t been able to get concurrent OpenCL kernel execution working across multiple GPUs if I place all of the GPUs in the same context and just give them separate queues. Instead, I have to create a separate context for each GPU, which rather defeats the whole point, IMHO.

This issue has cropped up a number of times in the forums. For whatever reason, NVIDIA have been very unresponsive, and are effectively ignoring the problem. Someone from NVIDIA may wish to comment here…