OpenCL Asynchronous Kernel Launches

It appears that NVIDIA’s OpenCL implementation is still[1] blocking on calls to clEnqueueNDRangeKernel, or possibly on clEnqueueWriteBuffer / clEnqueueReadBuffer.

You can observe this behavior by creating a high-iteration loop that queues a buffer write, followed by a kernel call, followed by a buffer read. Use the OpenCL wait-list arguments and events to make each command wait for the previous one. After the loop, use clWaitForEvents to wait for the last read event to complete (a concrete sketch follows the pseudocode below).

for i = 0; i < LARGE_NUMBER; i++ {
    queue a write, waiting on the last read unless i = 0
    queue a kernel, waiting on the above write
    queue a read, waiting on the above kernel
}

wait on the last read
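
In OpenCL terms, the loop looks roughly like this. This is only a minimal sketch of the pattern described above; the helper name enqueue_chain, NUM_ITERS, and the buffer names are illustrative, not my actual code. All three enqueues are non-blocking (CL_FALSE), and only the final clWaitForEvents should block:

#include <CL/opencl.h>

#define NUM_ITERS 100000

/* illustrative sketch: chain write -> kernel -> read NUM_ITERS times via
   event wait lists, blocking only once at the very end */
cl_int enqueue_chain(cl_command_queue queue, cl_kernel kernel,
                     cl_mem dev_buf, float *host_buf, size_t nbytes,
                     size_t gsize)
{
  cl_event write_ev, kernel_ev, read_ev, prev_read_ev = NULL;
  cl_int err;

  for (int i = 0; i < NUM_ITERS; i++) {
    /* the write waits on the previous iteration's read (nothing on the first pass) */
    err = clEnqueueWriteBuffer(queue, dev_buf, CL_FALSE, 0, nbytes, host_buf,
                               prev_read_ev ? 1 : 0,
                               prev_read_ev ? &prev_read_ev : NULL, &write_ev);
    if (err != CL_SUCCESS) return err;
    if (prev_read_ev) clReleaseEvent(prev_read_ev);

    /* the kernel waits on the write */
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL,
                                 1, &write_ev, &kernel_ev);
    if (err != CL_SUCCESS) return err;

    /* the (non-blocking) read waits on the kernel */
    err = clEnqueueReadBuffer(queue, dev_buf, CL_FALSE, 0, nbytes, host_buf,
                              1, &kernel_ev, &read_ev);
    if (err != CL_SUCCESS) return err;

    clReleaseEvent(write_ev);
    clReleaseEvent(kernel_ev);
    prev_read_ev = read_ev;
  }

  /* only here should the host actually block */
  err = clWaitForEvents(1, &prev_read_ev);
  clReleaseEvent(prev_read_ev);
  return err;
}

With a non-blocking implementation, the loop itself should return almost immediately and the time should be spent in clWaitForEvents; on my setup the time is spent inside the loop instead.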

When I do this, the loop that does the queuing takes a very long time, and the final wait completes instantly. This lends evidence to the claim that NVIDIA’s OpenCL implementation is blocking on reads, writes, or kernel launches.

For further evidence, install AMD APP[2] and run the same code on a CPU (you can install AMD APP without an AMD card). You’ll see the queuing loop finish almost instantly, and the program will spend its time in the final wait instead.
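
Selecting the CPU device once AMD APP is installed can be done with something like the following sketch (the names and the 16-platform cap are just illustrative):

#include <CL/opencl.h>
#include <stdio.h>

int main(void)
{
  cl_uint nplat = 0;
  clGetPlatformIDs(0, NULL, &nplat);
  if (nplat > 16) nplat = 16;
  cl_platform_id plats[16];
  clGetPlatformIDs(nplat, plats, NULL);

  /* scan the platforms for a CPU device; with AMD APP installed, the AMD
     platform exposes the host CPU even without an AMD GPU */
  for (cl_uint p = 0; p < nplat; p++) {
    cl_device_id dev;
    if (clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_CPU, 1, &dev, NULL) == CL_SUCCESS) {
      char name[256] = {0};
      clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, NULL);
      printf("using CPU device: %s\n", name);
      /* build the context/queue on this device and rerun the same queuing loop */
      return 0;
    }
  }
  fprintf(stderr, "no CPU OpenCL device found\n");
  return 1;
}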

I suppose the other possibility is that NVIDIA’s queuing algorithm/implementation is so slow that the GPU is actually beating it, even for fairly substantial kernels. That would be a bug in and of itself. However, I doubt this is the case, because when I increase the workload of each kernel, the time to run the queuing loop also increases.
It really ought not to matter what hardware I’m running, but just in case it does, I’m testing on a Quadro K5000 and an Intel Xeon X5482.

Note that I’m not attempting to run multiple kernels concurrently, merely to queue them asynchronously. Being able to queue kernels asynchronously would allow a (concurrency-capable) device to run concurrent kernels from a single host thread, which would be, you know, a billion times better than having to fiddle around with synchronizing an apparently non-conforming OpenCL implementation.
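
For example, with truly asynchronous enqueues, something like the following sketch (the function name and kernel handles are illustrative) would let one host thread feed two queues and give a capable device the chance to overlap the kernels:

#include <CL/opencl.h>

/* illustrative sketch: one host thread, two in-order queues on the same device */
void launch_pair(cl_context ctx, cl_device_id dev,
                 cl_kernel kernelA, cl_kernel kernelB, size_t gsize)
{
  cl_int err;
  cl_command_queue q0 = clCreateCommandQueue(ctx, dev, 0, &err);
  cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);

  /* both calls should return immediately if the implementation is asynchronous */
  clEnqueueNDRangeKernel(q0, kernelA, 1, NULL, &gsize, NULL, 0, NULL, NULL);
  clEnqueueNDRangeKernel(q1, kernelB, 1, NULL, &gsize, NULL, 0, NULL, NULL);

  /* block only once, after everything has been queued */
  clFinish(q0);
  clFinish(q1);

  clReleaseCommandQueue(q0);
  clReleaseCommandQueue(q1);
}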

Has anyone else observed this behavior? Is it a problem with my code, or with NVidia’s OpenCL implementation? Have I provided sufficient information?

  1. https://devtalk.nvidia.com/default/topic/415023/launch-kernels-in-parallel-/
  2. http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/

P.S. I sure am glad I write my forum posts in Emacs and not in the browser! I forgot a subject line, and after submitting, my post was no longer in the text box!

Edit: I forgot to ask a question.


Any updates on this?

It seems that this issue is still plaguing the NVIDIA drivers.

Is this on Linux, Windows or Mac OS X?

The reason I am asking is that on Windows Vista, 7, 8, and 8.1, the WDDM graphics driver model may affect kernel launch behavior somewhat.

My experience with NVIDIA is from Windows. Anecdotal evidence dug up from the depths of the internet suggests that the Linux drivers suffer from the same problem.

However, OpenGL compute shaders (which, for a single command queue and a single GPU, map 1:1 to OpenCL; you can convert the kernels almost entirely with search & replace) work brilliantly with the NVIDIA drivers. The Intel and AMD drivers also work fine with OpenCL.

It looks like I’m experiencing the same problem with a Quadro K4000 in a medical application, where many OpenCL kernels are launched sequentially and many data transfers occur. The GPU is used to render general graphics as well. The K5000 uses the GK104 GPU and the K4000 uses the GK106 GPU; both are compute capability 3.0, so Hyper-Q is not supported. My OpenCL software works fine on a Tesla C2075 (Fermi), so I believe the problem occurs because the GPU has to render graphics and run compute kernels simultaneously, which could be the cause of these issues.

I’m definitely seeing this. I’m using a long-running kernel, so I can see that the enqueue call does not return until the kernel finishes processing. It didn’t use to work that way. In a previous job I wrote OpenCL code that relied on clEnqueueNDRangeKernel being asynchronous, and it worked.

From what I see, clEnqueueNDRangeKernel is asynchronous. Here is my test case:

$ cat t5.cpp
#include <CL/opencl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

#define TILE_WIDTH 16
#define DS 16384

const char source[] =
"__kernel void matrix_multiply(__global float *A, __global float *B,"
" __global float *C, int width)"
"{"
"     __local float Ashare[TILE_WIDTH][TILE_WIDTH];"
"     __local float Bshare[TILE_WIDTH][TILE_WIDTH];"
"   int bx = get_group_id(0);"
"   int by = get_group_id(1);"
"   int tx = get_local_id(0);"
"   int ty = get_local_id(1);"
"   int row = by * TILE_WIDTH + ty;"
"   int col = bx * TILE_WIDTH + tx;"
"   float result = 0;"
"   for (int m = 0; m < width / TILE_WIDTH; m++) {"
"     Ashare[ty][tx] = A[(row * width) + (m * TILE_WIDTH) + tx];"
"     Bshare[ty][tx] = B[(((m * TILE_WIDTH) + ty) * width) + col];"
"     barrier(CLK_LOCAL_MEM_FENCE); "
"       for (int k = 0; k < TILE_WIDTH; k++) {"
"         result += Ashare[ty][k] * Bshare[k][tx];"
"       }"
"     barrier(CLK_LOCAL_MEM_FENCE); "
"   }"
"   C[(row * width) + col] = result;"
" };"

;

int main(int argc, char *argv[])
{
  cl_platform_id platform;
  cl_device_id device;
  cl_context context;
  cl_command_queue queue1, queue2;
  cl_program program;
  cl_mem mem1, mem2, mem3;
  cl_kernel kernel;
  cl_int err;

  err = clGetPlatformIDs(1, &platform, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
  queue1 = clCreateCommandQueue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);

  const char *sources[1] = {source};
  program = clCreateProgramWithSource(context, 1, sources, NULL, &err);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clBuildProgram(program, 1, &device, "-D TILE_WIDTH=16", NULL, NULL);
  if (err == CL_BUILD_PROGRAM_FAILURE) {
    // Determine the size of the log
    size_t log_size;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);


    // Allocate memory for the log
    char *log = (char *) malloc(log_size);

    // Get the log
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);

    // Print the log
    printf("%s\n", log);
  }

  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  mem1 = clCreateBuffer(context, CL_MEM_READ_WRITE, DS*DS*sizeof(float), NULL, NULL);
  mem2 = clCreateBuffer(context, CL_MEM_READ_WRITE, DS*DS*sizeof(float), NULL, NULL);
  mem3 = clCreateBuffer(context, CL_MEM_READ_WRITE, DS*DS*sizeof(float), NULL, NULL);
  float *hdata = new float[DS*DS];
  for (int i = 0; i < DS*DS; i++) hdata[i] = 1;
  kernel = clCreateKernel(program, "matrix_multiply", &err);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  const size_t gwork_size[2] = {DS,DS};
  const size_t lwork_size[2] = {TILE_WIDTH,TILE_WIDTH};
  int msize = DS;
  err = clSetKernelArg(kernel, 0, sizeof(mem1), &mem1);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clSetKernelArg(kernel, 1, sizeof(mem2), &mem2);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clSetKernelArg(kernel, 2, sizeof(mem3), &mem3);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clSetKernelArg(kernel, 3, sizeof(msize), &msize);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}

  err = clEnqueueWriteBuffer(queue1, mem1, CL_TRUE, 0, DS*DS*sizeof(float), hdata, 0, NULL, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  err = clEnqueueWriteBuffer(queue1, mem2, CL_TRUE, 0, DS*DS*sizeof(float), hdata, 0, NULL, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  unsigned long long dt = dtime_usec(0);
  // time only the enqueue call; if it is asynchronous, this should return almost immediately
  err = clEnqueueNDRangeKernel(queue1, kernel, 2, NULL, gwork_size, lwork_size, 0, NULL, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  dt = dtime_usec(dt);
  printf("elapsed time for kernel enqueue: %fs\n", ((float)dt)/USECPSEC);
  dt = dtime_usec(0);
  // the blocking read forces the host to wait for the kernel to finish, so this
  // measures roughly the kernel execution time plus the device-to-host transfer
  err = clEnqueueReadBuffer(queue1, mem3, CL_TRUE, 0, DS*DS*sizeof(float), hdata, 0, NULL, NULL);
  if (err != CL_SUCCESS) {printf("%d: %d\n", __LINE__, err); return -1;}
  dt = dtime_usec(dt);
  printf("elapsed time for kernel completion: %fs\n", ((float)dt)/USECPSEC);
  for (int i = 0; i < DS*DS; i++)
    if (hdata[i] != DS) {printf("error at %d, was %f, should be %f\n", i, hdata[i], (float)DS); return 1;}
  printf("success!\n");
  return 0;
}
$ g++ t5.cpp -I/usr/local/cuda/include -o t5 -L/usr/local/cuda/lib64 -lOpenCL
$ ./t5
elapsed time for kernel enqueue: 0.001562s
elapsed time for kernel completion: 3.495233s
success!
$

CentOS 7, driver 470.57.02, Tesla V100