OpenCL command queue overlap

Hello All,

I am running OpenCL code on an NVIDIA RTX 2080.

I have 2 cl_command_queue objects that I want to overlap in execution.

The kernels in both command queues access the same data on the GPU:

Kernels in command queue #1 only read from the data.
Kernels in command queue #2 read from and write to the data.

The kernels also have access to OpenCL resources.

I start both command queues at the same time and then synchronize them when their work is finished.

As it is, however, the driver waits until command queue #1 is done before starting command queue #2.

I have tried declaring 2 separate cl_mem objects (one read-write, one read-only), but I get a runtime assert when I do that.

What are the requirements for command queues to execute in parallel and act on the same global memory?

Thanks in Advance,

If you have launched kernels in one queue and kernels in another queue, then without any other information there is no reason to conclude that those kernels will either overlap or run in an interleaved fashion.

It’s a plausible outcome for the driver to run one queue to completion before starting the other.

If you desire some other ordering, then the way to ensure ordering is to launch things into the same queue, not different queues.
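If you do need an ordering guarantee between work in two different queues, OpenCL events can express that dependency explicitly. A sketch only (kernelA and kernelB are hypothetical names of mine; the context, queues, and work_size are assumed to be set up as usual):

```c
/* Sketch: make kernelB (queue2) wait for kernelA (queue1).
   kernelA/kernelB are hypothetical; queue1, queue2, and
   work_size are assumed to already exist. */
cl_event a_done;
clEnqueueNDRangeKernel(queue1, kernelA, 1, NULL, work_size, NULL,
                       0, NULL, &a_done);
/* the wait list delays kernelB until a_done completes,
   even though kernelB is enqueued in a different queue */
clEnqueueNDRangeKernel(queue2, kernelB, 1, NULL, work_size, NULL,
                       1, &a_done, NULL);
clReleaseEvent(a_done);
```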

If you desire overlap of kernels from queue1 with kernels from queue2, that may not be possible: the GPU does not necessarily have enough capacity to ensure that any two given kernels can overlap.

I’m not likely to spend much time discussing this at length without complete code to study. I find that code answers lots of questions, and discussions are much more difficult without it.

Hi Robert,

Yes, my goal is to have overlap of a kernel in queue1 with kernels in queue2.

I attached a simplified source file (openCL_pipeline.cpp) showing how the pipeline runs.
openCL_pipeline.cpp (2.6 KB)

There are 7 kernels running sequentially in queue1 , and 1 kernel running in queue2:

queue1: [action_kernel → preupdate_kernel → preupdate_kernel_2 → game_updatepre1_kernel → update_kernel → update2_kernel → post_update_kernel]

queue2: [gui_kernel]


The Stage1_Begin() function starts the execution of both queues, and Stage1_End() waits for both queues to finish (synchronize).
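In outline (simplified here; the real versions are in the attached GameGPUCompute.cpp), they do something like this:

```c
/* Outline only - the attached GameGPUCompute.cpp has the real code. */
void Stage1_Begin() {
    /* ... kernels are enqueued into queue1 and queue2 here ... */
    clFlush(queue1);  /* begin submitting queue1's work */
    clFlush(queue2);  /* begin submitting queue2's work */
}

void Stage1_End() {
    clFinish(queue1); /* block until queue1 has drained */
    clFinish(queue2); /* block until queue2 has drained */
}
```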

The gui_kernel can run 2-15 ms and (on other hardware) usually finishes executing at about the time update_kernel is running on queue1. As it is, gui_kernel runs first and only then does action_kernel start.

The full source file with buffer creation and context creation is also attached (GameGPUCompute.cpp)
GameGPUCompute.cpp (27.4 KB)
GameGPUCompute.h (4.5 KB)

I don’t know of any “special” requirements to have overlap of two kernels issued to two separate command queues. Here is a simple example:

$ cat t13.cpp
#include <CL/opencl.h>
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>
#include <stdlib.h>

const char source[] =
"__kernel void test_rotate(__global ulong *d_count, ulong loops, ulong patt)"
"{"
"  ulong n = patt;"
"  for (ulong i = 0; i<loops; i++)"
"    n &= (107 << (patt+(i%7)));"
"  d_count[0] = n + loops;"
"}";

int main(int argc, char *argv[]){
  cl_platform_id platform;
  cl_device_id device;
  cl_context context;
  cl_command_queue queue1, queue2;
  cl_program program;
  cl_mem mem1, mem2;
  cl_kernel kernel;

  bool two_kernels = false;
  unsigned long long loops = 1000;
  if (argc > 1) loops *= atoi(argv[1]);
  if (argc > 2) two_kernels = true;
  if (two_kernels) printf("running two kernels\n");
  else printf("running one kernel\n");
  printf("running  %llu loops\n", loops);
  unsigned long long pattern = 1;
  clGetPlatformIDs(1, &platform, NULL);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
  context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
  queue1 = clCreateCommandQueue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);
  queue2 = clCreateCommandQueue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);

  const char *sources[1] = {source};
  program = clCreateProgramWithSource(context, 1, sources, NULL, NULL);
  clBuildProgram(program, 1, &device, NULL, NULL, NULL);
  mem1 = clCreateBuffer(context, CL_MEM_READ_WRITE, 1*sizeof(cl_ulong), NULL, NULL);
  mem2 = clCreateBuffer(context, CL_MEM_READ_WRITE, 1*sizeof(cl_ulong), NULL, NULL);
  kernel = clCreateKernel(program, "test_rotate", NULL);
  const size_t work_size[1] = {1};
  clSetKernelArg(kernel, 0, sizeof(mem1), &mem1);
  clSetKernelArg(kernel, 1, sizeof(loops), &loops);
  clSetKernelArg(kernel, 2, sizeof(pattern), &pattern);

  clEnqueueNDRangeKernel(queue1, kernel, 1, NULL, work_size, work_size, 0, NULL, NULL);
  if (two_kernels){
    clSetKernelArg(kernel, 0, sizeof(mem2), &mem2);
    clSetKernelArg(kernel, 1, sizeof(loops), &loops);
    clSetKernelArg(kernel, 2, sizeof(pattern), &pattern);

    clEnqueueNDRangeKernel(queue2, kernel, 1, NULL, work_size, work_size, 0, NULL, NULL);
  }
  cl_ulong *buf1 = (cl_ulong *)clEnqueueMapBuffer(queue1, mem1, true, CL_MAP_READ, 0, 1*sizeof(cl_ulong), 0, NULL, NULL, NULL);
  cl_ulong *buf2 = (cl_ulong *)clEnqueueMapBuffer(queue2, mem2, true, CL_MAP_READ, 0, 1*sizeof(cl_ulong), 0, NULL, NULL, NULL);
  printf("result1: %lu\n", buf1[0]);
  printf("result2: %lu\n", buf2[0]);
  clEnqueueUnmapMemObject(queue1, mem1, buf1, 0, NULL, NULL);
  clEnqueueUnmapMemObject(queue2, mem2, buf2, 0, NULL, NULL);
  return 0;
}
$ g++ t13.cpp -o t13 -I/usr/local/cuda/include -lOpenCL
$ time ./t13
running one kernel
running  1000 loops
result1: 1000
result2: 0

real    0m0.452s
user    0m0.049s
sys     0m0.384s
$ time ./t13 100000
running one kernel
running  100000000 loops
result1: 100000000
result2: 0

real    0m4.236s
user    0m2.620s
sys     0m1.599s
$ time ./t13 100000 1
running two kernels
running  100000000 loops
result1: 100000000
result2: 100000000

real    0m4.199s
user    0m2.673s
sys     0m1.514s

(CUDA 11.4, CentOS 7, Tesla V100, driver 470.57.02)

I conclude from the above test case that the two kernels are running concurrently, i.e. overlapped.

Note: Although I happened to have CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE set on both queues in the above example, that isn’t necessary for this simple demonstration, and you’ll get a similar result if you replace those flags with 0.
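That is, the two queue-creation calls in t13.cpp above would simply become:

```c
/* default in-order queues; a similar overlap result is expected */
queue1 = clCreateCommandQueue(context, device, 0, NULL);
queue2 = clCreateCommandQueue(context, device, 0, NULL);
```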