OpenCL command queue overlap

cashtrevor · November 3, 2022, 3:45pm

Hello All,

I am running OpenCL code on an NVIDIA RTX 2080.

I have two cl_command_queue objects whose execution I want to overlap.

The kernels in both command queues access the same read-write data on the GPU.

Kernels in command queue #1 only read from the data.
Kernels in command queue #2 read and write the data.

The kernels also have access to OpenCL resources.

I start both command queues at the same time and then synchronize them when their work is finished.

As it is, however, the driver waits until command queue #1 is done before starting command queue #2.

I have tried declaring two separate cl_mem objects (one read-write and one read-only), but I get a runtime assert when I do that.

What is the requirement for command queues to execute in parallel and act on the same global memory?

Thanks in Advance,
-Trevor

Robert_Crovella · November 3, 2022, 8:30pm

If you have launched kernels in one queue and launched kernels in the other queue, without any other information, there is no reason to conclude that those kernels would either overlap or that they would run in an interleaved fashion.

It’s a plausible outcome for the driver to run one queue to completion before starting the other.

If you desire some other ordering, then the way to ensure ordering is to launch things into the same queue, not different queues.

If you desire overlap of kernels from queue1 with kernels from queue2, that may not be possible - the GPU does not necessarily have enough capacity to ensure that any 2 given kernels can overlap.

I’m not likely to spend much time discussing this at length without a complete code to study. I find that code answers lots of questions, and discussions are much more difficult without it.

cashtrevor · November 3, 2022, 11:19pm

Hi Robert,

Yes, my goal is to have a kernel in queue1 overlap with kernels in queue2.

I attached a simplified source file (openCL_pipeline.cpp) showing how the pipeline runs.
openCL_pipeline.cpp (2.6 KB)

There are 7 kernels running sequentially in queue1, and 1 kernel running in queue2:

queue1: [action_kernel → preupdate_kernel → preupdate_kernel_2 → game_updatepre1_kernel → update_kernel → update2_kernel → post_update_kernel]

queue2: [gui_kernel]

The Stage1_Begin() function starts the execution of both queues, and Stage1_End() waits for both queues to finish (synchronize).

The gui_kernel can run for 2-15 ms and (on other hardware) usually finishes at about the time update_kernel is running on queue1. As it is, gui_kernel runs first and then action_kernel.

The full source file with buffer creation and context creation is also attached (GameGPUCompute.cpp)
GameGPUCompute.cpp (27.4 KB)
GameGPUCompute.h (4.5 KB)

Robert_Crovella · November 9, 2022, 11:25pm

I don’t know of any “special” requirements to have overlap of two kernels issued to two separate command queues. Here is a simple example:

$ cat t13.cpp
#include <CL/opencl.h>
#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>
#include <stdlib.h>

const char source[] =
"__kernel void test_rotate(__global ulong *d_count, ulong loops, ulong patt)"
"{"
"  ulong n = patt;"
"  for (ulong i = 0; i<loops; i++)"
"    n &= (107 << (patt+(i%7)));"
"  d_count[0] = n + loops;"
"}"
;

int main(int argc, char *argv[])
{
  cl_platform_id platform;
  cl_device_id device;
  cl_context context;
  cl_command_queue queue1, queue2;
  cl_program program;
  cl_mem mem1, mem2;
  cl_kernel kernel;

  bool two_kernels = false;
  cl_ulong loops = 1000;   // cl_ulong matches the kernel's ulong argument
  if (argc > 1) loops *= atoi(argv[1]);
  if (argc > 2) two_kernels = true;
  if (two_kernels) printf("running two kernels\n");
  else printf("running one kernel\n");
  printf("running  %" PRIu64 " loops\n", loops);
  cl_ulong pattern = 1;
  clGetPlatformIDs(1, &platform, NULL);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
  context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
  queue1 = clCreateCommandQueue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);
  queue2 = clCreateCommandQueue(context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);

  const char *sources[1] = {source};
  program = clCreateProgramWithSource(context, 1, sources, NULL, NULL);
  clBuildProgram(program, 1, &device, NULL, NULL, NULL);
  mem1 = clCreateBuffer(context, CL_MEM_READ_WRITE, 1*sizeof(cl_ulong), NULL, NULL);
  mem2 = clCreateBuffer(context, CL_MEM_READ_WRITE, 1*sizeof(cl_ulong), NULL, NULL);
  kernel = clCreateKernel(program, "test_rotate", NULL);
  const size_t work_size[1] = {1};
  clSetKernelArg(kernel, 0, sizeof(mem1), &mem1);
  clSetKernelArg(kernel, 1, sizeof(loops), &loops);
  clSetKernelArg(kernel, 2, sizeof(pattern), &pattern);

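  // First kernel goes to queue1 and writes its result into mem1.
  clEnqueueNDRangeKernel(queue1, kernel, 1, NULL, work_size, work_size, 0, NULL, NULL);
  if (two_kernels){
    // Argument values are captured at enqueue time, so the same kernel object
    // can be pointed at mem2 and enqueued again on queue2.
    clSetKernelArg(kernel, 0, sizeof(mem2), &mem2);
    clSetKernelArg(kernel, 1, sizeof(loops), &loops);
    clSetKernelArg(kernel, 2, sizeof(pattern), &pattern);

    clEnqueueNDRangeKernel(queue2, kernel, 1, NULL, work_size, work_size, 0, NULL, NULL);
  }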
  cl_ulong *buf1 = (cl_ulong *)clEnqueueMapBuffer(queue1, mem1, true, CL_MAP_READ, 0, 1*sizeof(cl_ulong), 0, NULL, NULL, NULL);
  cl_ulong *buf2 = (cl_ulong *)clEnqueueMapBuffer(queue2, mem2, true, CL_MAP_READ, 0, 1*sizeof(cl_ulong), 0, NULL, NULL, NULL);
  printf("result1: %" PRIu64 "\n", buf1[0]);
  printf("result2: %" PRIu64 "\n", buf2[0]);
  clEnqueueUnmapMemObject(queue1, mem1, buf1, 0, NULL, NULL);
  clEnqueueUnmapMemObject(queue2, mem2, buf2, 0, NULL, NULL);
  return 0;
}
$ g++ t13.cpp -o t13 -I/usr/local/cuda/include -lOpenCL
$ time ./t13
running one kernel
running  1000 loops
result1: 1000
result2: 0

real    0m0.452s
user    0m0.049s
sys     0m0.384s
$ time ./t13 100000
running one kernel
running  100000000 loops
result1: 100000000
result2: 0

real    0m4.236s
user    0m2.620s
sys     0m1.599s
$ time ./t13 100000 1
running two kernels
running  100000000 loops
result1: 100000000
result2: 100000000

real    0m4.199s
user    0m2.673s
sys     0m1.514s
$

(CUDA 11.4, CentOS 7, Tesla V100, driver 470.57.02)
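
A side note on the printed results: with patt = 1, the first loop iteration computes 1 & (107 << 1) = 1 & 214 = 0, so n stays 0 for the rest of the loop and the kernel stores exactly the loop count. That is why result1 and result2 equal the number of loops whenever the corresponding kernel actually ran. Below is a minimal host-side sketch of the same arithmetic. It assumes patt is small enough that the shift stays within the range of an int, and emulate_test_rotate is a hypothetical helper name that does not appear in the original code.

```cpp
#include <cstdint>

// Host-side emulation of the test_rotate kernel body, for illustration only.
// emulate_test_rotate is a hypothetical helper, not part of the original post.
// Assumption: patt is small (the post uses patt == 1), so the shift count
// stays in the range 1..7 and the shifted value fits in an int.
std::uint64_t emulate_test_rotate(std::uint64_t loops, std::uint64_t patt) {
    std::uint64_t n = patt;
    for (std::uint64_t i = 0; i < loops; i++) {
        // Same expression as the kernel: 107 is an int, so the shift is
        // performed in int and then widened for the bitwise AND.
        n &= static_cast<std::uint64_t>(107 << (patt + (i % 7)));
    }
    return n + loops;  // the value the kernel stores in d_count[0]
}
```

Calling emulate_test_rotate(1000, 1) gives 1000, matching the "result1: 1000" line in the first run.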

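The one-kernel run with 100000000 loops takes about 4.2 s of wall-clock time, and the two-kernel run with the same loop count also takes about 4.2 s rather than roughly double that. I conclude from the above test case that the two kernels are running concurrently, i.e. overlapped.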
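
The timing above is a wall-clock measurement of the whole process. A more direct check, sketched here under assumptions, is to compare per-kernel start and end timestamps, for example the values that clGetEventProfilingInfo reports for CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END when the queues are created with CL_QUEUE_PROFILING_ENABLE and an event is requested from each clEnqueueNDRangeKernel call. The KernelSpan struct and overlap_ns function below are hypothetical names for illustration; only the interval arithmetic is shown, not the OpenCL calls that would fill in the timestamps.

```cpp
#include <algorithm>
#include <cstdint>

// Start/end timestamps of one kernel execution in nanoseconds.
// Hypothetical struct: in a real program the two fields would be filled from
// clGetEventProfilingInfo (CL_PROFILING_COMMAND_START / CL_PROFILING_COMMAND_END).
struct KernelSpan {
    std::uint64_t start_ns;
    std::uint64_t end_ns;
};

// Returns how long (in ns) the two kernel executions ran at the same time.
// A result of 0 means one finished before the other started (serialized).
std::uint64_t overlap_ns(const KernelSpan& a, const KernelSpan& b) {
    const std::uint64_t latest_start = std::max(a.start_ns, b.start_ns);
    const std::uint64_t earliest_end = std::min(a.end_ns, b.end_ns);
    return earliest_end > latest_start ? earliest_end - latest_start : 0;
}
```

For the two-kernel run, an overlap close to the full kernel duration would confirm concurrency, while an overlap of 0 would indicate the serialized behavior described in the original question.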

Note: Although I happened to have CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE set on both queues in the above example, that isn’t necessary for this simple demonstration, and you’ll get a similar result if you replace those flags with 0.
