launch kernels in parallel?

Hi!

I tried to launch two kernels in parallel by calling “clEnqueueNDRangeKernel” for both kernels and expected the function to return immediately after the kernel has been enqueued, but “clEnqueueNDRangeKernel” only returns after the kernel has completed. Obviously no parallel invocation is possible this way. The command queue has the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property set, and this property is supported by the device (GTX 280) according to “clGetDeviceInfo”. The read/write/map functions have a flag to select blocking or non-blocking operation; is there a similar mechanism to enforce a non-blocking kernel launch?
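
The call pattern I use looks roughly like this (a stripped-down sketch rather than the full program; kernelA/kernelB, the work size and the missing error handling are placeholders):

[codebox]/* Stripped-down sketch of the call pattern (error checking omitted;
   kernelA/kernelB and the work size are placeholders). */
cl_int err;
cl_command_queue queue = clCreateCommandQueue(context, device,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_PROFILING_ENABLE, &err);

size_t global = 1024 * 1024;
cl_event ev[2];

/* I expected both of these calls to return immediately... */
err = clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &global, NULL, 0, NULL, &ev[0]);
err = clEnqueueNDRangeKernel(queue, kernelB, 1, NULL, &global, NULL, 0, NULL, &ev[1]);

/* ...and to block only here. */
clWaitForEvents(2, ev);[/codebox]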

Thanks & kind regards,
Markus

To my knowledge, there is no such flag because these calls shouldn’t ever block. I don’t know why you’re seeing this behavior. Out-of-order queueing isn’t necessary, either.

Very interesting… can you confirm that a kernel call (clEnqueueNDRangeKernel) is non-blocking on your system? Which platform are you using?

Thanks,

Markus

A simple example to demonstrate this behavior is attached. It uses the OpenCL event-profiling API to record the timing (a sketch of the profiling queries follows the listing), and it is evident from the program’s output that processing is strictly sequential (the numbers are nanoseconds after queuing the first kernel):

[codebox]kernel 0:
command queued: 0
command submit: 0
command start: 34496
command end: 925968896

kernel 1:
command queued: 927198144
command submit: 927198144
command start: 927212640
command end: 1853145888[/codebox]
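
The timestamps above come from the standard event-profiling queries; a minimal sketch of what the attached program does (not the exact code from the attachment) is:

[codebox]/* Minimal sketch of the profiling queries; all values are reported relative
   to the queue timestamp of the first kernel. */
cl_ulong base, queued, submit, start, end;
clGetEventProfilingInfo(ev[0], CL_PROFILING_COMMAND_QUEUED, sizeof(base), &base, NULL);

for (int i = 0; i < 2; ++i) {
    clGetEventProfilingInfo(ev[i], CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, NULL);
    clGetEventProfilingInfo(ev[i], CL_PROFILING_COMMAND_SUBMIT, sizeof(submit), &submit, NULL);
    clGetEventProfilingInfo(ev[i], CL_PROFILING_COMMAND_START,  sizeof(start),  &start,  NULL);
    clGetEventProfilingInfo(ev[i], CL_PROFILING_COMMAND_END,    sizeof(end),    &end,    NULL);
    printf("kernel %d:\n command queued: %llu\n command submit: %llu\n"
           " command start: %llu\n command end: %llu\n", i,
           (unsigned long long)(queued - base), (unsigned long long)(submit - base),
           (unsigned long long)(start - base),  (unsigned long long)(end - base));
}[/codebox]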

Kernel 1 gets queued only after kernel 0 has completed. So I have two questions:

*) Does OpenCL support parallel execution of different kernels (or more than one instance of the same kernel as in this example)?

*) If yes, how can two kernels be launched to execute concurrently?

Thanks & kind regards,

Markus
concurrency.zip (1.11 KB)

I’m not sure about this, but I would be surprised if it were possible to launch concurrent kernels on GPUs. You can’t do that in CUDA. Unless you mean just async queuing?

Yes, but is this a restriction of the CUDA API/driver or the underlying hardware? In the latter case, it obviously won’t work in OpenCL either.

Asynchronous queuing would be useful even if the kernels are processed sequentially, but in the example I posted above, even the queuing is sequential. On the other hand, a CUDA kernel can be launched asynchronously, so I expected the same to be possible in OpenCL (and hoped that it could even run kernels in parallel).

Kind regards,

Markus

The restriction on concurrent kernel execution is imposed by the hardware, AFAIK.

You should be able to queue kernels asynchronously, if I read the specs correctly, but don’t ask me how; I’m a newbie to OpenCL :)

Yes, I also can’t find any blocking requirement for clEnqueueNDRangeKernel in the specs. It seems like a bug in Nvidia’s OpenCL implementation, or am I overlooking something?

Kind regards,

Markus

Yeah, I can demonstrate that enqueuing a kernel in OpenCL doesn’t block. I actually have to wait on the returned event before I copy memory, otherwise I get stale data. I can’t say what I’m using, but it isn’t the NVIDIA implementation.
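
Roughly, the pattern looks like this (a sketch with placeholder buffer names, not my actual code):

[codebox]/* Sketch: the enqueue itself does not block, so the read-back has to be
   synchronized with the kernel explicitly (result_buf, result_size and
   host_ptr are placeholders). */
cl_event kernel_done;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, &kernel_done);

/* ...other host-side work can happen here... */

/* Either wait on the event explicitly before copying... */
clWaitForEvents(1, &kernel_done);
clEnqueueReadBuffer(queue, result_buf, CL_TRUE, 0, result_size, host_ptr, 0, NULL, NULL);

/* ...or pass it in the read's wait list instead:
   clEnqueueReadBuffer(queue, result_buf, CL_TRUE, 0, result_size, host_ptr,
                       1, &kernel_done, NULL); */[/codebox]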

Concurrent execution on a single GPU does not apply to two kernels but to executing a kernel in parallel with a memory transfer operation,
e.g. running a ‘clEnqueueNDRangeKernel’ and a ‘clEnqueueCopyBuffer’ command in parallel.

I did not manage to demonstrate this behavior on my machine.

Did anybody manage to demonstrate this concurrent execution?

[edit] - After a few tries I managed to perform concurrent copies.
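
The pattern usually suggested for this kind of overlap is two command queues on the same device, one for compute and one for transfers, roughly like this (a sketch with placeholder kernel/buffer names, not the exact code I tested):

[codebox]/* Sketch: separate queues for compute and copy on the same device, so the
   copy can (in principle) overlap the kernel. Buffers and kernel are
   placeholders; error checking omitted. */
cl_command_queue compute_q = clCreateCommandQueue(context, device, 0, &err);
cl_command_queue copy_q    = clCreateCommandQueue(context, device, 0, &err);

cl_event kernel_ev, copy_ev;
clEnqueueNDRangeKernel(compute_q, kernel, 1, NULL, &global, NULL, 0, NULL, &kernel_ev);
clEnqueueCopyBuffer(copy_q, src_buf, dst_buf, 0, 0, copy_bytes, 0, NULL, &copy_ev);

/* Flush both queues so the driver actually submits the work... */
clFlush(compute_q);
clFlush(copy_q);

/* ...then wait for both commands to finish. */
cl_event evs[2] = { kernel_ev, copy_ev };
clWaitForEvents(2, evs);[/codebox]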

Hi All,

I’ve hit this same problem with version 3 of the CUDA Toolkit and GPU Computing SDK (my driver version is 197.16) - it seems that clEnqueueNDRangeKernel blocks until the kernel has completed execution. This makes it impossible for a single thread on the host to enqueue work for multiple devices.
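
For illustration, this is the single-thread pattern that the blocking behaviour breaks (a sketch; queue[], kernel[] and the work size are placeholders):

[codebox]/* Sketch: with a non-blocking enqueue, one host thread could drive several
   devices like this. With the blocking behaviour described above, device 1
   only starts after device 0 has finished. */
cl_event done[2];
for (int d = 0; d < 2; ++d)
    clEnqueueNDRangeKernel(queue[d], kernel[d], 1, NULL, &global, NULL, 0, NULL, &done[d]);
for (int d = 0; d < 2; ++d)
    clFlush(queue[d]);
clWaitForEvents(2, done);[/codebox]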

In the Jumpstart Guide (http://developer.download.nvidia.com/OpenCL/NVIDIA_OpenCL_JumpStart_Guide.pdf) on page 13 it states, “Both kernel launch functions (CUDA and OpenCL) are asynchronous, i.e. they return immediately after scheduling the kernel to be executed on the GPU.” It appears that this statement is incorrect.

If anyone from NVidia is reading, are you planning on changing this behavior in subsequent releases? That is, can we expect to be able to enqueue OpenCL kernels for asynchronous execution in later versions of your OpenCL implementation?

Thanks,

Dan

Just a little more evidence that clEnqueueNDRangeKernel() blocks: I modified the oclSimpleMultiGPU sample provided in the GPU Computing SDK so that it reports command start and end times when profiling is enabled, rather than just the execution duration. I also modified the kernel code so that it takes much longer to execute. The output when using two GPUs shows quite clearly that the kernel for the second GPU does not start executing until the kernel running on the first GPU completes its calculation. The relevant output is as follows (the times shown are in microseconds; the three numbers are, in order of appearance, start, end and duration):

[codebox]

Profiling Information for GPU Processing:

Device 0 : GeForce GTX 260M

Reduce Kernel : 189901803 222986345 33084542 us

Copy Device->Host : 222987730 256593873 33606143 us

Device 1 : GeForce GTX 260M

Reduce Kernel : 227082505 260682302 33599797 us

Copy Device->Host : 260684612 260684871 258 us

[/codebox]

Notice that on Device 1, the kernel start time (227,082,505) is greater than the kernel end time for Device 0 (222,986,345). Furthermore, watching the pretty graphs produced by GPU-Z (on the Sensors page), you can see that the times the GPUs are under load are mutually exclusive.

I’m pretty unimpressed that NVidia supplied an example of using multiple GPUs that does not achieve concurrent execution of kernels on separate devices. What exactly is meant to be demonstrated by the example?

Given the blocking nature of clEnqueueNDRangeKernel(), what strategies are used by others to achieve concurrent execution of kernels on separate devices?
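
The only workaround I can think of while the enqueue blocks is one host thread per device, each driving its own command queue, roughly like this (a pthreads sketch; the context/kernel setup is omitted and all names are placeholders):

[codebox]#include <pthread.h>
#include <CL/cl.h>

/* Sketch of a one-host-thread-per-device workaround: even if
   clEnqueueNDRangeKernel blocks, each device has its own thread, so the
   kernels can still run at the same time. Setup code omitted. */
typedef struct {
    cl_command_queue queue;   /* queue created for this device */
    cl_kernel        kernel;  /* kernel instance for this device */
    size_t           global;  /* global work size */
} device_work;

static void *run_kernel(void *arg)
{
    device_work *w = (device_work *)arg;
    clEnqueueNDRangeKernel(w->queue, w->kernel, 1, NULL, &w->global, NULL, 0, NULL, NULL);
    clFinish(w->queue);   /* blocking is harmless here, the thread is dedicated */
    return NULL;
}

/* One thread per device (num_devices assumed <= 16). */
void launch_on_all_devices(device_work *work, int num_devices)
{
    pthread_t threads[16];
    for (int i = 0; i < num_devices; ++i)
        pthread_create(&threads[i], NULL, run_kernel, &work[i]);
    for (int i = 0; i < num_devices; ++i)
        pthread_join(threads[i], NULL);
}[/codebox]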

Cheers,

Dan

Hi all,

I’ve had some discussion over at the Khronos forums and the consensus is that a blocking clEnqueueNDRangeKernel is a bug. See this thread:

[url="http://www.khronos.org/message_boards/viewtopic.php?f=28&t=1990"]http://www.khronos.org/message_boards/view...f=28&t=1990[/url]

And this post in that thread, where it was suggested that, “As far as your comment on being surprised that clEnqueueNDRangeKernel on some implementations is blocking, I would suggest that you file a bug and work with the vendor in question to resolve this issue. It is certainly the intent of the spec and I know more than one implementation where this is not the case.”

[url="http://www.khronos.org/message_boards/viewtopic.php?p=7321#p7321"]http://www.khronos.org/message_boards/view...hp?p=7321#p7321[/url]

I could not work out where to file a bug report - is this forum the place to do it, or is there an issue tracking system that I should be using?

Cheers,

Dan

If you look at the Tesla info sheet on page 2 you will see:

  • NVIDIA GIGATHREAD ENGINE
    Maximizes the throughput by faster context switching that is 10X faster than previous architecture, concurrent kernel execution, and improved thread block scheduling.

Source: [url=“http://www.nvidia.com/docs/IO/43395/NV_DS_...final_lores.pdf”]http://www.nvidia.com/docs/IO/43395/NV_DS_...final_lores.pdf[/url]

I would not change to out of order execution lightly without testing on a GF100, I think.

No one is changing anything, except maybe NVidia to fall in line with the OpenCL 1.0 spec. clEnqueueNDRangeKernel must not block - though not stated explicitly, I am convinced that this can be deduced by reading various parts of the spec. That is, unless NVidia does not want to support concurrent execution of kernels on separate devices??

Cheers,

Dan

I’ve searched the Khronos and NVIDIA forums. There are several rumors about how concurrent kernel execution works in OpenCL, but most discuss how data transfer and kernel execution can run concurrently, or how to run multiple GPU devices concurrently. The concurrent kernel execution in the Fermi white paper means multiple kernels running on one device. As I have many small kernels throughout a main algorithm, it would help immensely if the GPU did not have to wait serially for each small kernel.

I haven’t found it, but has OpenCL 1.1 (or the latest version) included this feature yet?

My setup:

[codebox]============================================================
Platform ‘NVIDIA CUDA’
============================================================
EXTENSIONS: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll
NAME: NVIDIA CUDA
PROFILE: FULL_PROFILE
VENDOR: NVIDIA Corporation
VERSION: OpenCL 1.0 CUDA 3.1.1

Device ‘GeForce GTX 470’

MAX_COMPUTE_UNITS: 14
…[/codebox]

According to the whitepaper, the maximum number of concurrent kernels should be 16 (the same as the number of SMs?), so the GTX 470 would allow 14 kernels.

Testing with 2 kernels. Trying OUT_OF_ORDER made no difference.

The kernel call (clEnqueueNDRangeKernel) and read_buffer are indeed non-blocking (blocking is an optional flag that defaults to non-blocking). I tested this by querying event1.profile.end at various times (it throws an error if event1 has not finished by then). I observed that a kernel may finish before or after the next kernel starts, but for equally sized kernels they always finish in order.

I don’t know how accurate event.profile is, but measuring real time with Python’s time(), 2 kernels take exactly 2x the time of 1 kernel (up to 1% variation). I tried various combinations: 2 kernels on 1 queue, on 2 queues, in 2 contexts with 2 kernels, and with wait/no-wait. Asking all CL functions to wait is perhaps <0.5% slower than no waits, but usually both are identical. I’m not sure why, since none of the functions block. Perhaps the data is too small/fast.
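
For reference, the equivalent host-side wall-clock test in plain C looks roughly like this (a sketch, not the actual pyopencl script; the kernel and queues are placeholders):

[codebox]#include <stdio.h>
#include <time.h>
#include <CL/cl.h>

/* Sketch of the wall-clock comparison: if the device really ran the two
   kernels concurrently, this would take well under 2x the single-kernel
   time. Kernel/buffer setup is omitted. */
static double elapsed_s(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
}

double time_two_kernels(cl_command_queue q0, cl_command_queue q1,
                        cl_kernel kernel, size_t global)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    clEnqueueNDRangeKernel(q0, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q1, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFlush(q0);
    clFlush(q1);
    clFinish(q0);
    clFinish(q1);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_s(t0, t1);
}[/codebox]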

The input is 100k 3D points, and the kernel has a for loop of 20k repeats of the same L2-norm operation. That’s 1E09 distance operations. The 20k repeats don’t use any extra data space, but increasing the repeats or the points causes Fermi to black out and crash the driver (exceeded memory?). 1 kernel takes about 1.71 seconds.

The other combinations behave similarly, adding more time overhead the higher up the hierarchy you go. I even hastily tried 2 separate Python interpreters to guarantee thread independence: 2 kernels should take 3.4 seconds, and 2 Python processes running 4 kernels take 7 seconds. The 2nd Python process starts between 1 and 2 seconds after the 1st.

Apart from some simplifications, pyopencl is a one-to-one function wrapper around the same C API, and supposedly never blocks until the wait() before printing results. From all this it’s clear that, at the least, kernel execution is serial on one device. Perhaps this happens internally in the GPU/driver as one queue, as a poster suggested in another topic.

Can a Nvidia developer point to a working sample code of parallel kernels on one device? Or how to use this Fermi feature?

Probably the watchdog timer; on my Windows 7 machine it was set to 2 seconds by default.

A kernel running for more than a second is definitely long enough to be significant and not overshadowed by API overheads.

I don’t know how to execute kernels concurrently on a single Fermi. I’m not even sure that’s possible in the current implementation of OpenCL. If you find a way, let me know please.