Kernel + buffer reads/writes at the same time Asynchronous kernel execution and host-device data tra

Hi all,
Is it possible to run a kernel and perform buffer reads and/or writes at the same time on the same GPU?
If so, how can I synchronize these operations?

In my application I have lots of data and lots of host-device memory transfers. I’d like to implement a sort of pipeline, so I can hide the memory transfer latency behind computation.

Thanks a lot

I’ve been looking to hide transfer latencies in my OpenCL application for quite a while and I must say there is not much information regarding that subject. I finally succeeded yesterday. Here are some rules you need to follow:

1- You must use multiple command queues. Tasks in a single command queue are executed one at a time (whether the queue is in-order or out-of-order).

2- Your transfers must be to/from pinned host memory. This allows the GPU to access it directly over the PCIe bus using DMA.

3- Using the built-in profiling features (i.e. running the “performance visual profiler”) prevents transfers and kernels from executing in parallel. That was my main issue…

4- You must use events to explicitly define synchronization between tasks executed on different command queues. Also, you may activate out-of-order execution for a command queue. In that case, you must also take care to define synchronization between the tasks in that queue.

I hope it helps.


thanks a lot for your answer!
Really useful information.

What architecture did you do this on? Fermi lists async bus transfer as a capability. If you did this on pre-Fermi hardware, then maybe it can be done in more ways.

I’m using a Tesla C1060 (“pre-Fermi”). As far as I know, the only difference between pre-Fermi and Fermi (from a memory transfer point of view) is that Fermi cards have 2 DMA engines. This would allow 2 memory transfers at the same time while executing a kernel, rather than just one. In that case, I presume you would need 3 command queues.
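To see what the second DMA engine could buy, here is a small back-of-the-envelope model. It is an idealized steady-state estimate under my own assumptions (every chunk has the same upload, kernel and download time, and there is no scheduling overhead), not a measurement; the function name `time_per_chunk` is made up for this example.

```c
/* Larger of two durations. */
static double max2(double a, double b) { return a > b ? a : b; }

/* Idealized steady-state cost per chunk, in the same unit as the inputs.
 *   copy_engines <= 0 : no overlap, everything runs back to back
 *   copy_engines == 1 : uploads and downloads share one DMA engine,
 *                       kernels run alongside it
 *   copy_engines >= 2 : upload, kernel and download each have their own
 *                       engine (the 3-queue case) */
double time_per_chunk(double upload, double kernel, double download,
                      int copy_engines)
{
    if (copy_engines <= 0)
        return upload + kernel + download;
    if (copy_engines == 1)
        return max2(upload + download, kernel);
    return max2(max2(upload, download), kernel);
}
```

For example, with 2 ms of upload, 3 ms of kernel time and 2 ms of download per chunk, the model gives 7 ms per chunk without overlap, 4 ms with one DMA engine and 3 ms with two. In the last case the transfers are completely hidden behind the kernel.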

Also note that some older cards supporting GPGPU do not support async transfers during kernel execution. I think there is a function call to confirm that the feature is present…


Thanks, pobelzile, for your hints. However, it’s still unclear to me how to force two different tasks from two different queues to execute concurrently. If I put one task in the event_wait_list of the other, they’ll be executed serially. If I leave event_wait_list empty, the tasks may be executed at any time and in any order, not necessarily in parallel. Or should I just use clWaitForEvents() and wait for both tasks to finish?

eyebex, do you mean concurrently instead of synchronously?
There is no way to force concurrent transfer and kernel execution since a device might not support it.

Yes, sorry, I fixed that.

I understand, but just like for pinned memory transfers, I’d like to know what I need to do to make it likely that the NVIDIA driver performs the data transfer in parallel with kernel execution if the hardware supports it.

I realized the GPU Computing SDK 3.2 has a sample called “oclCopyComputeOverlap” which seems to answer all of my questions. The key is to issue explicit calls to clFlush() at the right places for the queues that contain the copy and compute items, respectively.