I’ve been looking for quite a while into how to hide transfer latencies in my OpenCL application, and I must say there is not much information on the subject. I finally succeeded yesterday. Here are the rules you need to follow:
1- You must use multiple command queues. Tasks in a single command queue are executed one at a time (whether the queue is in-order or out-of-order).
2- Your transfers must be to/from pinned host memory. This allows the GPU to access it directly over the PCIe bus using DMA.
3- Using the built-in profiling features (i.e., running the performance visual profiler) will prevent transfers and kernels from executing in parallel. That was my main issue…
4- You must use events to explicitly define synchronization between tasks executed on different command queues. You may also enable out-of-order execution for a command queue; in that case, you must also take care to define synchronization between tasks within that queue.
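Put together, rules 1, 2 and 4 look roughly like this. This is only a sketch, not complete code: `ctx`, `dev`, `kernel`, `gsize` and `nbytes` are assumed to already exist, and error checking is omitted. (And per rule 3, don’t run it under the profiler.)

```c
/* Sketch only: assumes ctx, dev, kernel, gsize and nbytes are set up. */
cl_int err;
cl_command_queue q_copy = clCreateCommandQueue(ctx, dev, 0, &err); /* rule 1 */
cl_command_queue q_exec = clCreateCommandQueue(ctx, dev, 0, &err);

/* Rule 2: pinned host memory via CL_MEM_ALLOC_HOST_PTR, then map it
   to get a host pointer the DMA engine can reach directly. */
cl_mem pinned = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                               nbytes, NULL, &err);
float *host_ptr = (float *)clEnqueueMapBuffer(q_copy, pinned, CL_TRUE,
                                              CL_MAP_READ | CL_MAP_WRITE,
                                              0, nbytes, 0, NULL, NULL, &err);
cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, &err);

/* Rule 4: an event makes the kernel (q_exec) wait for the upload (q_copy);
   any independent work enqueued on q_copy afterwards may overlap the kernel. */
cl_event upload_done;
clEnqueueWriteBuffer(q_copy, dev_buf, CL_FALSE /* async */, 0, nbytes,
                     host_ptr, 0, NULL, &upload_done);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_buf);
clEnqueueNDRangeKernel(q_exec, kernel, 1, NULL, &gsize, NULL,
                       1, &upload_done, NULL);
clFlush(q_copy);
clFlush(q_exec);
```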
I’m using a Tesla C1060 (pre-Fermi). As far as I know, the only difference between pre-Fermi and Fermi cards from a memory-transfer point of view is that Fermi cards have two DMA engines. This would allow two memory transfers at the same time while executing a kernel, rather than just one. In that case, I presume you would need three command queues.
Also note that some older cards that support GPGPU do not support asynchronous transfers during kernel execution. I think there is a function call to confirm that the feature is present…
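If I remember correctly, on NVIDIA hardware this check goes through the `cl_nv_device_attribute_query` extension. A sketch (you should first verify the extension appears in the device’s `CL_DEVICE_EXTENSIONS` string):

```c
/* Sketch: ask whether the device can overlap transfers with kernels.
   CL_DEVICE_GPU_OVERLAP_NV comes from the cl_nv_device_attribute_query
   extension; define the token ourselves in case the header lacks it. */
#ifndef CL_DEVICE_GPU_OVERLAP_NV
#define CL_DEVICE_GPU_OVERLAP_NV 0x4004
#endif

cl_bool overlap = CL_FALSE;
if (clGetDeviceInfo(dev, CL_DEVICE_GPU_OVERLAP_NV,
                    sizeof(overlap), &overlap, NULL) == CL_SUCCESS && overlap)
    printf("device supports copy/compute overlap\n");
```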
Thanks, pobelzile, for your hints. However, it’s still unclear to me how to force two tasks from two different queues to execute concurrently. If I put one task in the event_wait_list of the other, they’ll be executed serially. If I leave event_wait_list empty, the tasks may be executed at any time and in any order, not necessarily in parallel. Or should I just use clWaitForEvents() and wait for both tasks to finish?
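For reference, here is the clWaitForEvents() variant I have in mind (a sketch reusing hypothetical queue and buffer names from earlier in the thread): the wait lists stay empty so the driver is free to schedule the two tasks concurrently, and only the host blocks on both events.

```c
/* Sketch: two independent tasks in separate queues with no cross-queue
   wait list. Whether they actually overlap is up to the driver/hardware;
   the host just waits for both to finish. */
cl_event ev[2];
clEnqueueWriteBuffer(q_copy, other_buf, CL_FALSE, 0, nbytes,
                     host_ptr, 0, NULL, &ev[0]);
clEnqueueNDRangeKernel(q_exec, kernel, 1, NULL, &gsize, NULL,
                       0, NULL, &ev[1]);
clFlush(q_copy);           /* submit both queues to the device */
clFlush(q_exec);
clWaitForEvents(2, ev);    /* host-side barrier on both tasks */
clReleaseEvent(ev[0]);
clReleaseEvent(ev[1]);
```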
I understand, but just as with pinned-memory transfers, I’d like to know what I need to do to make it likely that the NVIDIA driver performs a data transfer in parallel with kernel execution when the hardware supports it.
I realized the GPU Computing SDK 3.2 has a sample called “oclCopyComputeOverlap” that seems to answer all of my questions. The key is to issue explicit clFlush() calls at the right places on the queues that hold the copy and compute items, respectively.
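The gist of the sample, as I read it, is a staged double-buffering loop (a sketch with hypothetical names, not the SDK code itself): while the kernel crunches chunk i, the copy queue uploads chunk i+1, and the clFlush() calls are what actually kick both queues off together.

```c
/* Sketch: dev_buf[2] is a pair of device buffers, host_ptr is pinned host
   memory, and the kernel processes one chunk per launch. The flushes let
   the driver start the DMA transfer and the kernel at the same time. */
for (int i = 0; i < nchunks; ++i) {
    if (i + 1 < nchunks)                     /* upload the next chunk */
        clEnqueueWriteBuffer(q_copy, dev_buf[(i + 1) % 2], CL_FALSE, 0,
                             chunk_bytes, host_ptr + (i + 1) * chunk_elems,
                             0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_buf[i % 2]);
    clEnqueueNDRangeKernel(q_exec, kernel, 1, NULL, &gsize, NULL,
                           0, NULL, NULL);   /* compute the current chunk */
    clFlush(q_copy);    /* start the DMA transfer */
    clFlush(q_exec);    /* start the kernel */
    clFinish(q_copy);   /* sync before reusing the buffers next iteration */
    clFinish(q_exec);
}
```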