I’ve been looking for quite a while into how to hide transfer latencies in my OpenCL application, and I must say there is not much information on the subject. I finally succeeded yesterday. Here are the rules you need to follow:
1- You must use multiple command queues. Tasks in a single command queue are executed one at a time (whether the queue is in-order or out-of-order).
2- Your transfers must be to/from pinned host memory. This allows the GPU to access it directly over the PCIe bus using DMA.
3- Using the built-in profiling features (i.e., running the performance visual profiler) will prevent transfers and kernels from executing in parallel. That was my main issue…
4- You must use events to explicitly define synchronization between tasks executed on different command queues. You may also enable out-of-order execution for a command queue; in that case, you must also take care to define synchronization between tasks within that queue.
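Put together, rules 1, 2 and 4 look roughly like this. This is only a sketch, not complete code: `ctx`, `dev`, `kernel`, `gsize` and `nbytes` are assumed to already exist, and error checking is omitted. (And per rule 3, don’t run it under the profiler.)

```c
/* Sketch only: assumes ctx, dev, kernel, gsize and nbytes are set up. */
cl_int err;
cl_command_queue q_copy = clCreateCommandQueue(ctx, dev, 0, &err); /* rule 1 */
cl_command_queue q_exec = clCreateCommandQueue(ctx, dev, 0, &err);

/* Rule 2: pinned host memory via CL_MEM_ALLOC_HOST_PTR, then map it
   to get a host pointer the DMA engine can reach directly. */
cl_mem pinned = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                               nbytes, NULL, &err);
float *host_ptr = (float *)clEnqueueMapBuffer(q_copy, pinned, CL_TRUE,
                                              CL_MAP_READ | CL_MAP_WRITE,
                                              0, nbytes, 0, NULL, NULL, &err);
cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, &err);

/* Rule 4: an event makes the kernel (q_exec) wait for the upload (q_copy);
   any independent work enqueued on q_copy afterwards may overlap the kernel. */
cl_event upload_done;
clEnqueueWriteBuffer(q_copy, dev_buf, CL_FALSE /* async */, 0, nbytes,
                     host_ptr, 0, NULL, &upload_done);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_buf);
clEnqueueNDRangeKernel(q_exec, kernel, 1, NULL, &gsize, NULL,
                       1, &upload_done, NULL);
clFlush(q_copy);
clFlush(q_exec);
```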
I’m using a Tesla C1060 (pre-Fermi). As far as I know, the only difference between pre-Fermi and Fermi cards from a memory-transfer point of view is that Fermi cards have two DMA engines. This would allow two memory transfers at the same time while executing a kernel, rather than just one. In that case, I presume you would need three command queues.
Also note that some older cards that support GPGPU do not support asynchronous transfers during kernel execution. I think there is a function call to confirm that the feature is present…
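If I remember correctly, on NVIDIA hardware this check goes through the `cl_nv_device_attribute_query` extension. A sketch (you should first verify the extension appears in the device’s `CL_DEVICE_EXTENSIONS` string):

```c
/* Sketch: ask whether the device can overlap transfers with kernels.
   CL_DEVICE_GPU_OVERLAP_NV comes from the cl_nv_device_attribute_query
   extension; define the token ourselves in case the header lacks it. */
#ifndef CL_DEVICE_GPU_OVERLAP_NV
#define CL_DEVICE_GPU_OVERLAP_NV 0x4004
#endif

cl_bool overlap = CL_FALSE;
if (clGetDeviceInfo(dev, CL_DEVICE_GPU_OVERLAP_NV,
                    sizeof(overlap), &overlap, NULL) == CL_SUCCESS && overlap)
    printf("device supports copy/compute overlap\n");
```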
Thanks, pobelzile, for your hints. However, it’s still unclear to me how to force two tasks from two different queues to execute concurrently. If I put one task in the event_wait_list of the other, they’ll be executed serially. If I leave event_wait_list empty, the tasks may be executed at any time and in any order, not necessarily in parallel. Or should I just use clWaitForEvents() and wait for both tasks to finish?
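For reference, here is the clWaitForEvents() variant I have in mind (a sketch reusing hypothetical queue and buffer names from earlier in the thread): the wait lists stay empty so the driver is free to schedule the two tasks concurrently, and only the host blocks on both events.

```c
/* Sketch: two independent tasks in separate queues with no cross-queue
   wait list. Whether they actually overlap is up to the driver/hardware;
   the host just waits for both to finish. */
cl_event ev[2];
clEnqueueWriteBuffer(q_copy, other_buf, CL_FALSE, 0, nbytes,
                     host_ptr, 0, NULL, &ev[0]);
clEnqueueNDRangeKernel(q_exec, kernel, 1, NULL, &gsize, NULL,
                       0, NULL, &ev[1]);
clFlush(q_copy);           /* submit both queues to the device */
clFlush(q_exec);
clWaitForEvents(2, ev);    /* host-side barrier on both tasks */
clReleaseEvent(ev[0]);
clReleaseEvent(ev[1]);
```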
I understand, but just as with pinned-memory transfers, I’d like to know what I need to do to make it likely that the NVIDIA driver performs a data transfer in parallel with kernel execution when the hardware supports it.
I realized the GPU Computing SDK 3.2 has a sample called “oclCopyComputeOverlap” that seems to answer all of my questions. The key is to issue explicit clFlush() calls at the right places on the queues that hold the copy and compute items, respectively.
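The gist of the sample, as I read it, is a staged double-buffering loop (a sketch with hypothetical names, not the SDK code itself): while the kernel crunches chunk i, the copy queue uploads chunk i+1, and the clFlush() calls are what actually kick both queues off together.

```c
/* Sketch: dev_buf[2] is a pair of device buffers, host_ptr is pinned host
   memory, and the kernel processes one chunk per launch. The flushes let
   the driver start the DMA transfer and the kernel at the same time. */
for (int i = 0; i < nchunks; ++i) {
    if (i + 1 < nchunks)                     /* upload the next chunk */
        clEnqueueWriteBuffer(q_copy, dev_buf[(i + 1) % 2], CL_FALSE, 0,
                             chunk_bytes, host_ptr + (i + 1) * chunk_elems,
                             0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_buf[i % 2]);
    clEnqueueNDRangeKernel(q_exec, kernel, 1, NULL, &gsize, NULL,
                           0, NULL, NULL);   /* compute the current chunk */
    clFlush(q_copy);    /* start the DMA transfer */
    clFlush(q_exec);    /* start the kernel */
    clFinish(q_copy);   /* sync before reusing the buffers next iteration */
    clFinish(q_exec);
}
```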